Getting Started

Start by installing the package via Composer:

composer require clue/utf8-react

The primary entry point is the Utf8 class, which provides static methods for validating and normalizing UTF-8 data. Your first use case is likely validating incoming stream data — for example, from an HTTP request or WebSocket message — to prevent malformed UTF-8 from corrupting your application logic.

Basic example:

$stream = new React\Stream\ThroughStream();
$utf8 = new Clue\React\Utf8\Utf8();

$stream->on('data', function ($chunk) use ($utf8) {
    $valid = $utf8->validate($chunk);
    // Handle validation errors or proceed
});

Check the README.md and examples/ directory in the repository for immediate usage patterns. Start with examples/validate.php and examples/sanitize.php.

Implementation Patterns

Stream sanitization pipeline: Wrap insecure input sources (e.g., React\Http\Server, React\Socket\Connection) with Utf8::sanitize() or Utf8::validate() to normalize or reject invalid UTF-8 before downstream processing.
Example:
```
$connection->on('data', fn($data) => $connection->write($utf8->sanitize($data)));
```
Buffered streaming validation: Use Utf8::createBuffer() for long-running or fragmented input (e.g., large files, continuous logs), which handles partial multibyte characters across chunks:
```
$buffer = $utf8->createBuffer();
$stream->on('data', fn($chunk) => $buffer->write($chunk));
$buffer->on('data', fn($validChunk) => process($validChunk));
```

Graceful error handling: Pair validation with custom error callbacks to log or reject malformed input:

$utf8->validate($data, function ($error) {
    error_log("Invalid UTF-8: $error");
    // optionally drop or sanitize
});

Middleware pattern: In ReactPHP HTTP apps, create a middleware that wraps body parsing to ensure all payloads are UTF-8-safe before reaching route handlers.

Gotchas and Tips

Partial multibyte chunks: Never assume a chunk boundary aligns with character boundaries. Always use Utf8::createBuffer() for streams where data arrives in pieces — direct validation of raw chunks can fail on split characters (e.g., mid-emoji).
Silent sanitization: sanitize() replaces invalid sequences with U+FFFD (�), which may silently hide data loss. Prefer validate() for strict validation when invalid data is unacceptable.
Empty or null chunks: The library handles empty strings safely, but avoid passing null — validate input before calling methods.
Non-string inputs: All methods expect strings. Cast integers, floats, or arrays explicitly to strings to avoid unexpected behavior (e.g., strval($data) or '' . $data).
No BOM handling: The package does not strip or expect a BOM. Strip it manually (ltrim($data, "\xEF\xBB\xBF")) if required.
Performance: Micro-benchmarks show near-zero overhead for valid UTF-8. For high-throughput apps, reuse buffers (via createBuffer()) rather than instantiating new instances per chunk.
Testing tip: Use tools like random_int() + utf8_encode() to generate edge-case test data (e.g., overlong encodings, surrogates, out-of-range codepoints) — the examples folder includes a few test vectors.

Utf8 React Laravel Package

Getting Started

Implementation Patterns

Gotchas and Tips