clue/utf8-react
Async UTF-8 validation and sanitization for ReactPHP. Provides streaming transforms to detect invalid byte sequences and convert or strip broken input, helping keep network and file streams safely UTF-8 in event-driven PHP applications.
Start by installing the package via Composer:
composer require clue/utf8-react
The primary entry point is the Utf8 class, which provides static methods for validating and normalizing UTF-8 data. Your first use case is likely validating incoming stream data — for example, from an HTTP request or WebSocket message — to prevent malformed UTF-8 from corrupting your application logic.
Basic example:
$stream = new React\Stream\ThroughStream();
$utf8 = new Clue\React\Utf8\Utf8();
$stream->on('data', function ($chunk) use ($utf8) {
$valid = $utf8->validate($chunk);
// Handle validation errors or proceed
});
Check the README.md and examples/ directory in the repository for immediate usage patterns. Start with examples/validate.php and examples/sanitize.php.
Stream sanitization pipeline: Wrap insecure input sources (e.g., React\Http\Server, React\Socket\Connection) with Utf8::sanitize() or Utf8::validate() to normalize or reject invalid UTF-8 before downstream processing.
Example:
$connection->on('data', fn($data) => $connection->write($utf8->sanitize($data)));
Buffered streaming validation: Use Utf8::createBuffer() for long-running or fragmented input (e.g., large files, continuous logs), which handles partial multibyte characters across chunks:
$buffer = $utf8->createBuffer();
$stream->on('data', fn($chunk) => $buffer->write($chunk));
$buffer->on('data', fn($validChunk) => process($validChunk));
Graceful error handling: Pair validation with custom error callbacks to log or reject malformed input:
$utf8->validate($data, function ($error) {
error_log("Invalid UTF-8: $error");
// optionally drop or sanitize
});
Middleware pattern: In ReactPHP HTTP apps, create a middleware that wraps body parsing to ensure all payloads are UTF-8-safe before reaching route handlers.
Partial multibyte chunks: Never assume a chunk boundary aligns with character boundaries. Always use Utf8::createBuffer() for streams where data arrives in pieces — direct validation of raw chunks can fail on split characters (e.g., mid-emoji).
Silent sanitization: sanitize() replaces invalid sequences with U+FFFD (�), which may silently hide data loss. Prefer validate() for strict validation when invalid data is unacceptable.
Empty or null chunks: The library handles empty strings safely, but avoid passing null — validate input before calling methods.
Non-string inputs: All methods expect strings. Cast integers, floats, or arrays explicitly to strings to avoid unexpected behavior (e.g., strval($data) or '' . $data).
No BOM handling: The package does not strip or expect a BOM. Strip it manually (ltrim($data, "\xEF\xBB\xBF")) if required.
Performance: Micro-benchmarks show near-zero overhead for valid UTF-8. For high-throughput apps, reuse buffers (via createBuffer()) rather than instantiating new instances per chunk.
Testing tip: Use tools like random_int() + utf8_encode() to generate edge-case test data (e.g., overlong encodings, surrogates, out-of-range codepoints) — the examples folder includes a few test vectors.
How can I help you explore Laravel packages today?