droath/laravel-text-chunker
Flexible Laravel text chunking for AI/LLM apps. Split content into smaller chunks by characters, tokens, sentences, or markdown-aware rules. Fluent, strategy-based API ideal for fitting token limits, RAG pipelines, and custom domain splitting.
Install the package via Composer, then start chunking text immediately using the provided facade. No service provider registration is needed—Laravel auto-discovers it. Your first use case will likely be splitting text for an LLM API (e.g., OpenAI) where token limits matter: use TextChunker::strategy('token')->size(500)->chunk($longText). For simpler cases, strategy('character') offers predictable, fixed-size chunks. Check config/text-chunker.php after publishing to set defaults (e.g., default strategy, sentence abbreviations, token model). Read the Basic Usage section in the README first—it covers all common patterns in under 5 minutes.
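To make "predictable, fixed-size chunks" concrete, here is a minimal plain-PHP sketch of character-based splitting. It does not use the package and is not its implementation: the function name chunkByCharacters is hypothetical, and the code assumes the mbstring extension is available and that the input is UTF-8.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: split text into fixed-size chunks measured in
 * characters rather than bytes. Illustrative only, not the package's code.
 *
 * @return list<string>
 */
function chunkByCharacters(string $text, int $size): array
{
    if ($size < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1.');
    }

    $chunks = [];
    $length = mb_strlen($text, 'UTF-8');

    // Step through the text one chunk at a time; the last chunk may be shorter.
    for ($offset = 0; $offset < $length; $offset += $size) {
        $chunks[] = mb_substr($text, $offset, $size, 'UTF-8');
    }

    return $chunks;
}

$chunks = chunkByCharacters('Laravel makes chunking text for LLMs simple.', 10);
print_r($chunks);
```

Every chunk except possibly the last has exactly the requested number of characters, which is what makes this approach predictable. Token-based chunking differs in that the unit is a model-specific token, so chunk lengths in characters vary.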
[…] TextChunkerManager and chain with validation/normalization.
Use the token strategy with gpt-4 or gpt-3.5-turbo, depending on model pricing and context windows.
overlap(20) with the sentence or markdown strategies retains continuity across embeddings. This is critical when chunk boundaries split key semantic units.
The markdown strategy avoids breaking code blocks, lists, or headers mid-element. Combine it with size(100) and overlap(15) for balanced retrieval granularity.
Custom strategies (e.g., WordStrategy, paragraph-based, or section-aware) can be added by implementing ChunkerStrategyInterface and registering them via extend() in a service provider or config.
A configured chain (e.g., ->size(250)->overlap(10)) can be reused across requests without side effects.
The token strategy relies on yethee/tiktoken, which caches encodings per model. Memory usage spikes if many long texts are chunked concurrently; consider chunking in batches and clearing PHP arrays between batches.
overlap(20) adds 20% of the previous chunk's content, not 20% of the current chunk's size. Expect the overlapping text to slightly increase total output size, especially with short, dense chunks.
Even when size(100) is set, a block that exceeds the limit stays intact rather than being split. This preserves integrity but may produce uneven chunk sizes.
Validation happens at chunk() time: misconfigurations (e.g., invalid overlap, missing size) throw exceptions only when chunk() runs, not when the chain is built. Wrap calls in try/catch for ChunkerException.
The sentence strategy's abbreviation list is case-sensitive. 'Dr.' won't match 'dr.' unless you add both manually. Consider normalizing input text first.
The HasOverlap trait is provided but not enforced. It only supplies the overlap logic and does not apply it automatically, so a custom strategy that needs overlap must invoke it explicitly.
start_position and end_position are UTF-8 safe and 0-indexed. Use them to map chunks back to source locations (e.g., for highlighting or source attribution in RAG). Avoid string functions that assume single-byte characters.
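The overlap and position notes can be illustrated without the package. The sketch below is plain PHP under stated assumptions: chunkWithOverlap is a hypothetical function, overlap is interpreted as a percentage of the previous chunk's length that is repeated at the start of the next chunk, and start_position/end_position are treated as 0-indexed character offsets with an exclusive end. The package's actual semantics may differ; the mbstring extension and UTF-8 input are assumed.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: fixed-size character chunks where each chunk after
 * the first begins with the tail of the previous chunk.
 *
 * Assumptions (not confirmed by the package): overlap is a percentage of the
 * previous chunk's length, and end_position is exclusive.
 *
 * @return list<array{text: string, start_position: int, end_position: int}>
 */
function chunkWithOverlap(string $text, int $size, int $overlapPercent): array
{
    if ($size < 1 || $overlapPercent < 0 || $overlapPercent >= 100) {
        throw new InvalidArgumentException('Invalid size or overlap.');
    }

    $length = mb_strlen($text, 'UTF-8');
    $chunks = [];
    $previousLength = 0;

    for ($offset = 0; $offset < $length; $offset += $size) {
        // Number of characters to repeat from the end of the previous chunk.
        $overlap = intdiv($previousLength * $overlapPercent, 100);
        $start = $offset - $overlap;
        $end = min($offset + $size, $length);

        $chunks[] = [
            'text' => mb_substr($text, $start, $end - $start, 'UTF-8'),
            'start_position' => $start,
            'end_position' => $end,
        ];

        $previousLength = $end - $start;
    }

    return $chunks;
}

$source = 'Café déjà vu: naïve façade, résumé.';
$chunks = chunkWithOverlap($source, 10, 20);

foreach ($chunks as $chunk) {
    // Map the chunk back to the source using character-aware slicing.
    $slice = mb_substr(
        $source,
        $chunk['start_position'],
        $chunk['end_position'] - $chunk['start_position'],
        'UTF-8'
    );
    echo $chunk['start_position'], '-', $chunk['end_position'], ': ', $slice, PHP_EOL;
}
```

Because the sample text contains multi-byte characters, slicing with byte-oriented substr() at the same offsets would return different text and could cut a character in half. The summed chunk lengths also exceed the source length, which is the output growth that overlap introduces.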