droath/laravel-text-chunker
Flexible Laravel text chunking for AI/LLM apps. Split content into smaller chunks by characters, tokens, sentences, or markdown-aware rules. Fluent, strategy-based API ideal for fitting token limits, RAG pipelines, and custom domain splitting.
A Laravel package that provides flexible, strategy-based text chunking capabilities for AI/LLM applications. Split text into smaller segments using character count, token count, sentence boundaries, or markdown-aware strategies with a fluent, Laravel-friendly API.
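Perfect for fitting content within LLM token limits, building RAG pipelines, and custom domain-specific splitting.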
Install the package via Composer:
composer require droath/laravel-text-chunker
The package will automatically register itself via Laravel's auto-discovery.
Optionally, publish the configuration file:
php artisan vendor:publish --tag="text-chunker-config"
This will create a config/text-chunker.php file where you can customize
default settings:
return [
// Default strategy to use when none is specified
'default_strategy' => 'character',
// Strategy-specific configurations
'strategies' => [
'token' => [
// Default OpenAI model for token encoding
'model' => 'gpt-4',
],
'sentence' => [
// Abbreviations that should not trigger sentence breaks
'abbreviations' => ['Dr', 'Mr', 'Mrs', 'Ms', 'Prof', 'Sr', 'Jr'],
],
],
// Register custom strategies here
'custom_strategies' => [
// 'my-strategy' => \App\TextChunking\MyCustomStrategy::class,
],
];
Split text at exact character count boundaries:
use Droath\TextChunker\Facades\TextChunker;
$text = "Your long text content here...";
$chunks = TextChunker::strategy('character')
->size(100)
->chunk($text);
foreach ($chunks as $chunk) {
echo "Chunk {$chunk->index}: {$chunk->text}\n";
echo "Position: {$chunk->start_position} to {$chunk->end_position}\n";
}
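To make the offsets concrete, here is a minimal sketch of fixed-size character chunking in plain PHP. It is not the package's implementation; `sketchCharacterChunks` and the array shape it returns are hypothetical and only mirror the `index`, `start_position`, and `end_position` fields used above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the package): split $text into
 * fixed-size character chunks and record each chunk's offsets.
 *
 * @return list<array{text: string, index: int, start_position: int, end_position: int}>
 */
function sketchCharacterChunks(string $text, int $size): array
{
    if ($size <= 0) {
        throw new InvalidArgumentException('Chunk size must be greater than zero');
    }

    $length = mb_strlen($text);
    $chunks = [];

    for ($start = 0, $index = 0; $start < $length; $start += $size, $index++) {
        $piece = mb_substr($text, $start, $size);

        $chunks[] = [
            'text' => $piece,
            'index' => $index,
            'start_position' => $start,                   // inclusive
            'end_position' => $start + mb_strlen($piece), // exclusive
        ];
    }

    return $chunks;
}

// "abcdefghij" with size 4 yields "abcd" (0-4), "efgh" (4-8), "ij" (8-10).
foreach (sketchCharacterChunks('abcdefghij', 4) as $chunk) {
    echo "{$chunk['index']}: {$chunk['text']} ({$chunk['start_position']}-{$chunk['end_position']})\n";
}
```

The last chunk is shorter than the requested size because the text runs out; its end position equals the text length.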
Split text by OpenAI token count (useful for keeping requests within a model's token limit):
use Droath\TextChunker\Facades\TextChunker;
$text = "Your long text content here...";
$chunks = TextChunker::strategy('token')
->size(500) // 500 tokens per chunk
->chunk($text);
// Use a different OpenAI model for encoding
$chunks = TextChunker::strategy('token', ['model' => 'gpt-3.5-turbo'])
->size(500)
->chunk($text);
Supported Models:
- gpt-4
- gpt-3.5-turbo
- text-davinci-003
Split text at sentence boundaries:
use Droath\TextChunker\Facades\TextChunker;
$text = "First sentence. Second sentence. Third sentence.";
$chunks = TextChunker::strategy('sentence')
->size(2) // 2 sentences per chunk
->chunk($text);
// Custom abbreviations
$chunks = TextChunker::strategy('sentence', [
'abbreviations' => ['Dr', 'Mr', 'Mrs', 'Ph.D']
])
->size(3)
->chunk($text);
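The role of the abbreviation list can be illustrated with a small standalone sketch. This is an assumption about how abbreviation-aware splitting might work, not the package's actual algorithm; `sketchSplitSentences` is a hypothetical helper, and it normalizes the whitespace between sentences to a single space.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: split text into sentences, ignoring periods that
 * directly follow one of the given abbreviations.
 *
 * @param list<string> $abbreviations
 * @return list<string>
 */
function sketchSplitSentences(string $text, array $abbreviations): array
{
    // Candidate boundaries: whitespace preceded by ".", "!" or "?".
    $parts = preg_split('/(?<=[.!?])\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $sentences = [];
    $buffer = '';

    foreach ($parts as $part) {
        $buffer = $buffer === '' ? $part : $buffer . ' ' . $part;

        // Look at the last word of the buffer, without its trailing period.
        $words = explode(' ', $buffer);
        $lastWord = rtrim(end($words), '.');

        if (str_ends_with($buffer, '.') && in_array($lastWord, $abbreviations, true)) {
            continue; // The period belongs to an abbreviation; keep accumulating.
        }

        $sentences[] = $buffer;
        $buffer = '';
    }

    if ($buffer !== '') {
        $sentences[] = $buffer;
    }

    return $sentences;
}

$sentences = sketchSplitSentences('Dr. Smith arrived. He met Mrs. Jones! Done.', ['Dr', 'Mrs']);

// Group two sentences per chunk, mirroring ->size(2) above.
$groups = array_map(
    fn (array $group): string => implode(' ', $group),
    array_chunk($sentences, 2)
);
```

Without the abbreviation check, "Dr." and "Mrs." would each end a sentence and the chunks would break mid-name.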
Preserve markdown structure when chunking:
use Droath\TextChunker\Facades\TextChunker;
$markdown = <<<'MD'
# Heading 1
Some content here.
```php
function example() {
return "code block";
}
```
- List item 1
- List item 2
MD;
$chunks = TextChunker::strategy('markdown')
->size(100) // Target size in characters
->chunk($markdown);
// Markdown elements (code blocks, headers, lists, blockquotes, horizontal rules)
// are never split in the middle, even if they exceed the chunk size
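The "never split an element" rule can be sketched as greedy packing of whole blocks. The version below is a deliberately simplified assumption: it treats every blank-line-separated block as one element, so a fenced code block that contains blank lines would need a real markdown parser. `sketchPackBlocks` is a hypothetical name and not part of the package.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: pack blank-line-separated blocks into chunks of
 * roughly $size characters without ever splitting a block.
 *
 * @return list<string>
 */
function sketchPackBlocks(string $markdown, int $size): array
{
    $blocks = preg_split('/\n{2,}/', trim($markdown), -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $current = '';

    foreach ($blocks as $block) {
        $candidate = $current === '' ? $block : $current . "\n\n" . $block;

        if ($current !== '' && mb_strlen($candidate) > $size) {
            $chunks[] = $current; // Close the current chunk...
            $current = $block;    // ...and start the next one with the whole block.
            continue;
        }

        // A block that alone exceeds $size is still kept whole.
        $current = $candidate;
    }

    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}

$markdown = "# Title\n\nShort paragraph.\n\n" . str_repeat('x', 50) . "\n\n- a\n- b";
$chunks = sketchPackBlocks($markdown, 30);
// The 50-character block ends up alone in its own chunk, larger than the target size.
```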
Add percentage-based overlap between chunks to maintain context (ideal for RAG systems):
use Droath\TextChunker\Facades\TextChunker;
$text = "Your long text content here...";
$chunks = TextChunker::strategy('character')
->size(100)
->overlap(20) // 20% overlap between chunks
->chunk($text);
// Each chunk will include 20% of the previous chunk's content
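One way to read a percentage overlap is as a reduced step between chunk starts. The sketch below assumes the overlap is computed as a percentage of the chunk size and that each chunk begins that many characters before the previous one ended; the package may compute it differently. `sketchOverlappingChunks` is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: fixed-size character chunks where each chunk starts
 * $overlapPercent percent of $size before the previous chunk ended.
 *
 * @return list<string>
 */
function sketchOverlappingChunks(string $text, int $size, int $overlapPercent): array
{
    if ($size <= 0) {
        throw new InvalidArgumentException('Chunk size must be greater than zero');
    }
    if ($overlapPercent < 0 || $overlapPercent > 100) {
        throw new InvalidArgumentException('Overlap percentage must be between 0 and 100');
    }

    $overlapChars = intdiv($size * $overlapPercent, 100);
    // Advance by at least one character so the loop terminates even at 100%.
    $step = max(1, $size - $overlapChars);

    $length = mb_strlen($text);
    $chunks = [];

    for ($start = 0; $start < $length; $start += $step) {
        $chunks[] = mb_substr($text, $start, $size);

        if ($start + $size >= $length) {
            break; // This chunk already reached the end of the text.
        }
    }

    return $chunks;
}

// Size 10 with 20% overlap: 2 shared characters, so chunks start every 8 characters.
$chunks = sketchOverlappingChunks('abcdefghijklmnopqrst', 10, 20);
// ["abcdefghij", "ijklmnopqr", "qrst"]
```

The second chunk begins with "ij", the last two characters of the first chunk, which is the context a retrieval system would otherwise lose at the boundary.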
Overlap works with all strategies.
Each chunk is returned as an immutable value object with metadata:
$chunks = TextChunker::strategy('character')->size(100)->chunk($text);
foreach ($chunks as $chunk) {
$chunk->text; // The chunk text content
$chunk->index; // Zero-based index (0, 1, 2, ...)
$chunk->start_position; // Character offset in original text (inclusive)
$chunk->end_position; // Character offset in original text (exclusive)
}
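The inclusive start and exclusive end offsets mean a chunk's text can be recovered from the original string with `mb_substr`. The sketch below uses a hypothetical stand-in class, `SketchChunk`, built with PHP 8.1 readonly properties; the package's own `Chunk` class may be implemented differently.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the package's Chunk class (requires PHP 8.1+).
final class SketchChunk
{
    public function __construct(
        public readonly string $text,
        public readonly int $index,
        public readonly int $start_position,
        public readonly int $end_position,
    ) {}
}

$source = 'Hello, world!';
$chunk = new SketchChunk(text: 'world', index: 1, start_position: 7, end_position: 12);

// Offsets are [start, end), so the length is end minus start.
$recovered = mb_substr(
    $source,
    $chunk->start_position,
    $chunk->end_position - $chunk->start_position
);

// Writing to a readonly property throws an Error, which is what makes the object immutable.
try {
    $chunk->text = 'changed';
    $mutated = true;
} catch (Error $e) {
    $mutated = false;
}
```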
Instead of the facade, you can inject the manager:
use Droath\TextChunker\TextChunkerManager;
class MyService
{
public function __construct(
protected TextChunkerManager $chunker
) {}
public function processText(string $text): array
{
return $this->chunker
->strategy('token')
->size(500)
->overlap(10)
->chunk($text);
}
}
Create your own chunking strategies by implementing the
ChunkerStrategyInterface:
<?php
declare(strict_types=1);
namespace App\TextChunking;
use Droath\TextChunker\DataObjects\Chunk;
use Droath\TextChunker\Concerns\HasOverlap;
use Droath\TextChunker\Contracts\ChunkerStrategyInterface;
class WordStrategy implements ChunkerStrategyInterface
{
use HasOverlap; // Optional: for overlap support
public function chunk(string $text, int $size, array $options): array
{
$words = explode(' ', $text);
$chunks = [];
$index = 0;
$position = 0;
foreach (array_chunk($words, $size) as $wordChunk) {
$chunkText = implode(' ', $wordChunk);
$chunkLength = mb_strlen($chunkText);
$chunks[] = new Chunk(
text: $chunkText,
index: $index++,
start_position: $position,
end_position: $position + $chunkLength
);
$position += $chunkLength + 1; // +1 for space
}
return $chunks;
}
}
Option A: Via Configuration
Add to config/text-chunker.php:
return [
'custom_strategies' => [
'word' => \App\TextChunking\WordStrategy::class,
],
];
Option B: At Runtime
use Droath\TextChunker\Facades\TextChunker;
use App\TextChunking\WordStrategy;
TextChunker::extend('word', WordStrategy::class);
$chunks = TextChunker::strategy('word')->size(50)->chunk($text);
Option C: In a Service Provider
use Droath\TextChunker\TextChunkerManager;
use App\TextChunking\WordStrategy;
public function boot(TextChunkerManager $chunker): void
{
$chunker->extend('word', WordStrategy::class);
}
The package provides a fluent, chainable API:
TextChunker::strategy(string $name, array $options = []) // Select strategy
->size(int $size) // Set chunk size
->overlap(int $percentage) // Set overlap (0-100)
->chunk(string $text) // Execute and return chunks
Method Details:
- strategy(string $name, array $options = []): Select chunking strategy
  - Built-in strategies: 'character', 'token', 'sentence', 'markdown'
  - Options are strategy-specific (e.g. ['model' => 'gpt-4'] for token strategy)
- size(int $size): Set target chunk size (required)
- overlap(int $percentage): Set overlap between chunks (optional)
- chunk(string $text): Execute chunking and return array of Chunk objects
  - Throws: ChunkerException on validation failures
  - Returns: array<int, Chunk>
All validation is deferred until the chunk() method is called:
use Droath\TextChunker\Facades\TextChunker;
use Droath\TextChunker\Exceptions\ChunkerException;
try {
$chunks = TextChunker::strategy('character')
->size(100)
->overlap(150) // Invalid: must be 0-100
->chunk($text);
} catch (ChunkerException $e) {
// Handle validation error
echo $e->getMessage(); // "Overlap percentage must be between 0 and 100"
}
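Deferred validation in a fluent API can be sketched with a small builder whose setters only record values and whose terminal method performs every check. `SketchChunkerBuilder` is hypothetical and uses `InvalidArgumentException` in place of the package's `ChunkerException`; the order of the checks is an assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical builder (not the package's class) illustrating deferred validation.
final class SketchChunkerBuilder
{
    private ?int $size = null;
    private int $overlap = 0;

    // Setters only record values; nothing is checked here.
    public function size(int $size): self
    {
        $this->size = $size;
        return $this;
    }

    public function overlap(int $percentage): self
    {
        $this->overlap = $percentage;
        return $this;
    }

    /** @return list<string> */
    public function chunk(string $text): array
    {
        // All checks run at the terminal call.
        if ($this->size === null) {
            throw new InvalidArgumentException('Chunk size must be set before calling chunk()');
        }
        if ($this->size <= 0) {
            throw new InvalidArgumentException('Chunk size must be greater than zero');
        }
        if ($this->overlap < 0 || $this->overlap > 100) {
            throw new InvalidArgumentException('Overlap percentage must be between 0 and 100');
        }
        if ($text === '') {
            throw new InvalidArgumentException('Text cannot be empty');
        }

        // Overlap is validated but not applied, to keep the sketch short.
        return mb_str_split($text, $this->size);
    }
}

$builder = (new SketchChunkerBuilder())->size(100)->overlap(150); // No exception yet

try {
    $builder->chunk('Some text');
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n"; // Overlap percentage must be between 0 and 100
}
```

Because nothing is validated until `chunk()` runs, a single try/catch around the terminal call is enough to handle every configuration error.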
Common Exceptions:
- "Chunk size must be set before calling chunk()"
- "Chunk size must be greater than zero"
- "Overlap percentage must be between 0 and 100"
- "Text cannot be empty"
- "Unknown chunking strategy: xyz. Available strategies: character, token, sentence, markdown"
- "Unsupported model: xyz"
Run the test suite:
composer test
Run with coverage:
composer test-coverage
Format code with Laravel Pint:
composer format
Run static analysis with PHPStan:
composer analyse
Please see CHANGELOG for more information on what has changed recently.
Please see CONTRIBUTING for details.
Please review our security policy on how to report security vulnerabilities.
The MIT License (MIT). Please see License File for more information.