yethee/tiktoken
A PHP port of OpenAI's tiktoken for fast tokenization. Get an encoder by model name or by encoding name and encode text to token IDs; vocabularies are cached by default, with a configurable cache directory. An optional, experimental FFI mode (backed by tiktoken-rs) offers better performance on larger texts.
Install via Composer:
composer require yethee/tiktoken
First Use Case: Tokenize a prompt before sending it to the OpenAI API to validate its length:
use Yethee\Tiktoken\EncoderProvider;
$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-4'); // Auto-selects correct encoder
$tokens = $encoder->encode("Your user input here...");
if (count($tokens) > 8192) { // GPT-4 (8k) context window
    throw new \RuntimeException("Prompt too long");
}
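The hard cut-off above can be refined: when you also request a completion, the prompt must leave room for the reply inside the same context window. A minimal sketch of that arithmetic, with an illustrative 8192-token window and 1024-token reserve (both assumptions, not library constants):

```php
<?php

// Tokens left for the model's reply after the prompt and a safety
// reserve are subtracted from the context window. The defaults are
// illustrative values, not constants exported by the library.
function remainingCompletionBudget(int $promptTokens, int $contextWindow = 8192, int $reserved = 1024): int
{
    return max(0, $contextWindow - $promptTokens - $reserved);
}
```

Pass count($encoder->encode($prompt)) as $promptTokens and reject the request when the budget reaches zero.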
Key Starting Points:
- EncoderProvider - Central class for model-aware encoders
- getForModel('model-name') - Get encoder by OpenAI model name
- get('encoding-name') - Get encoder by explicit encoding (e.g., 'p50k_base')
- encode(string $text) - Core method returning token IDs

Configuration:
- TIKTOKEN_CACHE_DIR=/path/to/cache
- $provider->setVocabCache('/custom/path')

1. Token Validation Middleware
// app/Http/Middleware/ValidatePromptLength.php
public function handle(Request $request, Closure $next)
{
    $encoder = app(EncoderProvider::class)->getForModel('gpt-4');
    $tokens = $encoder->encode($request->prompt);

    if (count($tokens) > config('ai.max_tokens')) {
        return response()->json(['error' => 'Prompt too long'], 400);
    }

    return $next($request);
}
2. Token-Aware Prompt Builder
// app/Services/PromptBuilder.php
class PromptBuilder
{
    public function __construct(private EncoderProvider $provider) {}

    public function build(string $userInput, string $model): string
    {
        $encoder = $this->provider->getForModel($model);
        $tokens = $encoder->encode($userInput);

        // Truncate if needed
        if (count($tokens) > 3000) {
            $tokens = array_slice($tokens, 0, 3000);
            $userInput = $encoder->decode($tokens);
        }

        return "User: $userInput\nAssistant:";
    }
}
3. Batch Processing with Chunking
// app/Jobs/ProcessDocument.php
public function handle()
{
    $encoder = app(EncoderProvider::class)->getForModel('text-embedding-3-small');
    $text = file_get_contents('large_document.txt');

    // Process in chunks (simplified - see Gotchas for chunking)
    $chunkSize = 1000; // Tokens
    $chunks = array_chunk($encoder->encode($text), $chunkSize);

    foreach ($chunks as $chunk) {
        $this->generateEmbedding($encoder->decode($chunk));
    }
}
Service Container Binding
The package is framework-agnostic and ships no Laravel service provider, so register the binding yourself:
// app/Providers/AppServiceProvider.php
public function register()
{
    $this->app->singleton(Yethee\Tiktoken\EncoderProvider::class);
}

// Then inject via constructor
public function __construct(private Yethee\Tiktoken\EncoderProvider $tiktoken) {}
Model-Specific Encoders
// app/Providers/AppServiceProvider.php
public function boot()
{
    $this->app->bind('gpt4.encoder', function () {
        return app(Yethee\Tiktoken\EncoderProvider::class)
            ->getForModel('gpt-4');
    });
}
Caching Strategies
// Cache encoder instances for performance
public function getEncoder(string $model): Yethee\Tiktoken\Encoder
{
    return cache()->remember("tiktoken.encoder.$model", now()->addHours(1), function () use ($model) {
        return app(Yethee\Tiktoken\EncoderProvider::class)
            ->getForModel($model);
    });
}
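Note that cache()->remember() serializes its value, and encoder instances carry a full vocabulary, so an external cache store can be a poor fit. Per-process memoization avoids serialization entirely; this sketch uses a plain static array, with the $factory closure as a hypothetical stand-in for a call to EncoderProvider::getForModel():

```php
<?php

// Build each encoder once per process and reuse it; nothing is
// serialized to an external store. $factory stands in for
// EncoderProvider::getForModel() so the pattern is self-contained.
final class EncoderRegistry
{
    /** @var array<string, mixed> */
    private static array $encoders = [];

    public static function for(string $model, callable $factory): mixed
    {
        // ??= only invokes the factory on the first lookup per model.
        return self::$encoders[$model] ??= $factory($model);
    }
}
```

In a Laravel app, binding the provider as a singleton achieves the same effect for the provider itself; this registry adds the same guarantee per model name.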
1. Cache Invalidation
php artisan cache:clear
$vocab = $provider->getVocabLoader()->load('custom.vocab', 'expected_checksum');
2. Token Counting Accuracy
Use the model-aware lookup (getForModel()) for accurate counts:
$model = 'gpt-4';
$encoder = $provider->getForModel($model);
assert($encoder->getEncodingName() === 'cl100k_base', "Wrong encoder for $model");
3. Chunking Limitations
If your installed version does not provide encodeInChunks(), chunk manually after encoding:
$tokens = $encoder->encode($text);
$chunkSize = 1000;

foreach (array_chunk($tokens, $chunkSize) as $chunk) {
    $decoded = $encoder->decode($chunk);
    // Process chunk
}
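For embeddings it often helps to overlap consecutive chunks so context at the boundaries isn't lost. A sketch of token-level chunking with overlap, using only array functions (the token IDs would come from $encoder->encode($text); the size and overlap values are illustrative):

```php
<?php

// Split a flat list of token IDs into fixed-size chunks where each
// chunk repeats the last $overlap tokens of the previous one.
function chunkWithOverlap(array $tokens, int $chunkSize, int $overlap): array
{
    if ($overlap >= $chunkSize) {
        throw new \InvalidArgumentException('overlap must be smaller than chunk size');
    }

    $chunks = [];
    $step = $chunkSize - $overlap; // advance by less than a full chunk

    for ($i = 0; $i < count($tokens); $i += $step) {
        $chunks[] = array_slice($tokens, $i, $chunkSize);
        if ($i + $chunkSize >= count($tokens)) {
            break; // the last chunk already reaches the end
        }
    }

    return $chunks;
}
```

Decode each chunk with $encoder->decode($chunk) before generating its embedding, as in the batch-processing job above.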
4. Performance Quirks
Run composer bench to compare before enabling LibEncoder.

5. Model Support Gaps
Newer models may not be mapped yet; see src/Vocab/VocabLoader.php for the supported encodings and special tokens such as <|endofprompt|>.

1. Custom Vocabularies
// Load custom vocabulary
$vocabLoader = $provider->getVocabLoader();
$customVocab = $vocabLoader->load('path/to/custom.vocab', 'checksum');
// Use with encoder
$encoder = new Yethee\Tiktoken\Encoder\NativeEncoder($customVocab);
2. LibEncoder Setup
// Initialize LibEncoder (do this once at app startup)
Yethee\Tiktoken\Encoder\LibEncoder::init(__DIR__.'/vendor/yethee/tiktoken/libtiktoken_php.so');
// Force lib mode for all encoders
$provider = new Yethee\Tiktoken\EncoderProvider(true);
3. Debugging Tokenization
// See what tokens correspond to which text
$tokens = $encoder->encode("Hello world!");
$tokenTexts = array_map(fn($id) => $encoder->decode([$id]), $tokens);
print_r($tokenTexts);
4. Memory Management
// Warm the vocab early (e.g. in a route or console command) rather than on the hot path
$provider->getForModel('gpt-4'); // Forces vocab load
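Loaded vocabularies are large arrays; if a long-running worker only needs one briefly, dropping the reference lets PHP reclaim the memory. A sketch with a stand-in array (the measurement pattern, not the library, is the point here):

```php
<?php

// A large array stands in for a loaded vocabulary.
$before = memory_get_usage();
$vocab = array_fill(0, 100_000, ['rank' => 0]); // placeholder data
$loaded = memory_get_usage();

unset($vocab);       // drop the only reference
gc_collect_cycles(); // optional here; unset alone frees non-cyclic data

$after = memory_get_usage();
```

The same applies to encoder instances you no longer need in a queue worker: unset them between jobs instead of holding them for the life of the process.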
5. Environment-Specific Config
// config/tiktoken.php
return [
    'cache_dir' => env('TIKTOKEN_CACHE_DIR', sys_get_temp_dir()),
    'use_lib' => env('TIKTOKEN_USE_LIB', false),
    'lib_path' => env('TIKTOKEN_LIB_PATH', null),
];
6. Handling Encoding Errors
try {
    $tokens = $encoder->encode($text);
} catch (Yethee\Tiktoken\Exception\InvalidText $e) {
    // Handle invalid UTF-8 or other encoding errors
    Log::error("Tokenization failed: {$e->getMessage()}");
    throw new \RuntimeException("Invalid input text");
}
1. Custom Encoder Implementation
// Implement Yethee\Tiktoken\Encoder interface
class MyCustomEncoder implements Yethee\Tiktoken\Encoder
{
    public function encode(string $text): array { /* ... */ }
    public function decode(array $tokens): string { /* ... */ }
    public function getEncodingName(): string { /* ... */ }
}

// Register with provider
$provider->registerEncoder('my_encoding', new MyCustomEncoder());
2. Vocab Loader Extensions
// Extend vocab loading logic
$loader = $provider->getVocabLoader();
$loader->addVocabSource('s3', function ($uri) {
    return file_get_contents("s3://$uri");
});
3. Token Filtering
// Filter tokens before processing
$tokens = $encoder->encode($text);
$filtered = array_filter($tokens, fn($id) => $id !== 0); // e.g. drop a specific token ID