yethee/tiktoken
PHP port of OpenAI’s tiktoken tokenizer. Get encoders by model name, encode text to token IDs, and cache vocab files for speed. Optional experimental Rust/FFI “lib mode” for faster encoding of medium/large texts.
composer require yethee/tiktoken
use Yethee\Tiktoken\EncoderProvider;
$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-3.5-turbo-0301');
$tokens = $encoder->encode('Hello world!');
// Returns: [9906, 1917, 0]
$supportedModels = $provider->getSupportedModels();
// Returns array of supported model names (e.g., 'gpt-3.5-turbo', 'gpt-4')
$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-4');
$tokenCount = count($encoder->encode("Your prompt here"));
if ($tokenCount > 8000) {
throw new \Exception("Prompt exceeds GPT-4's 8k token limit");
}
// Proceed with API call
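The check above counts only the prompt, yet the context window is shared by the prompt and the completion. The following is a minimal sketch of a budget check, assuming the base GPT-4 window of 8,192 tokens. The helper name remainingCompletionBudget and the reserved reply size are illustrative and not part of yethee/tiktoken.

```php
<?php

declare(strict_types=1);

/**
 * Returns how many tokens remain for the completion once the prompt is counted.
 * Hypothetical helper; $promptTokens would come from count($encoder->encode($prompt)).
 */
function remainingCompletionBudget(int $promptTokens, int $contextWindow = 8192): int
{
    if ($promptTokens < 0 || $contextWindow <= 0) {
        throw new InvalidArgumentException('Prompt tokens must be >= 0 and the window must be > 0.');
    }

    // Clamp to zero: a negative value means the prompt alone overflows the window.
    return max(0, $contextWindow - $promptTokens);
}

$promptTokens = 7900;    // stand-in for count($encoder->encode($prompt))
$reservedForReply = 500; // assumed room we want to keep for the model's answer

if (remainingCompletionBudget($promptTokens) < $reservedForReply) {
    echo "Prompt leaves too little room for the reply\n";
}
```

Keeping the arithmetic in a small pure function makes the limit easy to unit test without loading a vocabulary.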
Use getForModel() for OpenAI model compatibility:
$encoder = $provider->getForModel('gpt-4-1106-preview');
$tokens = $encoder->encode("Your input text");
To select an encoding directly by name (e.g., p50k_base):
$encoder = $provider->get('p50k_base');
$tokens = $encoder->encode("Custom vocabulary text");
Keep the vocab cache under Laravel's storage directory for performance:
$provider = new EncoderProvider();
$provider->setVocabCache(storage_path('app/tiktoken_cache'));
// Cache persists across requests
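The cache only helps if the directory exists and PHP can write to it. A small sketch of a pre-flight check follows; prepareVocabCacheDir is a hypothetical helper, and the temp-dir path stands in for storage_path('app/tiktoken_cache').

```php
<?php

declare(strict_types=1);

/**
 * Creates the cache directory if needed and verifies PHP can write to it.
 * Hypothetical helper, not part of yethee/tiktoken.
 */
function prepareVocabCacheDir(string $dir): string
{
    // The second is_dir() covers the case where another process created it first.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Unable to create cache directory: {$dir}");
    }

    if (!is_writable($dir)) {
        throw new RuntimeException("Cache directory is not writable: {$dir}");
    }

    return $dir;
}

// In Laravel this would typically be storage_path('app/tiktoken_cache').
$cacheDir = prepareVocabCacheDir(sys_get_temp_dir() . '/tiktoken_cache');
// $provider->setVocabCache($cacheDir);
```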
// app/Http/Middleware/TokenLimit.php
public function handle($request, Closure $next) {
    $provider = new EncoderProvider();
    $encoder = $provider->getForModel('gpt-3.5-turbo');
    // Cast to string so a missing 'prompt' field counts as empty instead of passing null to encode()
    $tokenCount = count($encoder->encode((string) $request->input('prompt', '')));
    if ($tokenCount > config('ai.max_tokens')) {
        return response()->json(['error' => 'Prompt too long'], 400);
    }
    return $next($request);
}
$encoder = $provider->getForModel('text-embedding-3-large');
$text = "Very long document...";
$chunkSize = 5000; // Maximum tokens per chunk
$chunks = [];
// encodeInChunks() already caps each chunk at $chunkSize tokens,
// so no manual token counting is needed inside the loop
foreach ($encoder->encodeInChunks($text, $chunkSize) as $chunk) {
    $chunks[] = $chunk;
    // Process chunk
}
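Once a document is split, the chunks often have to be grouped into requests. The sketch below assumes each chunk is a list of token IDs, as produced by the chunking loop above, and that the downstream API caps the total tokens per request. The function batchChunks and the cap of 5 are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Groups token chunks into batches whose combined token count stays within the limit.
 *
 * @param list<list<int>> $chunks
 * @return list<list<list<int>>>
 */
function batchChunks(array $chunks, int $maxTokensPerBatch): array
{
    $batches = [];
    $current = [];
    $currentTokens = 0;

    foreach ($chunks as $chunk) {
        $size = count($chunk);
        if ($size > $maxTokensPerBatch) {
            throw new InvalidArgumentException('A single chunk exceeds the batch limit.');
        }

        // Start a new batch when adding this chunk would overflow the current one.
        if ($current !== [] && $currentTokens + $size > $maxTokensPerBatch) {
            $batches[] = $current;
            $current = [];
            $currentTokens = 0;
        }

        $current[] = $chunk;
        $currentTokens += $size;
    }

    // Flush the last, partially filled batch.
    if ($current !== []) {
        $batches[] = $current;
    }

    return $batches;
}

// Stand-in data: chunks of 3, 2 and 3 tokens, at most 5 tokens per batch.
$batches = batchChunks([[1, 2, 3], [4, 5], [6, 7, 8]], 5);
echo count($batches), "\n"; // prints 2
```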
// config/app.php
'providers' => [
Yethee\Tiktoken\TiktokenServiceProvider::class,
],
// app/Providers/AppServiceProvider.php
public function register() {
$this->app->singleton(EncoderProvider::class, function ($app) {
$provider = new EncoderProvider();
$provider->setVocabCache($app['config']['tiktoken.cache_dir']);
return $provider;
});
}
$model = request()->input('model', 'gpt-3.5-turbo');
$provider = app(EncoderProvider::class);
$encoder = $provider->getForModel($model);
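A model name taken straight from the request may be misspelled or one the application does not intend to support. A minimal sketch of an allowlist check to run before getForModel() follows; resolveModel and the allowlist contents are hypothetical, application-level choices.

```php
<?php

declare(strict_types=1);

/**
 * Returns the requested model if it is on the allowlist, otherwise the default.
 * Hypothetical helper; matching is exact and case-sensitive.
 *
 * @param list<string> $allowed
 */
function resolveModel(?string $requested, array $allowed, string $default): string
{
    if ($requested !== null && in_array($requested, $allowed, true)) {
        return $requested;
    }

    return $default;
}

$allowed = ['gpt-3.5-turbo', 'gpt-4'];

echo resolveModel('gpt-4', $allowed, 'gpt-3.5-turbo'), "\n"; // gpt-4
echo resolveModel('GPT-4', $allowed, 'gpt-3.5-turbo'), "\n"; // gpt-3.5-turbo
echo resolveModel(null, $allowed, 'gpt-3.5-turbo'), "\n";    // gpt-3.5-turbo
```

Falling back to a default keeps the request working; rejecting unknown names with a validation error is an equally reasonable design.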
use Illuminate\Support\Facades\Log;
$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-4');
$tokens = $encoder->encode("User input");
Log::info('Token count', [
'model' => 'gpt-4',
'tokens' => count($tokens),
'user_id' => auth()->id(),
]);
// app/Jobs/TokenizeDocument.php
public function handle() {
$provider = new EncoderProvider();
$encoder = $provider->get('p50k_base');
$tokens = $encoder->encode($this->document->text);
// Store tokens or process further
}
public function testTokenization() {
    $provider = new EncoderProvider();
    $encoder = $provider->getForModel('gpt-3.5-turbo');
    // The expected IDs below are the encoding of 'Hello world!', so encode that exact string
    $tokens = $encoder->encode('Hello world!');
    $this->assertEquals([9906, 1917, 0], $tokens);
}
Cache Directory Permissions
Ensure TIKTOKEN_CACHE_DIR (or the default temp dir) is writable by PHP.
Fix: chmod -R 755 storage/app/tiktoken_cache or use sys_get_temp_dir().
Model Name Mismatches
getForModel() is case-sensitive and exact. Use getSupportedModels() to verify.
Log::info('Supported models:', $provider->getSupportedModels());
GPT-2 Unsupported
Expect an InvalidArgumentException for GPT-2 models.
Fix: use p50k_base or cl100k_base as fallback.
Special Tokens Missing
<|endofprompt|> and similar special tokens are not supported.
LibEncoder Overhead
LibEncoder may slow down small texts (<100 tokens).
Fix: disable lib mode for short inputs ($provider = new EncoderProvider(false)).
Race Conditions in Cache
$provider->setVocabCache('cache');
Token Count Mismatches
Large Text Memory Issues
Use encodeInChunks() or process in smaller batches.
Check Vocabulary Loading
$vocab = $provider->getVocab('p50k_base');
Log::debug('Vocab size', ['size' => $vocab->size()]); // Log context must be an array
Enable Verbose Logging
$provider = new EncoderProvider();
$provider->setVerbose(true); // Logs cache misses/hits
Validate Token IDs
$tokens = $encoder->encode("Test");
// PHPUnit has no assertAll(); check each token ID individually
foreach ($tokens as $token) {
    $this->assertIsInt($token);
    $this->assertGreaterThanOrEqual(0, $token);
}
Benchmark Performance
$start = microtime(true);
$tokens = $encoder->encode(str_repeat("a", 1000));
Log::info('Tokenization time', ['seconds' => microtime(true) - $start]); // Log context must be an array
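A single timed call is noisy, so averaging over several runs gives a steadier figure. The sketch below uses hrtime(), which is monotonic; averageMillis is a hypothetical helper, and the strlen() workload is a stand-in for a closure that calls $encoder->encode($text).

```php
<?php

declare(strict_types=1);

/**
 * Runs $fn repeatedly and returns the average duration per call in milliseconds.
 * Hypothetical helper, not part of yethee/tiktoken.
 */
function averageMillis(callable $fn, int $iterations = 100): float
{
    if ($iterations < 1) {
        throw new InvalidArgumentException('At least one iteration is required.');
    }

    $fn(); // Warm-up run so one-time costs (e.g. vocab loading) are not measured.

    $start = hrtime(true); // Nanoseconds from a monotonic clock.
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }

    // Nanoseconds per call, converted to milliseconds.
    return (hrtime(true) - $start) / $iterations / 1e6;
}

$text = str_repeat('a', 1000);

// Stand-in workload; with the library this would be fn () => $encoder->encode($text).
$avg = averageMillis(fn () => strlen($text), 50);
printf("Average: %.4f ms\n", $avg);
```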
Custom Vocabularies
$customVocab = new \Yethee\Tiktoken\Vocab\Vocab($customTokens);
$provider->registerVocab('custom', $customVocab);
$encoder = $provider->get('custom');
Override Encoder Provider
$provider = new class extends EncoderProvider {
protected function createEncoder(string $vocabName): Encoder {
return new CustomEncoder($this->getVocab($vocabName));
}
};
Hook into Vocab Loading
$provider->setVocabLoader(function ($uri, $checksum) {
    return (new CustomVocabLoader())->load($uri, $checksum); // Parentheses are required around `new` before PHP 8.4
});
Extend Tokenization Logic
$tokens = $encoder->encode("Text");
$processed = array_map(function ($token) {
return $token * 2; // Example: Custom token transformation
}, $tokens);
$provider = new EncoderProvider();
foreach ($provider->getSupportedModels() as $model) {
$