Tiktoken Laravel Package

yethee/tiktoken

PHP port of OpenAI’s tiktoken tokenizer. Get encoders by model name, encode text to token IDs, and cache vocab files for speed. Optional experimental Rust/FFI “lib mode” for faster encoding of medium/large texts.


Getting Started

Minimal Setup

  1. Install the package:
    composer require yethee/tiktoken
    
  2. Basic tokenization:
    use Yethee\Tiktoken\EncoderProvider;
    
    $provider = new EncoderProvider();
    $encoder = $provider->getForModel('gpt-3.5-turbo-0301');
    $tokens = $encoder->encode('Hello world!');
    // Returns: [9906, 1917, 0]
    
  3. Check supported models:
    $supportedModels = $provider->getSupportedModels();
    // Returns array of supported model names (e.g., 'gpt-3.5-turbo', 'gpt-4')
    

First Use Case: Token Counting for API Calls

$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-4');
$tokenCount = count($encoder->encode("Your prompt here"));
if ($tokenCount > 8000) { // Conservative cutoff below GPT-4's 8,192-token context window
    throw new \Exception("Prompt exceeds the 8,000-token budget set for GPT-4");
}
// Proceed with API call
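
The fixed cutoff above leaves no explicit room for the model's reply. A minimal sketch of a budget check that reserves space for the completion follows. `fitsContext` and its parameters are hypothetical names, not part of yethee/tiktoken, and the tokenizer is passed in as a callable so the helper can be tried without the package installed.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of yethee/tiktoken): checks whether a prompt
 * plus a reserved reply budget fits into a model's context window.
 *
 * @param callable(string): array $encode Tokenizer, e.g. fn ($t) => $encoder->encode($t)
 */
function fitsContext(callable $encode, string $prompt, int $contextLimit, int $replyReserve): bool
{
    $promptTokens = count($encode($prompt));

    return $promptTokens + $replyReserve <= $contextLimit;
}

// Stand-in tokenizer for illustration only: one "token" per whitespace-separated word.
$fakeEncode = static fn (string $text): array
    => preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

var_dump(fitsContext($fakeEncode, 'one two three', 10, 5)); // 3 + 5 <= 10
var_dump(fitsContext($fakeEncode, 'one two three', 10, 8)); // 3 + 8 > 10
```

With the real package, the callable would wrap `$encoder->encode()` and the limit would be the context size of the chosen model. Chat-style requests add formatting tokens per message, so a count taken from the raw prompt text is best treated as a lower bound.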

Implementation Patterns

1. Model-Specific Tokenization

Use getForModel() for OpenAI model compatibility:

$encoder = $provider->getForModel('gpt-4-1106-preview');
$tokens = $encoder->encode("Your input text");

2. Vocabulary-Based Encoding

To select an encoding by name (e.g., p50k_base) instead of by model:

$encoder = $provider->get('p50k_base');
$tokens = $encoder->encode("Custom vocabulary text");

3. Caching Integration

Store the downloaded vocab files in a directory under Laravel's storage path so they are reused across requests:

$provider = new EncoderProvider();
$provider->setVocabCache(storage_path('app/tiktoken_cache'));
// Cache persists across requests

4. Middleware for Token Validation

// app/Http/Middleware/TokenLimit.php
public function handle($request, Closure $next) {
    $provider = new EncoderProvider();
    $encoder = $provider->getForModel('gpt-3.5-turbo');
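    // Cast to string so a missing prompt is counted as empty instead of raising a TypeError
    $tokenCount = count($encoder->encode((string) $request->input('prompt', '')));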
    if ($tokenCount > config('ai.max_tokens')) {
        return response()->json(['error' => 'Prompt too long'], 400);
    }
    return $next($request);
}

5. Batch Processing with Chunks

$encoder = $provider->getForModel('text-embedding-3-large');
$text = "Very long document...";
$chunkSize = 5000; // Maximum tokens per chunk
$chunks = [];

// encodeInChunks() returns a list of token arrays, each holding at most $chunkSize tokens
foreach ($encoder->encodeInChunks($text, $chunkSize) as $chunk) {
    // Process each chunk here (e.g., send it to the embeddings endpoint) or collect it
    $chunks[] = $chunk;
}

6. Service Container Binding

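// yethee/tiktoken is a framework-agnostic library, so there is no package
// service provider to register in config/app.php. Bind EncoderProvider yourself.
// app/Providers/AppServiceProvider.php
use Yethee\Tiktoken\EncoderProvider;

public function register() {
    $this->app->singleton(EncoderProvider::class, function ($app) {
        $provider = new EncoderProvider();
        // 'tiktoken.cache_dir' is an application-defined config key, not one shipped by the package
        $provider->setVocabCache($app['config']['tiktoken.cache_dir']);
        return $provider;
    });
}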

7. Dynamic Model Selection

$model = request()->input('model', 'gpt-3.5-turbo');
$provider = app(EncoderProvider::class);
$encoder = $provider->getForModel($model);
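
Passing a request value straight to getForModel() means any unsupported or misspelled name reaches the provider. One way to guard against that is an application-side allow-list, sketched below. `resolveModel` and the `$allowed` list are hypothetical and chosen for illustration; they are not part of the package.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the requested model only if it is on an
 * application-defined allow-list, otherwise the default.
 */
function resolveModel(?string $requested, array $allowed, string $default): string
{
    // Strict comparison keeps the match exact and case-sensitive.
    if ($requested !== null && in_array($requested, $allowed, true)) {
        return $requested;
    }

    return $default;
}

$allowed = ['gpt-3.5-turbo', 'gpt-4']; // Assumption: the models this application accepts

echo resolveModel('gpt-4', $allowed, 'gpt-3.5-turbo'), PHP_EOL; // on the list, returned as-is
echo resolveModel('GPT-4', $allowed, 'gpt-3.5-turbo'), PHP_EOL; // case mismatch, falls back to the default
echo resolveModel(null, $allowed, 'gpt-3.5-turbo'), PHP_EOL;    // nothing requested, falls back to the default
```

In a controller, the resolved name would then be handed to `$provider->getForModel()`, once the request value has been confirmed to be a string.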

8. Logging Token Usage

use Illuminate\Support\Facades\Log;

$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-4');
$tokens = $encoder->encode("User input");
Log::info('Token count', [
    'model' => 'gpt-4',
    'tokens' => count($tokens),
    'user_id' => auth()->id(),
]);

9. Queue Jobs for Heavy Tokenization

// app/Jobs/TokenizeDocument.php
public function handle() {
    $provider = new EncoderProvider();
    $encoder = $provider->get('p50k_base');
    $tokens = $encoder->encode($this->document->text);
    // Store tokens or process further
}

10. Testing Tokenization

public function testTokenization() {
    $provider = new EncoderProvider();
    $encoder = $provider->getForModel('gpt-3.5-turbo');
    // Same input as the Getting Started example, so the expected IDs match
    $tokens = $encoder->encode("Hello world!");
    $this->assertSame([9906, 1917, 0], $tokens);
}

Gotchas and Tips

Pitfalls

  1. Cache Directory Permissions

    • Ensure TIKTOKEN_CACHE_DIR (or default temp dir) is writable by PHP.
    • Fix: Make the directory writable by the user PHP runs as (e.g., chmod -R 755 storage/app/tiktoken_cache when that user owns it), or fall back to sys_get_temp_dir().
  2. Model Name Mismatches

    • getForModel() is case-sensitive and exact. Use getSupportedModels() to verify.
    • Fix: Log supported models on startup:
      Log::info('Supported models:', $provider->getSupportedModels());
      
  3. GPT-2 Unsupported

    • Throws InvalidArgumentException for GPT-2 models.
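    • Workaround: Use p50k_base or cl100k_base as a fallback for rough estimates only; token IDs and counts from these encodings are not guaranteed to match GPT-2's.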
  4. Special Tokens Missing

    • <|endofprompt|> and similar are not supported.
    • Workaround: Prepend/suffix text manually if needed.
  5. LibEncoder Overhead

    • FFI-based LibEncoder may slow down small texts (<100 tokens).
    • Tip: Use native encoder by default ($provider = new EncoderProvider(false)).
  6. Race Conditions in Cache

    • Vocabulary cache updates can fail under high concurrency.
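    • Fix: setVocabCache() expects a directory path, not the name of a Laravel cache store. Warm the cache once during deployment so that concurrent requests only read existing files:
      $provider->setVocabCache(storage_path('app/tiktoken_cache'));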
      
  7. Token Count Mismatches

    • OpenAI’s API and this package may differ slightly for edge cases.
    • Tip: Validate with OpenAI’s API for critical use cases.
  8. Large Text Memory Issues

    • Encoding very long texts (>1MB) may hit PHP memory limits.
    • Fix: Use encodeInChunks() or process in smaller batches.
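
The cache-directory pitfall above can be caught early with a small startup check. The sketch below is plain PHP and assumes nothing about the package. `ensureWritableDir` is a hypothetical helper name, and the demo path under the system temp directory stands in for whatever directory the application actually uses.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: makes sure a cache directory exists and is writable,
 * creating it if necessary. Throws if the directory cannot be used.
 */
function ensureWritableDir(string $dir): string
{
    // The second is_dir() covers the case where another process created it first.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create cache directory: {$dir}");
    }

    if (!is_writable($dir)) {
        throw new RuntimeException("Cache directory is not writable: {$dir}");
    }

    return $dir;
}

// Demo path only; a Laravel app would more likely use storage_path('app/tiktoken_cache').
$cacheDir = ensureWritableDir(sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'tiktoken_cache_demo');
echo $cacheDir, PHP_EOL;
```

The returned path can then be passed to `$provider->setVocabCache()`, as in the caching pattern earlier on this page.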

Debugging Tips

  1. Check Vocabulary Loading

    $vocab = $provider->getVocab('p50k_base');
    Log::debug('Vocab size', ['size' => $vocab->size()]); // Log context must be an array
    
  2. Enable Verbose Logging

    $provider = new EncoderProvider();
    $provider->setVerbose(true); // Logs cache misses/hits
    
  3. Validate Token IDs

    $tokens = $encoder->encode("Test");
    // PHPUnit has no assertAll(); check each token ID individually
    foreach ($tokens as $token) {
        $this->assertIsInt($token);
        $this->assertGreaterThanOrEqual(0, $token);
    }
    
  4. Benchmark Performance

    $start = microtime(true);
    $tokens = $encoder->encode(str_repeat("a", 1000));
    Log::info('Tokenization time', ['seconds' => microtime(true) - $start]); // Log context must be an array
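
A single timing like the one in the benchmark tip is easily skewed by cache warmup or other load on the machine. A minimal sketch of a steadier measurement takes the median over several runs. `medianMillis` is a hypothetical helper, and the workload shown is a stand-in for a call such as `$encoder->encode($text)`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: runs $task several times and returns the median
 * wall-clock duration in milliseconds.
 */
function medianMillis(callable $task, int $runs = 5): float
{
    if ($runs < 1) {
        throw new InvalidArgumentException('runs must be at least 1');
    }

    $durations = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);                       // Monotonic clock, in nanoseconds
        $task();
        $durations[] = (hrtime(true) - $start) / 1e6; // Convert nanoseconds to milliseconds
    }

    sort($durations);
    $mid = intdiv($runs, 2);

    // Odd number of runs: middle value. Even: mean of the two middle values.
    return $runs % 2 === 1
        ? $durations[$mid]
        : ($durations[$mid - 1] + $durations[$mid]) / 2;
}

// Stand-in workload for illustration.
$ms = medianMillis(static fn () => str_repeat('a', 1000), 5);
printf("median: %.4f ms\n", $ms);
```

Running the task once before measuring keeps the first, vocab-loading call out of the numbers.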
    

Extension Points

  1. Custom Vocabularies

    $customVocab = new \Yethee\Tiktoken\Vocab\Vocab($customTokens);
    $provider->registerVocab('custom', $customVocab);
    $encoder = $provider->get('custom');
    
  2. Override Encoder Provider

    $provider = new class extends EncoderProvider {
        protected function createEncoder(string $vocabName): Encoder {
            return new CustomEncoder($this->getVocab($vocabName));
        }
    };
    
  3. Hook into Vocab Loading

    $provider->setVocabLoader(function ($uri, $checksum) {
        return (new CustomVocabLoader())->load($uri, $checksum); // Parentheses required before PHP 8.4
    });
    
  4. Extend Tokenization Logic

    $tokens = $encoder->encode("Text");
    $processed = array_map(function ($token) {
        return $token * 2; // Example: Custom token transformation
    }, $tokens);
    

Performance Quirks

  • Cache Warmup: First request for a model may be slower due to vocab loading. Fix: Preload in a command:
    $provider = new EncoderProvider();
    foreach ($provider->getSupportedModels() as $model) {
        $
    