Tiktoken Laravel Package

yethee/tiktoken

PHP port of OpenAI tiktoken for fast tokenization. Get encoders by model or encoding name, encode text to token IDs, with default vocab caching and configurable cache dir. Optional experimental FFI lib mode (tiktoken-rs) for better performance on larger texts.

Getting Started

Install via Composer:

composer require yethee/tiktoken

First use case: tokenize a prompt and validate its length before sending it to the OpenAI API:

use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-4'); // Auto-selects correct encoder
$tokens = $encoder->encode("Your user input here...");

if (count($tokens) > 8192) { // Context window of the base gpt-4 model; other variants differ
    throw new \RuntimeException("Prompt too long");
}
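
A length check usually also has to leave room for the model's reply, because the context window covers the prompt and the completion together. A minimal sketch of that budget check follows; `fitsContext` and the numbers passed to it are illustrative assumptions, not part of the package.

```php
<?php
declare(strict_types=1);

/**
 * Returns true when the prompt still leaves room for the reply.
 * All three arguments are token counts.
 */
function fitsContext(int $promptTokens, int $contextWindow, int $reservedForReply): bool
{
    return $promptTokens + $reservedForReply <= $contextWindow;
}

// Hypothetical numbers: an 8192-token window with 500 tokens kept for the reply.
var_dump(fitsContext(7000, 8192, 500)); // bool(true)
var_dump(fitsContext(7800, 8192, 500)); // bool(false)
```

In practice the first argument would be `count($encoder->encode($text))`.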

Key Starting Points:

  1. EncoderProvider - Central class for model-aware encoders
  2. getForModel('model-name') - Get encoder by OpenAI model name
  3. get('encoding-name') - Get encoder by explicit encoding (e.g., 'p50k_base')
  4. encode(string $text) - Core method returning token IDs

Configuration:

  • Cache directory defaults to system temp dir
  • Override via environment: TIKTOKEN_CACHE_DIR=/path/to/cache
  • Or programmatically: $provider->setVocabCache('/custom/path')
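
The two override options can be combined so the environment variable wins and the application supplies its own fallback. A small sketch, assuming a hypothetical helper named `resolveVocabCacheDir`; only the `TIKTOKEN_CACHE_DIR` variable name comes from the notes above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: picks the vocab cache directory.
 * Prefers TIKTOKEN_CACHE_DIR, otherwise uses the caller-supplied fallback.
 */
function resolveVocabCacheDir(string $fallback): string
{
    $fromEnv = getenv('TIKTOKEN_CACHE_DIR');

    // getenv() returns false when the variable is unset; treat '' the same way.
    return ($fromEnv === false || $fromEnv === '') ? $fallback : $fromEnv;
}

putenv('TIKTOKEN_CACHE_DIR=/var/cache/tiktoken');
echo resolveVocabCacheDir(sys_get_temp_dir()), PHP_EOL; // /var/cache/tiktoken
```

In a Laravel app the fallback could be something like `storage_path('app/tiktoken')`, and the result can be handed to `$provider->setVocabCache(...)`.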

Implementation Patterns

Core Workflows

1. Token Validation Middleware

// app/Http/Middleware/ValidatePromptLength.php
use Closure;
use Illuminate\Http\Request;
use Yethee\Tiktoken\EncoderProvider;

public function handle(Request $request, Closure $next)
{
    $encoder = app(EncoderProvider::class)->getForModel('gpt-4');
    // Cast so that a missing prompt reaches encode() as '' rather than null
    $tokens = $encoder->encode((string) $request->input('prompt', ''));

    if (count($tokens) > config('ai.max_tokens')) {
        return response()->json(['error' => 'Prompt too long'], 400);
    }

    return $next($request);
}

2. Token-Aware Prompt Builder

// app/Services/PromptBuilder.php
class PromptBuilder
{
    public function __construct(private EncoderProvider $provider) {}

    public function build(string $userInput, string $model): string
    {
        $encoder = $this->provider->getForModel($model);
        $tokens = $encoder->encode($userInput);

        // Truncate if needed
        if (count($tokens) > 3000) {
            $tokens = array_slice($tokens, 0, 3000);
            $userInput = $encoder->decode($tokens);
        }

        return "User: $userInput\nAssistant:";
    }
}
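
Tokens follow byte boundaries rather than character boundaries, so slicing a token list and decoding it can leave a multi-byte character cut off at the end of the string. The sketch below trims such a partial character; `trimIncompleteUtf8` is a hypothetical helper, not something the package provides.

```php
<?php
declare(strict_types=1);

/**
 * Drops up to three trailing bytes until the string is valid UTF-8.
 * This only repairs a character cut off at the end; input that is invalid
 * elsewhere comes back three bytes shorter and still needs validation.
 */
function trimIncompleteUtf8(string $bytes): string
{
    for ($removed = 0; $removed < 3; $removed++) {
        // preg_match with the /u flag returns 1 only for well-formed UTF-8.
        if (preg_match('//u', $bytes) === 1) {
            return $bytes;
        }
        $bytes = substr($bytes, 0, -1);
    }

    return $bytes;
}

// "café" with the second byte of "é" missing.
echo trimIncompleteUtf8("caf\xC3"), PHP_EOL; // caf
```

Applied to the builder above, the call would wrap the decoded text: `trimIncompleteUtf8($encoder->decode($tokens))`.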

3. Batch Processing with Chunking

// app/Jobs/ProcessDocument.php
public function handle()
{
    $encoder = app(EncoderProvider::class)->getForModel('text-embedding-3-small');
    $text = file_get_contents('large_document.txt');

    // Process in chunks (simplified - see Gotchas for chunking)
    $chunkSize = 1000; // Tokens
    $chunks = array_chunk($encoder->encode($text), $chunkSize);

    foreach ($chunks as $chunk) {
        $this->generateEmbedding($encoder->decode($chunk));
    }
}
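
For embeddings it is common to let neighbouring chunks share a few tokens so that text near a boundary keeps some context. The following is a sketch of that variation on `array_chunk`; the function name and the overlap parameter are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Splits a token list into chunks of at most $size tokens, where each chunk
 * repeats the last $overlap tokens of the previous one.
 *
 * @param  int[] $tokens
 * @return int[][]
 */
function chunkTokensWithOverlap(array $tokens, int $size, int $overlap = 0): array
{
    if ($size < 1 || $overlap < 0 || $overlap >= $size) {
        throw new InvalidArgumentException('Require size >= 1 and 0 <= overlap < size.');
    }

    $chunks = [];
    $step = $size - $overlap;
    $total = count($tokens);

    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = array_slice($tokens, $start, $size);

        // Stop once a chunk has reached the end of the list.
        if ($start + $size >= $total) {
            break;
        }
    }

    return $chunks;
}

print_r(chunkTokensWithOverlap(range(1, 10), 4, 1));
// Chunks: [1,2,3,4], [4,5,6,7], [7,8,9,10]
```

Each chunk can then be decoded and embedded as in the job above. A decoded chunk may begin or end in the middle of a multi-byte character.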

Integration Patterns

Service Container Binding

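// config/app.php
// Verify that the installed package actually ships this service provider class.
// If it does not, skip this registration and bind EncoderProvider as a
// singleton in AppServiceProvider instead.
'providers' => [
    // ...
    Yethee\Tiktoken\TiktokenServiceProvider::class,
],

// Then inject via constructor
public function __construct(private \Yethee\Tiktoken\EncoderProvider $tiktoken) {}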

Model-Specific Encoders

// app/Providers/AppServiceProvider.php
public function register()
{
    // Container bindings belong in register(); singleton() reuses one encoder per process
    $this->app->singleton('gpt4.encoder', function () {
        return app(\Yethee\Tiktoken\EncoderProvider::class)
            ->getForModel('gpt-4');
    });
}

Caching Strategies

// Memoize encoder instances in memory for the lifetime of the process.
// A cache store such as cache()->remember() would have to serialize the encoder
// on every write and read, which is unlikely to help.
private array $encoders = [];

public function getEncoder(string $model): \Yethee\Tiktoken\Encoder
{
    return $this->encoders[$model] ??= app(\Yethee\Tiktoken\EncoderProvider::class)
        ->getForModel($model);
}

Gotchas and Tips

Common Pitfalls

1. Cache Invalidation

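  • Vocabulary files are cached on disk in the vocab cache directory. If you update the package or vocab files:
    • Delete the cache directory (the system temp dir default, or the path set in TIKTOKEN_CACHE_DIR)
    • Note that php artisan cache:clear only flushes Laravel's application cache store and does not normally touch this directory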
  • Tip: Use checksum validation for custom vocab files:
    $vocab = $provider->getVocabLoader()->load('custom.vocab', 'expected_checksum');
    

2. Token Counting Accuracy

  • Always use the model-specific encoder (getForModel()) for accurate counts
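  • Gotcha: Using the wrong encoder (e.g., p50k_base for GPT-4, which uses cl100k_base) gives incorrect token counts. GPT-3.5 Turbo and GPT-4 share cl100k_base, so the mismatch usually appears with older completion models or newer model families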
  • Tip: Add validation:
    $model = 'gpt-4';
    $encoder = $provider->getForModel($model);
    assert($encoder->getEncodingName() === 'cl100k_base', "Wrong encoder for $model");
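
The expected encoding in a check like this can come from a small lookup instead of a hard-coded string. The table below is an illustrative sketch, not the library's own mapping, and `expectedEncodingFor` is a hypothetical helper. The entries follow the model-to-encoding pairs published with OpenAI's tiktoken.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative lookup: maps a model name to the encoding it is expected to use.
 * Exact names are checked first, then dash-terminated prefixes for dated or
 * suffixed variants. Returns null for anything not listed.
 */
function expectedEncodingFor(string $model): ?string
{
    $exact = [
        'gpt-4o'                 => 'o200k_base',
        'gpt-4'                  => 'cl100k_base',
        'gpt-3.5-turbo'          => 'cl100k_base',
        'text-embedding-3-small' => 'cl100k_base',
        'text-davinci-003'       => 'p50k_base',
    ];

    if (isset($exact[$model])) {
        return $exact[$model];
    }

    $prefixes = [
        'gpt-4o-'        => 'o200k_base',
        'gpt-4-'         => 'cl100k_base',
        'gpt-3.5-turbo-' => 'cl100k_base',
    ];

    foreach ($prefixes as $prefix => $encoding) {
        if (str_starts_with($model, $prefix)) {
            return $encoding;
        }
    }

    return null; // Unknown model: let the caller decide how to fail.
}

var_dump(expectedEncodingFor('gpt-4-0613'));  // string(11) "cl100k_base"
var_dump(expectedEncodingFor('gpt-4o-mini')); // string(10) "o200k_base"
```

Returning null for unknown names keeps the check from silently passing for a model the table has never seen.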
    

3. Chunking Limitations

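  • encodeInChunks() is reported as not implemented in the current version (as of 1.0.0); confirm against the installed package, since availability may depend on the version and encoder mode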
  • Workaround: Manual chunking:
    $tokens = $encoder->encode($text);
    $chunkSize = 1000;
    foreach (array_chunk($tokens, $chunkSize) as $chunk) {
        $decoded = $encoder->decode($chunk);
        // Process chunk
    }
    

4. Performance Quirks

  • Small texts: Native encoder is faster than LibEncoder due to FFI overhead
  • Large texts: LibEncoder may help (2–5x speedup) but requires setup
  • Benchmark first: Use composer bench to compare before enabling LibEncoder
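
If a quick comparison inside your own application is more convenient, a tiny timing harness is enough to see which encoder wins for your typical input size. This is a sketch with a hypothetical helper name and a stand-in workload; the `strlen` call only keeps the example runnable without the package.

```php
<?php
declare(strict_types=1);

/**
 * Runs $task $iterations times and returns the mean wall-clock time per call
 * in milliseconds.
 */
function meanMilliseconds(callable $task, int $iterations = 100): float
{
    if ($iterations < 1) {
        throw new InvalidArgumentException('iterations must be at least 1.');
    }

    $start = hrtime(true); // Monotonic clock, in nanoseconds.
    for ($i = 0; $i < $iterations; $i++) {
        $task();
    }

    return (hrtime(true) - $start) / 1e6 / $iterations;
}

// Stand-in workload; in practice wrap $encoder->encode($text) for each encoder.
$text = str_repeat('The quick brown fox jumps over the lazy dog. ', 1000);
printf("%.4f ms per call\n", meanMilliseconds(fn () => strlen($text)));
```

Run it with both short and long inputs, since the notes above suggest the faster encoder depends on text size.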

5. Model Support Gaps

  • Unsupported models: GPT-2, models with special tokens like <|endofprompt|>
  • Tip: Check supported models in src/Vocab/VocabLoader.php

Advanced Tips

1. Custom Vocabularies

// Load custom vocabulary
$vocabLoader = $provider->getVocabLoader();
$customVocab = $vocabLoader->load('path/to/custom.vocab', 'checksum');

// Use with encoder
$encoder = new Yethee\Tiktoken\Encoder\NativeEncoder($customVocab);

2. LibEncoder Setup

// Initialize LibEncoder (do this once at app startup)
Yethee\Tiktoken\Encoder\LibEncoder::init(__DIR__.'/vendor/yethee/tiktoken/libtiktoken_php.so');

// Force lib mode for all encoders
$provider = new Yethee\Tiktoken\EncoderProvider(true);

3. Debugging Tokenization

// See what tokens correspond to which text
$tokens = $encoder->encode("Hello world!");
$tokenTexts = array_map(fn($id) => $encoder->decode([$id]), $tokens);
print_r($tokenTexts);

4. Memory Management

  • Vocab files are loaded into memory on first use
  • Tip: For latency-sensitive environments, preload vocab files during warmup so the first real request does not pay the load cost:
    // In a route or command
    $provider->getForModel('gpt-4'); // Forces vocab load
    

5. Environment-Specific Config

// config/tiktoken.php
return [
    'cache_dir' => env('TIKTOKEN_CACHE_DIR', sys_get_temp_dir()),
    'use_lib' => env('TIKTOKEN_USE_LIB', false),
    'lib_path' => env('TIKTOKEN_LIB_PATH', null),
];

6. Handling Encoding Errors

try {
    $tokens = $encoder->encode($text);
} catch (\Yethee\Tiktoken\Exception\InvalidText $e) {
    // Handle invalid UTF-8 or other encoding errors
    Log::error("Tokenization failed: {$e->getMessage()}");
    throw new \RuntimeException("Invalid input text", 0, $e); // Keep the original exception as the cause
}

Extension Points

1. Custom Encoder Implementation

// Implement Yethee\Tiktoken\Encoder interface
class MyCustomEncoder implements Yethee\Tiktoken\Encoder
{
    public function encode(string $text): array { /* ... */ }
    public function decode(array $tokens): string { /* ... */ }
    public function getEncodingName(): string { /* ... */ }
}

// Register with provider
$provider->registerEncoder('my_encoding', new MyCustomEncoder());

2. Vocab Loader Extensions

// Extend vocab loading logic
$loader = $provider->getVocabLoader();
$loader->addVocabSource('s3', function($uri) {
    return file_get_contents("s3://$uri");
});

3. Token Filtering

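// Drop specific token IDs before processing. ID 0 is an ordinary token
// ("!" in cl100k_base), not padding, so list the IDs to exclude explicitly.
$tokens = $encoder->encode($text);
$excluded = [/* token IDs to drop */];
$filtered = array_values(array_filter($tokens, fn ($id) => !in_array($id, $excluded, true)));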