yethee/tiktoken
PHP port of OpenAI tiktoken for fast tokenization. Get encoders by model or encoding name, encode text to token IDs, with default vocab caching and configurable cache dir. Optional experimental FFI lib mode (tiktoken-rs) for better performance on larger texts.
This is a PHP port of OpenAI's tiktoken.
$ composer require yethee/tiktoken
use Yethee\Tiktoken\EncoderProvider;
$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-3.5-turbo-0301');
$tokens = $encoder->encode('Hello world!');
print_r($tokens);
// OUT: [9906, 1917, 0]
$encoder = $provider->get('p50k_base');
$tokens = $encoder->encode('Hello world!');
print_r($tokens);
// OUT: [15496, 995, 0]
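A common use of the token IDs is checking whether a text fits within a token limit. The following is a minimal sketch, not part of the library: countTokens() and fitsTokenBudget() are hypothetical helpers, and the limit of 4096 in the usage comment is only an illustrative value. Both accept any callable that maps a string to a list of token IDs, so the encoder from the example above can be wrapped and passed in.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of yethee/tiktoken): counts the tokens
 * produced by any callable that maps a string to a list of token IDs.
 *
 * @param callable(string): array<int> $encode
 */
function countTokens(callable $encode, string $text): int
{
    return count($encode($text));
}

/**
 * Hypothetical helper: true when $text encodes to at most $limit tokens.
 *
 * @param callable(string): array<int> $encode
 */
function fitsTokenBudget(callable $encode, string $text, int $limit): bool
{
    return countTokens($encode, $text) <= $limit;
}

// Usage with the $encoder from the example above (assumes the package is
// installed and autoloaded; 4096 is an illustrative limit):
// $ok = fitsTokenBudget(fn (string $t): array => $encoder->encode($t), 'Hello world!', 4096);
```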
The encoder uses external vocabularies, so caching is enabled by default to avoid performance issues.
By default, the cache is stored in the directory for temporary files.
You can override the cache directory via the TIKTOKEN_CACHE_DIR environment variable
or with EncoderProvider::setVocabCache():
use Yethee\Tiktoken\EncoderProvider;
$encProvider = new EncoderProvider();
$encProvider->setVocabCache('/path/to/cache');
// Using the provider
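In application bootstrap code it can be convenient to resolve the cache directory in one place and verify that it is usable. The sketch below is an illustration under assumptions: resolveVocabCacheDir() is a hypothetical helper, and the "tiktoken-cache" fallback subdirectory is chosen for this example, so it is not necessarily the path the library uses by default. The result can then be passed to setVocabCache().

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of yethee/tiktoken): picks a cache
 * directory and makes sure it exists and is writable.
 *
 * Precedence: explicit argument, then the TIKTOKEN_CACHE_DIR environment
 * variable, then a subdirectory of the system temp directory (the
 * subdirectory name is an assumption made for this sketch).
 */
function resolveVocabCacheDir(?string $explicit = null): string
{
    $fromEnv = getenv('TIKTOKEN_CACHE_DIR');

    $dir = $explicit
        ?? ($fromEnv !== false && $fromEnv !== '' ? $fromEnv : null)
        ?? sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'tiktoken-cache';

    // Create the directory if needed; re-check is_dir() to tolerate races.
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException(sprintf('Cannot create cache directory "%s".', $dir));
    }

    if (!is_writable($dir)) {
        throw new RuntimeException(sprintf('Cache directory "%s" is not writable.', $dir));
    }

    return $dir;
}

// Usage (assumes the package is installed and autoloaded):
// $encProvider = new \Yethee\Tiktoken\EncoderProvider();
// $encProvider->setVocabCache(resolveVocabCacheDir());
```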
Experimental
You can use the tiktoken-rs library via an FFI binding. This can improve performance when you need to encode medium or large texts. However, the overhead of data marshalling can lead to poor performance for small texts.
use Yethee\Tiktoken\Encoder\LibEncoder;
use Yethee\Tiktoken\EncoderProvider;
// LibEncoder::init('/path/to/lib');
$encProvider = new EncoderProvider(true); // Force using the lib encoder
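Because lib mode depends on the FFI extension and a compiled library, an application may want to fall back to the default encoder when either is missing. The following is a sketch under assumptions: canUseLibEncoder() is a hypothetical helper that only checks that the FFI extension is loaded and that the given path exists. It does not verify that the library actually loads or that the ffi.enable setting permits FFI in the current context.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of yethee/tiktoken): a cheap pre-flight
 * check before forcing lib mode. It confirms that the FFI extension is
 * loaded and that $libPath points at something on disk.
 */
function canUseLibEncoder(?string $libPath): bool
{
    if (!extension_loaded('ffi')) {
        return false;
    }

    return $libPath !== null && $libPath !== '' && file_exists($libPath);
}

// Usage (assumes the package is installed and autoloaded):
// $libPath = '/path/to/lib';
// if (canUseLibEncoder($libPath)) {
//     \Yethee\Tiktoken\Encoder\LibEncoder::init($libPath);
//     $encProvider = new \Yethee\Tiktoken\EncoderProvider(true);
// } else {
//     $encProvider = new \Yethee\Tiktoken\EncoderProvider();
// }
```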
You need to provide the path to the lib before using the provider. There are several ways to do this:
- Yethee\Tiktoken\Encoder\LibEncoder::init() method.
- Yethee\Tiktoken\Encoder\LibEncoder::preload() method, inside opcache preload script.
- TIKTOKEN_LIB_PATH or LD_LIBRARY_PATH environment variable.

Build the lib from source:
git clone git@github.com:yethee/tiktoken-php.git
cd tiktoken-php
cargo build --release
Copy the binary from target/release:
- libtiktoken_php.so for Linux
- libtiktoken_php.dylib for macOS
- tiktoken_php.dll for Windows

NOTE: You can see .docker/Dockerfile for an example.
You can see the benchmark results in #27 or run the benchmark locally:
composer bench
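For a quick, informal comparison of the two modes on your own texts, a small timing loop may be enough. This is a rough sketch and not the project's benchmark suite: averageEncodeTimeMs() is a hypothetical helper, and the iteration count is arbitrary. It accepts any callable, so the same function can time the default encoder and the lib encoder.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of yethee/tiktoken): returns the average
 * wall-clock time in milliseconds of one $encode($text) call.
 *
 * @param callable(string): array<int> $encode
 */
function averageEncodeTimeMs(callable $encode, string $text, int $iterations = 100): float
{
    if ($iterations < 1) {
        throw new InvalidArgumentException('Iterations must be at least 1.');
    }

    // Warm-up call, so one-time costs (for example loading the vocabulary)
    // are not included in the measurement.
    $encode($text);

    $start = hrtime(true); // nanoseconds, monotonic clock
    for ($i = 0; $i < $iterations; $i++) {
        $encode($text);
    }

    return (hrtime(true) - $start) / $iterations / 1e6;
}

// Usage (assumes the package is installed and $encoder was obtained from
// an EncoderProvider as shown earlier):
// $ms = averageEncodeTimeMs(fn (string $t): array => $encoder->encode($t), $text);
```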
- Yethee\Tiktoken\Encoder\LibEncoder::encodeInChunks() method
- Special tokens (such as <|endofprompt|>) are not supported.