doctrine/lexer
Doctrine Lexer is a lightweight base library for building lexers used in top-down, recursive descent parsers. It powers tokenization in Doctrine projects like Annotations and ORM (DQL), providing a reusable foundation for custom language parsing.
The package provides a base AbstractLexer class for building custom tokenizers for domain-specific languages (DSLs). This is especially common in annotation systems (e.g., Doctrine Annotations), query languages (e.g., DQL), and configuration formats.
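First step: Create a subclass of Doctrine\Common\Lexer\AbstractLexer and implement:
getCatchablePatterns(): array returns the regex fragments whose matches become tokens (e.g., '[a-zA-Z_][a-zA-Z0-9_]*' for identifiers).
getNonCatchablePatterns(): array returns the fragments that are matched but discarded (e.g., '\s+' for whitespace).
getType(string &$value) returns the token type (e.g., an integer constant or enum case).
Then instantiate your lexer, pass the input text to setInput(), and call moveNext() to iterate tokens. After each call the public $lookahead property holds the token that was just loaded and $token holds the one before it; in v2 and later these are Token objects exposing value, type, and position. Example:
use Doctrine\Common\Lexer\AbstractLexer;

class MyLexer extends AbstractLexer
{
    public const TOKEN_WORD = 1;
    public const TOKEN_NUMBER = 2;

    // Fragments whose matches are emitted as tokens
    protected function getCatchablePatterns(): array
    {
        return ['[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*', '-?[0-9]+', '\.'];
    }

    // Fragments that separate tokens but are never emitted
    protected function getNonCatchablePatterns(): array
    {
        return ['\s+'];
    }

    protected function getType(string &$value): int|string|null
    {
        // Digits with an optional leading minus are numbers; everything else is a word
        return preg_match('/^-?[0-9]+$/', $value) === 1 ? self::TOKEN_NUMBER : self::TOKEN_WORD;
    }
}

$lexer = new MyLexer();
$lexer->setInput('SELECT 42 FROM users');
while ($lexer->moveNext()) {
    // lookahead is the token that moveNext() just loaded
    echo $lexer->lookahead->value . ' → ' . $lexer->lookahead->type . "\n";
}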
Check the AbstractLexer source (src/AbstractLexer.php) for available helper methods: peek(), glimpse(), reset(), resetPeek(), isNextToken(), skipUntil(), and the position recorded on each token.
Use the lexer as the token producer feeding into your parser methods. Store current token state in a property for lookahead/rollback logic. Typical workflow:
class MyParser
{
    private MyLexer $lexer;

    public function parse(string $input): void
    {
        $this->lexer = new MyLexer();
        $this->lexer->setInput($input);
        $this->lexer->moveNext(); // load the first token into lookahead
        $this->statement();
    }

    private function statement(): void
    {
        if ($this->lexer->lookahead?->value === 'SELECT') {
            $this->match(MyLexer::TOKEN_WORD); // or your const
            // ... consume more tokens
        }
    }

    private function match(int|string $type): void
    {
        // isNextToken() compares the type of the lookahead token
        if (! $this->lexer->isNextToken($type)) {
            throw new ParseError('Unexpected token');
        }

        $this->lexer->moveNext();
    }
}
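The token/lookahead pairing is easy to get wrong, so here is a minimal sketch of it that runs without the library. TokenStream is a hypothetical stand-in, not part of doctrine/lexer, and it assumes that moveNext() shifts the lookahead into the current token before loading the next one and returns false once nothing is left to look ahead at.

```php
<?php
// Hypothetical stand-in for a lexer; it only mimics the assumed
// token / lookahead bookkeeping and does no real tokenizing.
final class TokenStream
{
    public ?array $token = null;      // token consumed most recently
    public ?array $lookahead = null;  // token about to be consumed
    private int $position = 0;

    public function __construct(private array $tokens)
    {
    }

    public function moveNext(): bool
    {
        // Shift the window by one: lookahead becomes current, then load the next
        $this->token = $this->lookahead;
        $this->lookahead = $this->tokens[$this->position++] ?? null;

        return $this->lookahead !== null;
    }
}

$stream = new TokenStream([
    ['type' => 'WORD', 'value' => 'SELECT'],
    ['type' => 'NUMBER', 'value' => '42'],
]);

// Record the (current, lookahead) pair seen on each iteration
$seen = [];
while ($stream->moveNext()) {
    $seen[] = [$stream->token['value'] ?? null, $stream->lookahead['value']];
}
```

Under that assumption the first iteration has no current token yet, which is why a parser typically calls moveNext() once up front and then inspects the lookahead.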
Leverage PHP enums for robust token definitions (requires v2.0+ and PHP 8.1+):
enum TokenType: string
{
    case IDENTIFIER = 'identifier';
    case INTEGER = 'integer';
    case WHITESPACE = 'whitespace';
}

class MyLexer extends AbstractLexer
{
    // Pattern methods omitted for brevity

    protected function getType(string &$value): ?TokenType
    {
        // ... return an enum case, or null when the value has no specific type
    }
}
This enables strict typing and IDE autocomplete, reducing runtime errors.
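As a rough illustration of an enum-returning classifier, the sketch below runs on its own without the library. SketchTokenType and classify() are hypothetical names chosen for this example; in a real lexer the same match expression could sit inside getType().

```php
<?php
// Hypothetical enum for this sketch; case names are illustrative only.
enum SketchTokenType: string
{
    case Identifier = 'identifier';
    case Integer = 'integer';
}

// Map a raw token string to an enum case, or null when nothing fits.
function classify(string $value): ?SketchTokenType
{
    return match (true) {
        ctype_digit($value) => SketchTokenType::Integer,
        preg_match('/^[a-z_][a-z0-9_]*$/i', $value) === 1 => SketchTokenType::Identifier,
        default => null,
    };
}

$number = classify('42');
$word = classify('users');
$other = classify('+');
```

Because the return type is the enum itself, a typo in a case name fails at compile time instead of silently producing an unknown string.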
Subclass to maintain state (e.g., context-aware lexing):
class AnnotationLexer extends AbstractLexer
{
    private bool $inAnnotation = false;

    // Pattern methods omitted for brevity

    protected function getType(string &$value): int|string|null
    {
        if ($value === '@') {
            $this->inAnnotation = true;
            return '@';
        }

        // A closing parenthesis ends the annotation context
        if ($this->inAnnotation && $value === ')') {
            $this->inAnnotation = false;
        }

        return $this->inAnnotation ? 'ANNOTATION_TOKEN' : 'DEFAULT';
    }
}
Token types must be int|string|null or enum cases; arrays and objects are no longer supported.
Whitespace is not skipped automatically. Your lexer must explicitly ignore it by listing it in getNonCatchablePatterns(): text matched there only separates tokens and is never emitted. Returning null from getType() does not drop a token; it only leaves the token without a type. To skip whitespace:
protected function getCatchablePatterns(): array
{
    return ['[^\s]+']; // Each run of non-whitespace becomes a token
}

protected function getNonCatchablePatterns(): array
{
    return ['\s+']; // Matched but not captured, so whitespace is dropped
}
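To see why a pattern that is matched but not captured disappears from the token list, the following standalone sketch approximates the splitting step with preg_split. sketchScan() is a hypothetical helper, not the library's code; it assumes catchable patterns are wrapped in capture groups while non-catchable ones are left uncaptured, and that the default modifiers are 'iu'.

```php
<?php
// Hypothetical helper: approximates how catchable and non-catchable
// patterns could be combined into one regex. Not the library's own code.
function sketchScan(string $input, array $catchable, array $nonCatchable): array
{
    // Catchable fragments get capture groups; non-catchable ones do not
    $regex = sprintf(
        '/(%s)|%s/iu',
        implode(')|(', $catchable),
        implode('|', $nonCatchable)
    );

    // Keep captured delimiters, drop empty pieces, and record byte offsets
    $flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE;
    $pieces = preg_split($regex, $input, -1, $flags);

    $tokens = [];
    foreach ($pieces as [$value, $offset]) {
        $tokens[] = ['value' => $value, 'position' => $offset];
    }

    return $tokens;
}

$tokens = sketchScan(
    'SELECT 42 FROM users',
    ['[a-z_][a-z0-9_]*', '\d+'],
    ['\s+']
);

foreach ($tokens as $t) {
    echo $t['position'], ': ', $t['value'], "\n";
}
```

The whitespace still takes part in matching, which is what separates SELECT from 42, but since it is never captured it produces no entry in the result.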
To dump the full token list for testing, loop with moveNext() and collect each $lexer->lookahead; AbstractLexer does not expose the internal token array directly.
For custom error context, use the position stored on each token together with getInputUntilPosition() to show the input consumed so far.
Make sure preg_* handling respects UTF-8: the default getModifiers() result is 'iu', so keep the u modifier if you override it, and use \p{L}/\p{N} where needed.
Avoid modifying $value by reference unnecessarily; it is passed by reference so that getType() can normalise the token text (e.g., strip quotes or unescape input), and it should not be altered beyond that.
The lexer's internal state (input, token list, position) is private; initialise it through setInput() and reset() before parsing begins rather than trying to override internals.
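The effect of the u modifier and the Unicode property classes can be checked with plain preg_match_all, independent of the lexer. This is only a sketch: the sample input and variable names are made up for illustration, and the string uses \u{...} escapes so the result does not depend on how the source file is encoded.

```php
<?php
// Sample input "größe = naïve42", written with escapes for portability
$input = "gr\u{00F6}\u{00DF}e = na\u{00EF}ve42";

// With /u, \p{L} matches any Unicode letter and \p{N} any Unicode digit
preg_match_all('/[\p{L}_][\p{L}\p{N}_]*/u', $input, $unicodeMatches);

// An ASCII-only class stops at the first non-ASCII letter
preg_match_all('/[a-zA-Z_][a-zA-Z0-9_]*/', $input, $asciiMatches);

$unicodeIdentifiers = $unicodeMatches[0];
$asciiIdentifiers = $asciiMatches[0];
```

The Unicode pattern yields two identifiers, while the ASCII-only one breaks each word into fragments around the accented letters.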