Getting Started

The Doctrine Lexer package provides a base AbstractLexer class for building custom tokenizers for domain-specific languages (DSLs). It is commonly used in annotation systems (e.g., Doctrine Annotations), query languages (e.g., DQL), and configuration formats.

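First step: Create a subclass of Doctrine\Common\Lexer\AbstractLexer and implement its three abstract methods:

protected function getCatchablePatterns(): array returns the regex fragments whose matches become tokens (e.g., '[a-zA-Z_][a-zA-Z0-9_]*' for identifiers)
protected function getNonCatchablePatterns(): array returns the regex fragments that are matched but discarded (e.g., '\s+' for whitespace)
protected function getType(string &$value) returns the token type for a matched value (e.g., an integer constant or an enum case)

Then instantiate your lexer, pass the input text to setInput(), and call moveNext() to iterate. The public $token and $lookahead properties hold the current and the next token. Example:

use Doctrine\Common\Lexer\AbstractLexer;

class MyLexer extends AbstractLexer
{
    public const TOKEN_WORD = 1;
    public const TOKEN_NUMBER = 2;

    protected function getCatchablePatterns(): array
    {
        // Identifiers, optionally signed integers, and a literal dot
        return ['[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*', '-?[0-9]+', '\.'];
    }

    protected function getNonCatchablePatterns(): array
    {
        return ['\s+']; // Whitespace separates tokens and is dropped
    }

    protected function getType(string &$value): int
    {
        // Optionally signed digit runs are numbers; everything else is a word
        if (preg_match('/^-?[0-9]+$/', $value) === 1) {
            return self::TOKEN_NUMBER;
        }
        return self::TOKEN_WORD;
    }
}

$lexer = new MyLexer();
$lexer->setInput('SELECT 42 FROM users');
$lexer->moveNext(); // Load the first token into $lookahead

while ($lexer->lookahead !== null) {
    $lexer->moveNext(); // $token now holds the token to process
    // Tokens are objects in recent releases; older releases expose arrays
    echo $lexer->token->value . ' → ' . $lexer->token->type . "\n";
}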

Check the AbstractLexer source (src/AbstractLexer.php) for the available helper methods, such as peek(), glimpse(), resetPeek(), reset(), isNextToken(), and skipUntil(). Each token also records its offset in the input (position).
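
Under the hood, the lexer's patterns are combined into one regular expression and applied with preg_split. The sketch below is a simplified, library-free approximation of that scanning step, not the package's actual code. The $catchable and $nonCatchable arrays are hypothetical stand-ins for what a subclass would supply, and the 'iu' flags are hard-coded here as an assumption.

```php
<?php
// Hypothetical stand-ins for the patterns a lexer subclass would supply.
$catchable    = ['[a-z_][a-z0-9_]*', '-?[0-9]+'];
$nonCatchable = ['\s+'];

// Catchable patterns get capture groups (so their matches are kept);
// non-catchable patterns only act as separators and are thrown away.
$regex = sprintf(
    '/(%s)|%s/iu',
    implode(')|(', $catchable),
    implode('|', $nonCatchable)
);

$flags   = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE;
$matches = preg_split($regex, 'SELECT 42 FROM users', -1, $flags);

// Each entry is [matched text, byte offset in the input].
$tokens = [];
foreach ($matches as [$value, $offset]) {
    $tokens[] = ['value' => $value, 'position' => $offset];
}

foreach ($tokens as $token) {
    echo $token['position'] . ': ' . $token['value'] . "\n";
}
```

Text matched by neither kind of pattern is not dropped by preg_split; it comes back as its own piece, which is why many lexers add a catch-all pattern.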

Implementation Patterns

1. Integration with Recursive Descent Parsing

Use the lexer as the token producer feeding into your parser methods. The lexer already exposes the current token ($token) and the next one ($lookahead), so parser methods can decide based on $lookahead and consume with moveNext(). Typical workflow:

class MyParser
{
    private MyLexer $lexer;

    public function parse(string $input): void
    {
        $this->lexer = new MyLexer();
        $this->lexer->setInput($input);
        $this->lexer->moveNext(); // Load the first token into $lookahead
        $this->statement();
    }

    private function statement(): void
    {
        // Inspect the upcoming token without consuming it
        if ($this->lexer->lookahead?->value === 'SELECT') {
            $this->match(MyLexer::TOKEN_WORD); // or your const
            // ... consume more tokens
        }
    }

    private function match(int|string $type): void
    {
        // isNextToken() compares the lookahead token's type
        if (! $this->lexer->isNextToken($type)) {
            throw new \RuntimeException('Unexpected token');
        }
        $this->lexer->moveNext();
    }
}
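
The match-and-advance step can also be tried without the library. This is a minimal sketch under stated assumptions: TokenStream, its lookahead()/consume() methods, and the 'SELECT'/'WORD'/'FROM' type names are all hypothetical, and the tokens are written out by hand instead of coming from a lexer.

```php
<?php
// Hypothetical token source; a real parser would pull tokens from the lexer.
final class TokenStream
{
    private int $position = 0;

    /** @param list<array{type: string, value: string}> $tokens */
    public function __construct(private array $tokens)
    {
    }

    /** Returns the upcoming token without consuming it, or null at the end. */
    public function lookahead(): ?array
    {
        return $this->tokens[$this->position] ?? null;
    }

    /** Checks the upcoming token's type, advances, and returns its value. */
    public function consume(string $type): string
    {
        $token = $this->lookahead();
        if ($token === null || $token['type'] !== $type) {
            throw new RuntimeException("Expected $type at token {$this->position}");
        }
        $this->position++;

        return $token['value'];
    }
}

// Grammar rule: statement := 'SELECT' word 'FROM' word
function parseStatement(TokenStream $stream): array
{
    $stream->consume('SELECT');
    $column = $stream->consume('WORD');
    $stream->consume('FROM');
    $table = $stream->consume('WORD');

    return ['column' => $column, 'table' => $table];
}

$result = parseStatement(new TokenStream([
    ['type' => 'SELECT', 'value' => 'SELECT'],
    ['type' => 'WORD', 'value' => 'id'],
    ['type' => 'FROM', 'value' => 'FROM'],
    ['type' => 'WORD', 'value' => 'users'],
]));
```

Each grammar rule becomes one function, and every function leaves the stream positioned just past what it recognized, which keeps the rules composable.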

2. Type-Safe Token Enums (v2.0+)

Leverage PHP enums for robust token definitions (requires v2.0+):

enum TokenType: string
{
    case IDENTIFIER = 'identifier';
    case INTEGER = 'integer';
    case WHITESPACE = 'whitespace';
}

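class MyLexer extends AbstractLexer
{
    // Pattern methods omitted for brevity

    protected function getType(string &$value): TokenType
    {
        // Map the matched text to an enum case
        return match (true) {
            ctype_digit($value) => TokenType::INTEGER,
            trim($value) === '' => TokenType::WHITESPACE,
            default => TokenType::IDENTIFIER,
        };
    }
}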

This enables strict typing and IDE autocomplete, reducing runtime errors.
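
To see the effect of enum-typed tokens in isolation, here is a small standalone sketch. TokenKind and classify() are hypothetical names used only for illustration; the mapping rules are assumptions, not anything the package prescribes. Enums require PHP 8.1 or later.

```php
<?php
// Hypothetical enum of token kinds (PHP 8.1+).
enum TokenKind: string
{
    case IDENTIFIER = 'identifier';
    case INTEGER = 'integer';
    case UNKNOWN = 'unknown';
}

// Maps a matched string to an enum case; the first true arm wins.
function classify(string $value): TokenKind
{
    return match (true) {
        preg_match('/^-?[0-9]+$/', $value) === 1 => TokenKind::INTEGER,
        preg_match('/^[a-z_][a-z0-9_]*$/i', $value) === 1 => TokenKind::IDENTIFIER,
        default => TokenKind::UNKNOWN,
    };
}

foreach (['users', '42', '@'] as $sample) {
    echo $sample . ' is ' . classify($sample)->value . "\n";
}
```

Because the return type is the enum itself, a typo such as TokenKind::INTEGR fails immediately with an undefined-constant error instead of producing a silently wrong string.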

3. Extensibility via Properties & State

Subclass to maintain state (e.g., context-aware lexing):

class AnnotationLexer extends AbstractLexer
{
    private bool $inAnnotation = false;

    protected function getType(string &$value): int|string|null
    {
        if ($value === '@') {
            $this->inAnnotation = true;
            return '@';
        }
        if ($this->inAnnotation && $value === ')') {
            $this->inAnnotation = false;
        }
        return $this->inAnnotation ? 'ANNOTATION_TOKEN' : 'DEFAULT';
    }
}

Gotchas and Tips

⚠️ Backward Compatibility Notes

v3.0+ drops PHP < 8.1 support and removes legacy BC layers. Ensure runtime compatibility before upgrading from v1.x/v2.x.
In v2.0+, token types must be either int|string|null or enum cases — arrays/objects no longer supported.

⚠️ Whitespace & Skipped Tokens

Whitespace is not skipped automatically. Returning null from getType() does not drop a token; the lexer only discards text matched by getNonCatchablePatterns(), so list whitespace (and anything else to ignore) there. A typical setup:

protected function getCatchablePatterns(): array
{
    return ['[^\s]+']; // Every run of non-whitespace becomes a token
}

protected function getNonCatchablePatterns(): array
{
    return ['\s+']; // Matched but discarded, so whitespace never reaches getType()
}

🛠️ Debugging Tips

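AbstractLexer does not expose its token list publicly; to dump every token in a test, loop with moveNext() and collect $lexer->token.
Use getLiteral($type) to turn a token type back into a readable constant name for error messages, and getInputUntilPosition($position) to show the input consumed so far. Tokens carry an offset into the input (position), not a line/column pair.
For multibyte support (v1.2+), ensure your patterns respect UTF-8: getModifiers() supplies the regex flags (the u modifier is part of the default in current releases), so \p{L}/\p{N} can be used where needed.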
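
When an error message needs a line and column, they can be derived from a token's offset. The helper below is a hypothetical sketch, not part of the package; it assumes the offset is a byte offset (which is what PHP's preg_* functions report) and counts UTF-8 characters for the column.

```php
<?php
/**
 * Hypothetical helper: converts a byte offset into a 1-based [line, column].
 *
 * @return array{int, int}
 */
function lineAndColumn(string $input, int $byteOffset): array
{
    $before    = substr($input, 0, $byteOffset);
    $line      = substr_count($before, "\n") + 1;
    $lastBreak = strrpos($before, "\n");
    $lineStart = $lastBreak === false ? 0 : $lastBreak + 1;

    // Count UTF-8 characters (not bytes) between the line start and the offset.
    $column = preg_match_all('/./su', substr($before, $lineStart)) + 1;

    return [$line, $column];
}

// "\u{00E9}" is a two-byte character in UTF-8.
$input = "SELECT \u{00E9}\nFROM users";
[$line, $column] = lineAndColumn($input, strpos($input, 'users'));

echo "line $line, column $column\n";
```

Counting characters with the u modifier keeps the column stable when identifiers contain multibyte letters; a plain strlen() would overcount them.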

🛠️ Performance Optimization

Cache expensive regex patterns in a static property if reused across many instances.
Avoid modifying $value unnecessarily. getType(string &$value) receives it by reference so the lexer can store a normalized value (e.g., with quotes stripped), not as a performance optimization.
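
A short standalone sketch of what by-reference normalization looks like. classifyAndNormalize() and the 'string'/'word' type names are hypothetical; the quoting rule (single quotes, doubled to escape) is an assumption chosen for illustration.

```php
<?php
// Hypothetical classifier: returns a type name and, for quoted strings,
// rewrites $value in place so the caller sees the unquoted text.
function classifyAndNormalize(string &$value): string
{
    if (strlen($value) >= 2 && $value[0] === "'" && $value[-1] === "'") {
        // Drop the surrounding quotes and collapse doubled quotes.
        $value = str_replace("''", "'", substr($value, 1, -1));

        return 'string';
    }

    return 'word'; // Left untouched
}

$raw  = "'it''s'";
$type = classifyAndNormalize($raw);

echo $type . ': ' . $raw . "\n";
```

Limiting the rewrite to one clearly defined case keeps the stored value predictable for every other token type.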

🔌 Extension Points

Rewrite $value inside getType() (it is passed by reference) to control the string stored on the token, e.g., to strip quotes or unescape input.
Override AbstractLexer::getModifiers() to change the regex flags applied to your patterns (e.g., to make matching case-sensitive).
