benbjurstrom/markdown-object
Intelligent Markdown chunking for LLM/RAG workflows. Preserves headings, lists, tables, and semantic relationships while splitting into token-aware chunks sized for embedding model context windows. Built on League CommonMark with TikToken support.
Start by installing the package and understanding its core workflow: parse Markdown into an AST → build a structured MarkdownObject → chunk it into token-aware segments. First, require the package via Composer, then replicate the basic usage example from the README. Use the included interactive demo repository (markdown-object-demo) to experiment with real inputs and adjust the target/hardCap parameters visually; this is the fastest way to internalize how hierarchical chunking works. Begin with target: 512, hardCap: 1024 and observe how sections split at headings before paragraph-level chopping occurs.
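As a sketch of that workflow, assuming the class, method, and parameter names mentioned in this document (`MarkdownObjectBuilder`, `toMarkdownChunks()`, `target`, `hardCap`); the namespace and exact signatures are assumptions, so treat this as illustrative rather than as the package's definitive API:

```php
<?php

use BenBjurstrom\MarkdownObject\MarkdownObjectBuilder; // namespace assumed

// Parse the Markdown document into a structured, token-counted object.
$markdown = file_get_contents('docs/guide.md');
$object   = (new MarkdownObjectBuilder())->build($markdown);

// Split into token-aware chunks: aim for ~512 tokens, never exceed 1024.
$chunks = $object->toMarkdownChunks(target: 512, hardCap: 1024);

foreach ($chunks as $chunk) {
    // Each chunk carries its heading trail for traceability in RAG pipelines.
    echo implode(' > ', $chunk->breadcrumb) . PHP_EOL;
}
```

Watching how the same document re-chunks as you vary `target` and `hardCap` is the quickest way to see the heading-first splitting behavior described above.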
- Use MarkdownObjectBuilder to convert Markdown into a token-precise object; call toMarkdownChunks() to generate embedding-friendly segments while preserving breadcrumbs (headings) and source positions, which are critical for traceability in Retrieval-Augmented Generation.
- Use $chunk->breadcrumb to route relevant content fragments into LLM prompts (e.g., include only "Architecture" section chunks for system design queries).
- Serialize MarkdownObject instances to JSON to cache chunking results or distribute tasks across workers; deserialize later to skip parsing and tokenizing expensive documents again.
- Set repeatTableHeaders: true when chunking technical docs with tables to preserve context across splits; this is especially important for schema or API reference tables.
- Swap tokenizer models (gpt-4o, claude-3.5-sonnet) dynamically in CI/CD pipelines or user-specific configs to ensure accurate token counts across model families.
- tokenCount includes \n\n separators, so expect 5–12 tokens more than naive sums. Always pass the same tokenizer to both build() and toMarkdownChunks(); mismatched tokenizers cause silent overflows in embeddings.
- hardCap controls hierarchical splits (e.g., skipping whole H2 sections), while target drives paragraph/code/table slicing. If chunks are still too large, reduce hardCap first; it is the gatekeeper for structural integrity.
- $chunk->sourcePosition may be null for very small or synthetic docs; always guard against this before accessing line numbers.
- Ensure the registered CommonMark extensions (e.g., TableExtension) match the input Markdown's features; missing extensions can strip semantics, leading to suboptimal chunks (e.g., tables become plain text).
- Reuse TikTokenizer instances; each holds internal BPE vocabulary and encoding tables. Avoid re-instantiating per document; cache the tokenizer as a singleton service.
- Use json_encode($chunk, JSON_PRETTY_PRINT) to inspect the chunk metadata structure. The breadcrumb array and sourcePosition are indispensable for diagnosing incorrect splits or misrouted chunks.
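A minimal loop combining the routing, null-guard, and debugging tips above. Property names follow those mentioned in this document (breadcrumb, sourcePosition, tokenCount); the startLine accessor on sourcePosition is an assumption about its shape, so verify against the actual chunk objects:

```php
<?php

foreach ($chunks as $chunk) {
    // Route by heading trail: keep only chunks under an "Architecture" section.
    if (!in_array('Architecture', $chunk->breadcrumb ?? [], true)) {
        continue;
    }

    // sourcePosition may be null for very small or synthetic docs; guard first.
    $line = $chunk->sourcePosition?->startLine ?? 'unknown'; // startLine assumed

    // Pretty-print the full metadata structure to diagnose bad splits.
    echo json_encode($chunk, JSON_PRETTY_PRINT), PHP_EOL;
    echo "tokens: {$chunk->tokenCount}, starts at line: {$line}", PHP_EOL;
}
```

Dumping a few chunks this way before wiring them into an embedding pipeline makes misrouted or oversized chunks obvious early.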