benbjurstrom/markdown-object
Intelligent Markdown chunking for LLM/RAG workflows. Preserves headings, lists, tables, and semantic relationships while splitting into token-aware chunks sized for embedding model context windows. Built on League CommonMark with TikToken support.
Start by installing the package and understanding its core workflow: parse Markdown into an AST → build a structured MarkdownObject → chunk it into token-aware segments. First, require the package via Composer, then replicate the basic usage example from the README. Use the included interactive demo repository (markdown-object-demo) to experiment with real inputs and adjust the target/hardCap parameters visually; this is the fastest way to internalize how hierarchical chunking works. Begin with target: 512, hardCap: 1024 and observe how sections split at headings before paragraph-level chopping occurs.
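As a sketch of that workflow, assuming the class, method, and parameter names mentioned in this document (`MarkdownObjectBuilder`, `toMarkdownChunks()`, `target`, `hardCap`); the namespace and exact signatures are assumptions, so treat this as illustrative rather than as the package's definitive API:

```php
<?php

use BenBjurstrom\MarkdownObject\MarkdownObjectBuilder; // namespace assumed

// Parse the Markdown document into a structured, token-counted object.
$markdown = file_get_contents('docs/guide.md');
$object   = (new MarkdownObjectBuilder())->build($markdown);

// Split into token-aware chunks: aim for ~512 tokens, never exceed 1024.
$chunks = $object->toMarkdownChunks(target: 512, hardCap: 1024);

foreach ($chunks as $chunk) {
    // Each chunk carries its heading trail for traceability in RAG pipelines.
    echo implode(' > ', $chunk->breadcrumb) . PHP_EOL;
}
```

Watching how the same document re-chunks as you vary `target` and `hardCap` is the quickest way to see the heading-first splitting behavior described above.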
- Use MarkdownObjectBuilder to convert Markdown into a token-precise object; call toMarkdownChunks() to generate embedding-friendly segments while preserving breadcrumbs (headings) and source positions, which are critical for traceability in Retrieval-Augmented Generation.
- Use $chunk->breadcrumb to route relevant content fragments into LLM prompts (e.g., include only "Architecture" section chunks for system design queries).
- Serialize MarkdownObject instances to JSON to cache chunking results or distribute tasks across workers; deserialize later to skip parsing and tokenizing expensive documents again.
- Set repeatTableHeaders: true when chunking technical docs with tables to preserve context across splits; this is especially important for schema or API reference tables.
- Swap tokenizer models (gpt-4o, claude-3.5-sonnet) dynamically in CI/CD pipelines or user-specific configs to ensure accurate token counts across model families.
- tokenCount includes \n\n separators, so expect 5–12 tokens more than naive sums. Always pass the same tokenizer to both build() and toMarkdownChunks(); mismatched tokenizers cause silent overflows in embeddings.
- hardCap controls hierarchical splits (e.g., skipping whole H2 sections), while target drives paragraph/code/table slicing. If chunks are still too large, reduce hardCap first; it is the gatekeeper for structural integrity.
- $chunk->sourcePosition may be null for very small or synthetic docs; always guard against this before accessing line numbers.
- Ensure the registered CommonMark extensions (e.g., TableExtension) match the input Markdown's features; missing extensions can strip semantics, leading to suboptimal chunks (e.g., tables become plain text).
- Reuse TikTokenizer instances; each holds internal BPE vocabulary and encoding tables. Avoid re-instantiating per document; cache the tokenizer as a singleton service.
- Use json_encode($chunk, JSON_PRETTY_PRINT) to inspect the chunk metadata structure. The breadcrumb array and sourcePosition are indispensable for diagnosing incorrect splits or misrouted chunks.
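A minimal loop combining the routing, null-guard, and debugging tips above. Property names follow those mentioned in this document (breadcrumb, sourcePosition, tokenCount); the startLine accessor on sourcePosition is an assumption about its shape, so verify against the actual chunk objects:

```php
<?php

foreach ($chunks as $chunk) {
    // Route by heading trail: keep only chunks under an "Architecture" section.
    if (!in_array('Architecture', $chunk->breadcrumb ?? [], true)) {
        continue;
    }

    // sourcePosition may be null for very small or synthetic docs; guard first.
    $line = $chunk->sourcePosition?->startLine ?? 'unknown'; // startLine assumed

    // Pretty-print the full metadata structure to diagnose bad splits.
    echo json_encode($chunk, JSON_PRETTY_PRINT), PHP_EOL;
    echo "tokens: {$chunk->tokenCount}, starts at line: {$line}", PHP_EOL;
}
```

Dumping a few chunks this way before wiring them into an embedding pipeline makes misrouted or oversized chunks obvious early.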