Markdown Object Laravel Package

benbjurstrom/markdown-object

Intelligent Markdown chunking for LLM/RAG workflows. Preserves headings, lists, tables, and semantic relationships while splitting into token-aware chunks sized for embedding model context windows. Built on League CommonMark with TikToken support.


Getting Started

Start by installing the package and learning its core workflow: parse Markdown into an AST → build a structured MarkdownObject → chunk it into token-aware segments. Require the package via Composer (composer require benbjurstrom/markdown-object), then reproduce the basic usage example from the README. The interactive demo repository (markdown-object-demo) lets you experiment with real inputs and adjust the target and hardCap parameters visually, which is the fastest way to see how hierarchical chunking behaves. Begin with target: 512, hardCap: 1024 and observe how sections split at headings before falling back to paragraph-level splitting.

Implementation Patterns

  • RAG Preprocessing Pipeline: Use MarkdownObjectBuilder to convert Markdown into a token-counted object, then call toMarkdownChunks() to produce embedding-friendly segments. Each chunk keeps its breadcrumb (the heading trail) and source position, which is critical for traceability in Retrieval-Augmented Generation.
  • Context-Aware Routing: Use chunk metadata (e.g., $chunk->breadcrumb) to decide which content fragments go into an LLM prompt (e.g., include only chunks under the "Architecture" section for system design queries).
  • Persistent Serialization: Serialize full MarkdownObject instances to JSON to cache chunking results or distribute work across workers; deserialize later to avoid re-parsing and re-tokenizing expensive documents.
  • Table Handling in Docs: Set repeatTableHeaders: true when chunking technical docs with tables so that each split keeps its header row. This matters most for schema and API reference tables.
  • Custom Tokenizer Strategy: Bind the tokenizer to a specific LLM model (e.g., gpt-4o) in CI/CD pipelines or per-user configs so token counts match the model in use. TikToken encodings target OpenAI models, so counts for other model families (e.g., claude-3.5-sonnet) are approximations unless you supply a tokenizer for that family.
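The context-aware routing pattern above can be sketched in plain PHP. This is a minimal illustration, not the package's own API: the chunks are hypothetical stand-in objects, and only the breadcrumb property name comes from the text above. The markdown property and the chunksUnderHeading() helper are assumptions made for this example.

```php
<?php
declare(strict_types=1);

/**
 * Keep only the chunks whose breadcrumb trail contains the given heading.
 *
 * @param iterable<object> $chunks Objects exposing a `breadcrumb` array.
 * @return list<object>
 */
function chunksUnderHeading(iterable $chunks, string $heading): array
{
    $matches = [];
    foreach ($chunks as $chunk) {
        // Compare case-insensitively against every heading in the trail.
        foreach ($chunk->breadcrumb ?? [] as $crumb) {
            if (strcasecmp((string) $crumb, $heading) === 0) {
                $matches[] = $chunk;
                break; // one match per chunk is enough
            }
        }
    }
    return $matches;
}

// Hypothetical stand-ins; real chunks would come from toMarkdownChunks().
$chunks = [
    (object) ['breadcrumb' => ['guide.md', 'Architecture', 'Queues'], 'markdown' => 'Workers pull jobs...'],
    (object) ['breadcrumb' => ['guide.md', 'Installation'], 'markdown' => 'Run composer...'],
    (object) ['breadcrumb' => ['guide.md', 'Architecture'], 'markdown' => 'The system has...'],
];

// Only the two chunks nested under "Architecture" are selected for the prompt.
$architecture = chunksUnderHeading($chunks, 'Architecture');
```

Matching on any element of the trail, rather than only the last one, means nested subsections are routed together with their parent section.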

Gotchas and Tips

  • Token Count Discrepancies: A chunk’s tokenCount includes the \n\n separators between blocks, so expect roughly +5–12 tokens over a naive sum of the parts. Always pass the same tokenizer to both build() and toMarkdownChunks(); with mismatched tokenizers, chunks can silently exceed the embedding model's limit.
  • Hard Cap vs. Target Confusion: hardCap governs hierarchical splits (e.g., whether a whole H2 section can stay together), while target drives paragraph/code/table slicing. If chunks are still too large, reduce hardCap first, since it decides when a structural unit must be split.
  • Position Tracking Gotcha: $chunk->sourcePosition may be null for very small or synthetic docs; always guard against this before accessing line numbers.
  • Extension Requirements: Ensure League CommonMark’s extensions (e.g., TableExtension) match the input Markdown’s features; missing extensions can strip semantics, leading to suboptimal chunks (e.g., tables become plain text).
  • Performance Tip: For large batches, reuse a single TikTokenizer instance, since it holds the internal BPE vocabulary and encoding tables. Avoid re-instantiating it per document; register it as a singleton service instead.
  • Debugging: Use json_encode($chunk, JSON_PRETTY_PRINT) to inspect chunk metadata structure. The breadcrumb array and sourcePosition are indispensable for diagnosing incorrect splits or misrouted chunks.
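The null guard and the json_encode inspection from the tips above might look like the following. This is a sketch using hypothetical stand-in objects rather than real chunks: the breadcrumb and sourcePosition names come from the text above, while the startLine field and the describeOrigin() helper are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/** Describe where a chunk came from, tolerating a missing source position. */
function describeOrigin(object $chunk): string
{
    $trail = implode(' > ', $chunk->breadcrumb ?? []);
    $pos   = $chunk->sourcePosition ?? null;

    // Guard first: the position may be null.
    if ($pos === null) {
        return $trail . ' (position unknown)';
    }

    // `startLine` is a hypothetical field name used only in this sketch.
    return sprintf('%s (line %d)', $trail, $pos->startLine);
}

// Hypothetical stand-ins; real chunks would come from toMarkdownChunks().
$withPos = (object) [
    'breadcrumb'     => ['guide.md', 'Setup'],
    'sourcePosition' => (object) ['startLine' => 12],
];
$withoutPos = (object) ['breadcrumb' => ['guide.md'], 'sourcePosition' => null];

echo describeOrigin($withPos), PHP_EOL;    // guide.md > Setup (line 12)
echo describeOrigin($withoutPos), PHP_EOL; // guide.md (position unknown)

// Dump the full structure when diagnosing an unexpected split.
echo json_encode($withPos, JSON_PRETTY_PRINT), PHP_EOL;
```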