Weave Code
Code Weaver
Helps Laravel developers discover, compare, and choose open-source packages. See popularity, security, maintainers, and scores at a glance to make better decisions.
Feedback
Share your thoughts, report bugs, or suggest improvements.
Subject
Message

Markdown Object Laravel Package

benbjurstrom/markdown-object

Intelligent Markdown chunking for LLM/RAG workflows. Preserves headings, lists, tables, and semantic relationships while splitting into token-aware chunks sized for embedding model context windows. Built on League CommonMark with TikToken support.

View on GitHub
Deep Wiki
Context7

Markdown Object

Latest Version on Packagist GitHub Tests Action Status GitHub Code Style Action Status GitHub PHPStan Action Status Total Downloads

Intelligent Markdown chunking that preserves document structure and semantic relationships. Creates token-aware chunks optimized for embedding model context windows. Built on League CommonMark and Yethee\Tiktoken.

Try It Out

Clone the Interactive Demo to experiment with chunking in real-time. Paste your Markdown, adjust parameters, and see how content gets split into semantic chunks.

Basic Usage

use League\CommonMark\Environment\Environment;
use League\CommonMark\Parser\MarkdownParser;
use League\CommonMark\Extension\CommonMark\CommonMarkCoreExtension;
use League\CommonMark\Extension\Table\TableExtension;
use BenBjurstrom\MarkdownObject\Build\MarkdownObjectBuilder;
use BenBjurstrom\MarkdownObject\Tokenizer\TikTokenizer;

// 1) Parse Markdown with CommonMark
$env = new Environment();
$env->addExtension(new CommonMarkCoreExtension());
$env->addExtension(new TableExtension());

$parser   = new MarkdownParser($env);
$filename = 'guide.md';
$markdown = file_get_contents($filename);
$doc      = $parser->parse($markdown);

// 2) Build the structured model
$builder   = new MarkdownObjectBuilder();
$tokenizer = TikTokenizer::forModel('gpt-3.5-turbo');
$mdObj     = $builder->build($doc, $filename, $markdown, $tokenizer);

// 3) Emit hierarchically-packed chunks
$chunks = $mdObj->toMarkdownChunks(target: 512, hardCap: 1024);

foreach ($chunks as $chunk) {
    echo "---\n";
    echo "Chunk: {$chunk->id} | {$chunk->tokenCount} tokens";

    // Source position tracking for finding chunks in original document
    $pos = $chunk->sourcePosition;
    if ($pos->lines !== null) {
        echo " | Line: {$pos->lines->startLine}";
    }
    echo "\n";
    echo implode(' › ', $chunk->breadcrumb) . "\n";
    echo "---\n\n";
    echo $chunk->markdown . "\n\n";
}

/*
---
Chunk: 1 | 163 tokens | Line: 1
demo.md › Getting Started
---

# Getting Started

Welcome to the Markdown Object demo! This tool helps you visualize how markdown is parsed and chunked.

## Features

### Real-time Processing

Type or paste markdown in the left pane and see the results instantly.

### Hierarchical Chunking

Content is automatically organized into semantic chunks that keep related information together…

---
Chunk: 2 | 287 tokens | Line: 18
demo.md › Getting Started › Advanced Options
---

## Advanced Options

Configure chunking parameters to see how different settings affect the output.

### Token Limits

Adjust the target and hard cap values to control chunk sizes…
*/

Installation

You can install the package via composer:

composer require benbjurstrom/markdown-object

Advanced Usage

JSON Serialization

// Serialize to JSON
$json = $mdObj->toJson(JSON_PRETTY_PRINT);

// Deserialize from JSON
$copy = \BenBjurstrom\MarkdownObject\Model\MarkdownObject::fromJson($json);

Custom Tokenizer

use BenBjurstrom\MarkdownObject\Tokenizer\TikTokenizer;

// Use a different model
$tokenizer = TikTokenizer::forModel('gpt-4');

// Or use a specific encoding
$tokenizer = TikTokenizer::forEncoding('p50k_base');

// Pass to both build() and toMarkdownChunks()
$mdObj = $builder->build($doc, $filename, $markdown, $tokenizer);
$chunks = $mdObj->toMarkdownChunks(
    target: 512,
    hardCap: 1024,
    tok: $tokenizer
);

Custom Chunking Parameters

$chunks = $mdObj->toMarkdownChunks(
    target: 256,                // Smaller target for content splitting
    hardCap: 512,               // Smaller hard cap for hierarchy
    tok: $customTokenizer,      // Optional: use different tokenizer
    repeatTableHeaders: false   // Optional: don't repeat headers in split tables
);

A note on Token Counts

Chunk token counts include separator tokens (\n\n) added when joining content pieces, so they may be slightly higher than the sum of individual node tokens. This is expected and ensures the count accurately reflects what will be embedded.

// Build-time: sum of nodes (no separators)
echo $mdObj->tokenCount;  // e.g., 155

// Chunk: includes \n\n separators between elements
echo $chunks[0]->tokenCount;  // e.g., 163 (8 tokens higher)

Chunking Strategy

The package uses hierarchical greedy packing to create semantically coherent chunks that respect your document's natural structure.

Algorithm Overview

The chunker intelligently splits content using a two-threshold system:

  • target - Soft limit for splitting large content blocks (paragraphs, code, tables)
  • hardCap - Hard limit for hierarchical decisions (when to split vs. keep sections together)

How It Works

  1. Start whole – If the entire document fits within hardCap, return as a single chunk
  2. Split hierarchically – When too large, split at the highest heading level (H1, then H2, etc.)
  3. Pack greedily – Combine sibling sections that fit together within hardCap
  4. Recurse deeply – Sections that don't fit are processed recursively with updated breadcrumbs
  5. Minimize fragments – After recursion, continue packing remaining siblings to avoid orphaned content
  6. Split smartly – Long paragraphs, code blocks, and tables break at target boundaries while preserving readability

Testing

Run the tests with:

composer test

Documentation

For detailed architecture documentation, see ARCHITECTURE.md.

For examples of hierarchical packing behavior, see EXAMPLES.md.

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Security Vulnerabilities

Please review our security policy on how to report security vulnerabilities.

Credits

License

The MIT License (MIT). Please see License File for more information.

Weaver

How can I help you explore Laravel packages today?

Conversation history is not saved when not logged in.
Prompt
Add packages to context
No packages found.
davejamesmiller/laravel-breadcrumbs
artisanry/parsedown
christhompsontldr/phpsdk
enqueue/dsn
bunny/bunny
enqueue/test
enqueue/null
enqueue/amqp-tools
milesj/emojibase
bower-asset/punycode
bower-asset/inputmask
bower-asset/jquery
bower-asset/yii2-pjax
laravel/nova
spatie/laravel-mailcoach
spatie/laravel-superseeder
laravel/liferaft
nst/json-test-suite
danielmiessler/sec-lists
jackalope/jackalope-transport