Pdfparser Laravel Package

smalot/pdfparser

Standalone PHP library to parse PDF files and extract content. Reads objects/headers, metadata, and ordered page text; supports compressed PDFs and various encodings. Configure parsing via custom configs. Note: no support for secured PDFs or form data.

## Getting Started

### Minimal Setup
1. **Installation**:
   ```bash
   composer require smalot/pdfparser
   ```

   Ensure your project uses PHP 7.1+ (Laravel 5.5+ compatible).

  1. First Use Case: Parse a PDF file and extract raw text:

    use Smalot\PdfParser\Parser;
    
    $parser = new Parser();
    $pdf = $parser->parseFile(storage_path('app/uploads/document.pdf'));
    $text = $pdf->getText();
    
  2. Key Entry Points:

    • Parser::parseFile(): Parse a local file.
    • Parser::parseContent(): Parse raw PDF content (e.g., from a request).
    • $pdf->getText(): Get concatenated text from all pages.
    • $pdf->getPages(): Access individual pages as Page objects (Smalot\PdfParser\Page).
  3. Where to Look First:


Implementation Patterns

Core Workflows

1. Basic Text Extraction

$parser = new Parser();
$pdf = $parser->parseFile($filePath);

// Get all text (concatenated)
$fullText = $pdf->getText();

// Get text per page
foreach ($pdf->getPages() as $page) {
    $pageText = $page->getText();
    // Process per-page text (e.g., save to DB, search for keywords)
}
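
The per-page loop above is a natural place for keyword search. Below is a minimal sketch, assuming you have already collected each page's text into an array; `findPagesContaining()` is a hypothetical helper, not part of the library.

```php
<?php

/**
 * Return the 1-based page numbers whose text contains $keyword.
 *
 * @param string[] $pageTexts Page texts in document order.
 * @return int[]
 */
function findPagesContaining(array $pageTexts, string $keyword): array
{
    $matches = [];
    if ($keyword === '') {
        return $matches; // an empty keyword would match everything
    }
    foreach (array_values($pageTexts) as $index => $text) {
        // stripos() folds case for ASCII only; swap in mb_stripos() for non-ASCII keywords.
        if (stripos($text, $keyword) !== false) {
            $matches[] = $index + 1;
        }
    }

    return $matches;
}

// Sample strings standing in for the result of $page->getText() on three pages.
$pageTexts = ['Invoice 2024', 'Terms and conditions', 'Total due: see invoice'];
echo implode(', ', findPagesContaining($pageTexts, 'invoice')), PHP_EOL; // prints "1, 3"
```

With a parsed document, the input array could be built by calling getText() on each element of `$pdf->getPages()`.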

2. Metadata Extraction

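// getDetails() returns the document properties; keys follow the PDF names (e.g., 'Author', 'Title')
$metadata = $pdf->getDetails();
$author = $metadata['Author'] ?? 'Unknown';
$title = $metadata['Title'] ?? 'Untitled';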

3. Page-Specific Processing

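foreach ($pdf->getPages() as $index => $page) {
    $text = $page->getText();
    // PDF page sizes are in points (1/72 inch), not pixels. This assumes the page's
    // MediaBox ([llx, lly, urx, ury]) is exposed via getDetails(); it may be absent.
    $mediaBox = $page->getDetails()['MediaBox'] ?? null;

    // Example: log the 1-based page number, text length, and page box
    \Log::info('Page ' . ($index + 1) . ': text length ' . strlen($text), ['media_box' => $mediaBox]);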
}
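
PDF geometry is expressed in points (1 pt = 1/72 inch), so page sizes usually need converting before they mean much in logs or a UI. The sketch below assumes you already hold a page box as four numbers `[llx, lly, urx, ury]`; `boxDimensions()` is a hypothetical helper and does not call the library.

```php
<?php

/**
 * Convert a PDF box [llx, lly, urx, ury] (points) into width/height in points and millimetres.
 */
function boxDimensions(array $box): array
{
    if (count($box) !== 4) {
        throw new InvalidArgumentException('A PDF box needs exactly four numbers.');
    }
    // Values parsed from a PDF may arrive as strings, so cast them first.
    list($llx, $lly, $urx, $ury) = array_map('floatval', array_values($box));
    $widthPt = abs($urx - $llx);
    $heightPt = abs($ury - $lly);
    $mmPerPoint = 25.4 / 72; // 1 pt = 1/72 inch, 1 inch = 25.4 mm

    return [
        'width_pt' => $widthPt,
        'height_pt' => $heightPt,
        'width_mm' => round($widthPt * $mmPerPoint, 1),
        'height_mm' => round($heightPt * $mmPerPoint, 1),
    ];
}

// A4 portrait expressed in points.
$size = boxDimensions(['0', '0', '595.28', '841.89']);
printf("%.1f x %.1f mm\n", $size['width_mm'], $size['height_mm']); // prints "210.0 x 297.0 mm"
```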

4. Handling Large PDFs

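  • Use Parser::parseContent() when the PDF bytes are already in memory (e.g., the raw body of an HTTP request). Note that the whole body is held as a string, so this does not reduce memory use: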

    $parser = new Parser();
    $pdf = $parser->parseContent($request->getContent());
    
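  • Page-by-Page Processing: parseFile() still loads and parses the whole document, but iterating over the pages lets you handle one page's text at a time instead of building one large string: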

    $parser = new Parser();
    $pdf = $parser->parseFile($largeFilePath);
    
    foreach ($pdf->getPages() as $page) {
        $text = $page->getText();
        // Save to DB in batches or stream to a file
    }
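
The "save to DB in batches" comment above can be sketched with a small generator. This is an illustration only: `inBatches()` is a hypothetical helper, and the strings stand in for page objects or page texts.

```php
<?php

/**
 * Yield arrays of at most $size items from any iterable, preserving order.
 */
function inBatches(iterable $items, int $size): Generator
{
    if ($size < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }
    $batch = [];
    foreach ($items as $item) {
        $batch[] = $item;
        if (count($batch) === $size) {
            yield $batch;
            $batch = []; // release the previous batch before collecting the next
        }
    }
    if ($batch !== []) {
        yield $batch; // final, possibly smaller batch
    }
}

// Five stand-in pages grouped two at a time.
foreach (inBatches(['p1', 'p2', 'p3', 'p4', 'p5'], 2) as $number => $batch) {
    echo 'Batch ' . ($number + 1) . ': ' . implode(', ', $batch), PHP_EOL;
}
// prints:
// Batch 1: p1, p2
// Batch 2: p3, p4
// Batch 3: p5
```

Each batch could then go into a single bulk insert. Batching limits how much extracted text is held at once; it does not shrink the memory the parser needs for the document itself.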
    

Integration Tips

Laravel-Specific Patterns

  1. File Upload Handling:

    public function upload(UploadRequest $request) {
        $file = $request->file('pdf');
        $parser = new Parser();
        $pdf = $parser->parseFile($file->path());
    
        // Store metadata in DB (getDetails() keys follow the PDF names, e.g. 'Title')
        $details = $pdf->getDetails();
        Metadata::create([
            'title' => $details['Title'] ?? null,
            'author' => $details['Author'] ?? null,
            'user_id' => auth()->id(),
        ]);
    
        return redirect()->back()->with('success', 'PDF processed!');
    }
    
  2. Queueing Heavy Processing:

    // Inside a queued job class (e.g., ProcessPdfJob) that stores the filename as a property
    public function handle(Parser $parser) {
        $filePath = storage_path('app/uploads/' . $this->filename);
        $pdf = $parser->parseFile($filePath);

        // Process the extracted text (e.g., search, indexing, or analysis)
        $text = $pdf->getText();
    }
    
  3. Service Provider Binding:

    // app/Providers/AppServiceProvider.php
    public function register() {
        $this->app->singleton(Parser::class, function () {
            return new Parser();
        });
    }
    

    Then inject Parser into controllers/services via constructor injection.

Advanced Patterns

  1. Custom Text Processing: Use Page::getText() for a page's full text, or getTextArray() for granular control:

    $textArray = $page->getTextArray(); // Array of text fragments in page order
    $cleanText = implode(' ', array_filter($textArray));
    
  2. Metadata Validation:

    $metadata = $pdf->getDetails();
    $validated = [
        'author' => $metadata['Author'] ?? null,
        'title'  => $metadata['Title']  ?? 'Unnamed Document',
        'created' => $metadata['CreationDate'] ?? now()->format('Y-m-d'),
    ];
    
  3. Error Handling: Wrap parsing in try-catch for malformed PDFs. The library reports most failures as plain \Exception instances, so catch that type:

    try {
        $pdf = $parser->parseFile($filePath);
    } catch (\Exception $e) {
        \Log::error("PDF parsing failed: {$e->getMessage()}");
        return back()->with('error', 'Invalid PDF file.');
    }
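
When parsing happens in several places, the try-catch above can be folded into one helper that returns a result instead of throwing. A minimal sketch follows; `attemptParse()` is a hypothetical helper that takes the parsing step as a callback, so it runs without the library loaded.

```php
<?php

/**
 * Run a parsing callback and report success or failure without throwing.
 *
 * @return array Keys: ok (bool), result (mixed), error (string|null).
 */
function attemptParse(callable $parse): array
{
    try {
        return ['ok' => true, 'result' => $parse(), 'error' => null];
    } catch (Throwable $e) {
        // Throwable also covers PHP Errors (e.g., TypeError) raised deep inside a parser.
        return [
            'ok' => false,
            'result' => null,
            'error' => get_class($e) . ': ' . $e->getMessage(),
        ];
    }
}

// Stand-in callback that fails the way a parser might on a broken file.
$outcome = attemptParse(function () {
    throw new RuntimeException('Unexpected end of file');
});
echo $outcome['ok'] ? 'parsed' : 'failed: ' . $outcome['error'], PHP_EOL;
// prints "failed: RuntimeException: Unexpected end of file"
```

In application code the callback would wrap the real call, for example `function () use ($parser, $filePath) { return $parser->parseFile($filePath); }`.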
    

Gotchas and Tips

Pitfalls

  1. Malformed PDFs:

    • The library throws exceptions (mostly plain \Exception) for corrupted or unsupported files.
    • Fix: Validate files before processing (e.g., check MIME type or file signature).
    • Workaround: Use a try-catch block, or pre-validate the file signature yourself (the Parser class has no built-in validity check):
      if (strpos((string) file_get_contents($filePath, false, null, 0, 1024), '%PDF-') === false) {
          throw new \InvalidArgumentException('Invalid PDF file.');
      }
      
  2. Encrypted PDFs:

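    • Secured (password-protected) PDFs are not supported, and parsing one throws an exception. As of v2.8.0 the encryption flag can be ignored via Config::setIgnoreEncryption(true), which only helps when a file is marked as encrypted but its content is actually readable.
    • Tip: If you need real decryption, decrypt the file first with a dedicated tool such as qpdf, then parse the result.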
  3. Memory Issues:

    • Large PDFs (e.g., >100MB) may hit memory limits.
    • Fix: Process pages incrementally or increase memory_limit in php.ini.
  4. Text Extraction Quirks:

    • Forms/Images: By default, forms and images are excluded from text extraction (since v2.12.1).
    • Formatting: Extracted text contains line breaks and tabs (\n, \t) that reflect the PDF layout. Use getTextArray() to get the text as separate fragments.
  5. Unicode/Encoding:

    • Some PDFs use non-UTF-8 encodings (e.g., Mac Roman). The library handles this automatically, but results may vary.
    • Tip: Log extracted text to debug encoding issues:
      \Log::debug('Extracted text: ' . bin2hex($text));
      
  6. Zero-Length Streams:

    • Rarely, PDFs may contain empty streams, causing parsing to stall.
    • Fix: Upgrade to v2.12.3+ (includes a DoS fix for malformed PDFs).
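
The file-signature check suggested under "Malformed PDFs" can be done without loading the whole file. Here is a minimal sketch, assuming a `%PDF-` marker within the first 1024 bytes is good enough as a cheap pre-filter; `looksLikePdf()` is a hypothetical helper, and passing this check does not guarantee the file will parse.

```php
<?php

/**
 * Cheap pre-check: does the file carry a "%PDF-" marker near its start?
 */
function looksLikePdf(string $path, int $scanBytes = 1024): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    // Read only the head of the file; large uploads are never fully loaded here.
    $head = fread($handle, $scanBytes);
    fclose($handle);

    return $head !== false && strpos($head, '%PDF-') !== false;
}

// Demonstration with two temporary files holding made-up content.
$good = tempnam(sys_get_temp_dir(), 'pdf');
$bad = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($good, "%PDF-1.7\n% rest of file");
file_put_contents($bad, '<html>not a pdf</html>');
var_dump(looksLikePdf($good)); // bool(true)
var_dump(looksLikePdf($bad));  // bool(false)
unlink($good);
unlink($bad);
```

Running this before queueing a job lets obviously wrong uploads be rejected early, with the try-catch around parsing kept as the real safety net.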

Debugging Tips

  1. Log Document Details: Dump the parsed document's metadata to debug:

    \Log::debug('PDF details', $pdf->getDetails()); // Logs properties such as Title, Author, Pages
    
  2. Inspect Pages:

    foreach ($pdf->getPages() as $index => $page) {
        \Log::debug('Page ' . ($index + 1), [
            'text_preview' => mb_substr($page->getText(), 0, 200),
            // Assumes the MediaBox ([llx, lly, urx, ury], in points) is exposed via getDetails()
            'media_box' => $page->getDetails()['MediaBox'] ?? null,
        ]);
    }
    
  3. Check for Hidden Characters: Use trim() or preg_replace() to clean text:

    $cleanText = preg_replace('/[\x00-\x1F\x7F]/u', '', $text);
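
Two details of the pattern above are worth knowing: it also removes tabs and newlines (which can glue words from adjacent lines together), and with the `u` modifier preg_replace() returns null when the input is not valid UTF-8. The sketch below is one possible alternative that keeps line structure; `cleanExtractedText()` is a hypothetical helper.

```php
<?php

/**
 * Strip control characters but keep tabs and newlines, then tidy whitespace.
 */
function cleanExtractedText(string $text): string
{
    // Normalise Windows and old Mac line endings to "\n" first.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Remove C0 control characters except \t (0x09) and \n (0x0A), plus DEL (0x7F).
    // No "u" modifier: these bytes never occur inside multibyte UTF-8 sequences.
    $text = preg_replace('/[\x00-\x08\x0B-\x1F\x7F]/', '', $text);
    // Collapse runs of spaces and tabs into one space, then trim around line breaks.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    $text = preg_replace('/ ?\n ?/', "\n", $text);

    return trim($text);
}

echo cleanExtractedText("Total:\x00  42\t EUR \r\nNext\x07 line "), PHP_EOL;
// prints:
// Total: 42 EUR
// Next line
```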
    

Extension Points

  1. Custom Configurations: Override default parsing behavior by passing a Config object to the parser:

    use Smalot\PdfParser\Config;

    $config = new Config();
    $config->setRetainImageContent(false); // Discard raw image data to save memory
    $config->setIgnoreEncryption(true);    // Try to read files that are only flagged as encrypted

    $parser = new Parser([], $config);
    $pdf = $parser->parseFile($filePath);
    

    See CustomConfig.md for all options.

  2. Extending Page: Create a subclass to add custom methods:

    class EnhancedPdfPage extends \Smalot\PdfParser\Page {
        public function getWordCount() {
            return str_word_count($this->getText());
        }
    }
    

    Then replace the parser’s page objects (advanced
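
An alternative to subclassing is to wrap the page objects the parser already returns, which avoids touching how the parser builds them. This is a minimal sketch: `PageTextStats` is a hypothetical class that accepts any object exposing getText(), so the demonstration uses a stand-in object instead of a real page.

```php
<?php

/**
 * Wrapper adding text statistics to any object that exposes getText().
 * Hypothetical helper; not part of smalot/pdfparser.
 */
final class PageTextStats
{
    private $page;

    public function __construct($page)
    {
        if (!is_object($page) || !method_exists($page, 'getText')) {
            throw new InvalidArgumentException('Expected an object with a getText() method.');
        }
        $this->page = $page;
    }

    public function getText(): string
    {
        return (string) $this->page->getText();
    }

    public function getWordCount(): int
    {
        // Split on whitespace so numbers and non-ASCII words are counted too,
        // which str_word_count() would skip or split.
        $words = preg_split('/\s+/', $this->getText(), -1, PREG_SPLIT_NO_EMPTY);

        return $words === false ? 0 : count($words);
    }
}

// Stand-in page object for demonstration.
$fakePage = new class {
    public function getText(): string
    {
        return "Hello  PDF world\n2024";
    }
};
echo (new PageTextStats($fakePage))->getWordCount(), PHP_EOL; // prints "4"
```

With a parsed document, each element of `$pdf->getPages()` could be passed to the constructor in the same way.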
