
Pdfparser Laravel Package

smalot/pdfparser

Standalone PHP PDF parsing library to extract text, pages, and metadata from PDFs. Supports compressed PDFs and various encodings, with configurable parsing options. Note: secured PDFs and form data extraction are not supported.


Getting Started

Minimal Steps

  1. Installation:

    composer require smalot/pdfparser
    

    Ensure PHP 7.1+ is used.

  2. Basic Parsing:

    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile('path/to/document.pdf');
    $text = $pdf->getText();
    
  3. First Use Case: Extract text from a PDF file and store it in a database or process it for further use (e.g., search, analysis).
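
Extracted text often carries uneven whitespace, so a clean-up pass before storing or indexing it can help. The following is a minimal sketch, not part of smalot/pdfparser: normalizeExtractedText() is a hypothetical helper, and the sample string stands in for the result of $pdf->getText().

```php
<?php

/**
 * Hypothetical helper (not part of smalot/pdfparser): tidy extracted text
 * before storing or indexing it.
 */
function normalizeExtractedText(string $text): string
{
    // Unify line endings so later steps only deal with "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Drop a single space left directly before or after a line break.
    $text = preg_replace('/ ?\n ?/', "\n", $text);
    // Allow at most one blank line in a row.
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}

// In practice $raw would come from $pdf->getText().
$raw = "Invoice   2024\r\n\r\n\r\n\r\nTotal:\t 42 EUR  ";
echo normalizeExtractedText($raw), "\n";
```

Whether to collapse tabs depends on the use case: if setHorizontalOffset("\t") is used to preserve column structure, the tab handling above would need to be relaxed.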


Implementation Patterns

Common Workflows

  1. Text Extraction:

    // Extract text from all pages
    $text = $pdf->getText();
    
    // Extract text from the first N pages only (here, the first 5)
    $text = $pdf->getText(5);
    
    // Extract text from a single page (here, the first one; the array is zero-indexed)
    $text = $pdf->getPages()[0]->getText();
    
  2. Metadata Extraction:

    $metadata = $pdf->getDetails();
    // Example: $metadata['Author'], $metadata['Title']
    
  3. Page-Specific Operations:

    foreach ($pdf->getPages() as $page) {
        $pageText = $page->getText();
        $pageDetails = $page->getDetails();
    }
    
  4. Base64-Encoded PDFs:

    $pdf = $parser->parseContent(base64_decode($base64PdfString));
    
  5. Text Positioning and Font Info:

    $config = new \Smalot\PdfParser\Config();
    $config->setDataTmFontInfoHasToBeIncluded(true);
    $parser = new \Smalot\PdfParser\Parser([], $config);
    $pdf = $parser->parseFile('document.pdf');
    $data = $pdf->getPages()[0]->getDataTm();
    
  6. Memory Management:

    $config = new \Smalot\PdfParser\Config();
    $config->setRetainImageContent(false);
    $config->setDecodeMemoryLimit(1000000); // 1MB
    $parser = new \Smalot\PdfParser\Parser([], $config);
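
The positioning data from getDataTm() is easier to work with once it is sorted into reading order. The sketch below assumes each item has the shape [0] => six-value text matrix whose last two values are the x and y coordinates, and [1] => the text; sortByReadingOrder() is a hypothetical helper and the sample array stands in for real parser output.

```php
<?php

/**
 * Hypothetical helper: sort getDataTm()-style items into reading order
 * (top to bottom, then left to right).
 *
 * Assumed item shape: [0] => six-value text matrix whose last two values
 * are the x and y coordinates, [1] => the text.
 */
function sortByReadingOrder(array $items): array
{
    usort($items, function (array $a, array $b): int {
        $ax = (float) $a[0][4];
        $ay = (float) $a[0][5];
        $bx = (float) $b[0][4];
        $by = (float) $b[0][5];

        // PDF y coordinates grow upwards, so a larger y sorts first.
        if ($ay !== $by) {
            return $by <=> $ay;
        }

        // Same baseline: smaller x (further left) sorts first.
        return $ax <=> $bx;
    });

    return $items;
}

// Sample data standing in for $pdf->getPages()[0]->getDataTm().
$items = [
    [[1, 0, 0, 1, 300.0, 700.0], 'right'],
    [[1, 0, 0, 1, 72.0, 650.0], 'below'],
    [[1, 0, 0, 1, 72.0, 700.0], 'left'],
];

foreach (sortByReadingOrder($items) as $item) {
    echo $item[1], "\n";
}
```

Text that visually shares a line can have slightly different y values, so real documents may need a tolerance when comparing baselines.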
    

Integration Tips

  • Laravel Service Provider: Register the parser as a singleton in AppServiceProvider for easy access:

    public function register()
    {
        $this->app->singleton(\Smalot\PdfParser\Parser::class, function ($app) {
            return new \Smalot\PdfParser\Parser();
        });
    }
    
  • Batch Processing with Queued Jobs: Dispatch one queued job per PDF so that multiple files are processed asynchronously:

    $paths = Storage::disk('local')->files('pdfs/');
    foreach ($paths as $path) {
        ParsePdfJob::dispatch($path);
    }
    
  • Event Listeners: Trigger events after parsing (e.g., PdfParsed event) to process extracted data:

    class PdfParsedListener
    {
        public function handle(PdfParsed $event)
        {
            // Process $event->text or $event->metadata
        }
    }
    
  • Queue Jobs for Large Files: Offload parsing of large PDFs to a queue:

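    use Illuminate\Bus\Queueable;
    use Illuminate\Contracts\Queue\ShouldQueue;
    use Illuminate\Foundation\Bus\Dispatchable;
    use Illuminate\Queue\InteractsWithQueue;

    class ParsePdfJob implements ShouldQueue
    {
        use Dispatchable, InteractsWithQueue, Queueable;
        protected $pdfPath; // Path relative to storage/app

        public function __construct(string $pdfPath)
        {
            $this->pdfPath = $pdfPath;
        }

        public function handle()
        {
            $parser = app(\Smalot\PdfParser\Parser::class);
            $pdf = $parser->parseFile(storage_path('app/' . $this->pdfPath));
            // Process $pdf
        }
    }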
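
A storage directory can contain files other than PDFs, so it may be worth filtering the listing before dispatching jobs. This is a minimal sketch: onlyPdfPaths() is a hypothetical helper, and the sample array stands in for the result of Storage::disk('local')->files('pdfs/'). It checks the file extension only, not the file contents.

```php
<?php

/**
 * Hypothetical helper: keep only paths whose extension is "pdf"
 * (case-insensitive). The check is by name, not by file contents.
 */
function onlyPdfPaths(array $paths): array
{
    $pdfs = array_filter($paths, function (string $path): bool {
        return strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'pdf';
    });

    // Re-index so the result is a plain list.
    return array_values($pdfs);
}

// Sample listing standing in for Storage::disk('local')->files('pdfs/').
$files = ['pdfs/report.pdf', 'pdfs/notes.txt', 'pdfs/SCAN.PDF', 'pdfs/.gitignore'];
print_r(onlyPdfPaths($files));
```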
    

Gotchas and Tips

Pitfalls

  1. Encrypted PDFs:

    • The library cannot decrypt secured PDFs; parsing one throws an exception. setIgnoreEncryption(true) only skips the encryption flag, so use it solely for files that are marked as encrypted but whose content is actually readable.
    • Error: Exception: Secured pdf file are currently not supported.
  2. Memory Exhaustion:

    • Large PDFs or those with high-resolution images may cause memory issues. Use setRetainImageContent(false) and setDecodeMemoryLimit() to mitigate this.
    • Example error: Allowed memory size of X bytes exhausted.
  3. Text Formatting Issues:

    • Excessive spaces or broken words may occur. Adjust setFontSpaceLimit() and setHorizontalOffset():
      $config->setFontSpaceLimit(-60); // Reduce spaces
      $config->setHorizontalOffset("\t"); // Preserve structure
      
  4. XMP Metadata:

    • XMP metadata may return nested arrays, and a given key may be absent. Check for both before using a value:
      if (isset($metadata['dc:creator']) && is_array($metadata['dc:creator'])) {
          // Handle array of creators
      }
      
  5. Page Details:

    • getDetails() may not always include MediaBox. Fall back to header details if missing:
      if (!isset($details['MediaBox'])) {
          $pages = $pdf->getObjectsByType('Pages');
          $details = reset($pages)->getHeader()->getDetails();
      }
      
  6. Font Width Calculation:

    • Missing character widths in calculateTextWidth() will populate the $missing array. Handle missing characters gracefully:
      $width = $font->calculateTextWidth('Some text', $missing);
      if (!empty($missing)) {
          // Log or handle missing characters
      }
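
For the nested metadata case described above, one option is to flatten a field into a list of strings regardless of how it is stored. This is a sketch under assumptions: metadataValues() is a hypothetical helper, the sample array stands in for $pdf->getDetails(), and all leaf values are assumed to be scalars.

```php
<?php

/**
 * Hypothetical helper: return a metadata field as a flat list of strings,
 * whether the stored value is missing, a scalar, or a nested array.
 * Leaf values are assumed to be scalars.
 */
function metadataValues(array $metadata, string $key): array
{
    if (!isset($metadata[$key])) {
        return [];
    }

    // (array) wraps a scalar in a one-element array; arrays pass through.
    $source = (array) $metadata[$key];

    $values = [];
    array_walk_recursive($source, function ($value) use (&$values) {
        $values[] = (string) $value;
    });

    return $values;
}

// Sample data standing in for $pdf->getDetails().
$metadata = [
    'Title' => 'Annual Report',
    'dc:creator' => ['Alice', ['Bob', 'Carol']],
];

print_r(metadataValues($metadata, 'dc:creator'));
```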
      

Debugging Tips

  1. Log Parsing Errors:

    • Wrap parsing in a try-catch block to log errors:
      try {
          $pdf = $parser->parseFile($path);
      } catch (\Exception $e) {
          Log::error("PDF Parsing Error: " . $e->getMessage());
      }
      
  2. Inspect PDF Structure:

    • Use getObjectsByType('Pages') to debug page structures:
      $pages = $pdf->getObjectsByType('Pages');
      foreach ($pages as $page) {
          $header = $page->getHeader();
          // Inspect $header details
      }
      
  3. Verify Configurations:

    • Double-check config settings before parsing:
      $config = new \Smalot\PdfParser\Config();
      $config->setFontSpaceLimit(-60);
      $config->setHorizontalOffset("\t");
      $parser = new \Smalot\PdfParser\Parser([], $config);
      
  4. Test with Sample PDFs:

    • Use publicly available PDFs (e.g., from PDF Association) to test edge cases like:
      • Complex layouts
      • Files flagged as encrypted but still readable (with setIgnoreEncryption)
      • Large files (with memory limits)
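
When testing against a folder of sample PDFs, it can be useful to collect failures instead of stopping at the first bad file. The sketch below is one way to do that: parseAll() is a hypothetical helper, and the closure is a stand-in for a real call such as $parser->parseFile($path)->getText().

```php
<?php

/**
 * Hypothetical helper: run a parsing callback for each path and collect
 * failures instead of stopping at the first bad file.
 *
 * Returns [results keyed by path, error messages keyed by path].
 */
function parseAll(array $paths, callable $parse): array
{
    $results = [];
    $errors = [];

    foreach ($paths as $path) {
        try {
            $results[$path] = $parse($path);
        } catch (\Exception $e) {
            // Record the message so the caller can log it afterwards.
            $errors[$path] = $e->getMessage();
        }
    }

    return [$results, $errors];
}

// Stand-in for a real parser call; it fails for one sample path.
$fakeParse = function (string $path): string {
    if ($path === 'secured.pdf') {
        throw new \Exception('Secured pdf file are currently not supported.');
    }

    return "text of $path";
};

[$results, $errors] = parseAll(['a.pdf', 'secured.pdf'], $fakeParse);
print_r($errors);
```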

Extension Points

  1. Custom Configurations:

    • Extend the Config class to add new parsing behaviors:
      class CustomConfig extends \Smalot\PdfParser\Config
      {
          public function setCustomOption($value)
          {
              // Add custom logic
          }
      }
      
  2. Post-Processing Hooks:

    • Create a decorator for the Pdf class to add post-processing logic:
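      class EnhancedPdf
      {
          protected $pdf;
      
          public function __construct(\Smalot\PdfParser\Pdf $pdf)
          {
              $this->pdf = $pdf;
          }
      
          public function getProcessedText()
          {
              $text = $this->pdf->getText();
              return $this->cleanText($text);
          }
      
          // Example cleaning step: collapse all whitespace runs into single spaces
          protected function cleanText($text)
          {
              return trim(preg_replace('/\s+/', ' ', $text));
          }
      }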
      
  3. Event-Driven Extensions:

    • The library emits no events of its own. Dispatch your own events (e.g., PageParsed, TextExtracted) from application code after each step, using Laravel's event system:
      event(new TextExtracted($text, $pageNumber));
      
  4. Integration with OCR:

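    • Fall back to an OCR tool (e.g., Tesseract) for scanned PDFs. The library has no built-in scan detection, so a simple heuristic is to check whether any text was extracted:
      $text = $pdf->getText();
      if (trim($text) === '') {
          // No text layer found; probably a scanned document
          $text = $this->ocrService->extractText($pdfPath);
      }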
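
Some scanned PDFs still contain a few stray characters, so a per-page threshold can be a more forgiving test than an empty-string check. This is a rough sketch: looksScanned() is a hypothetical helper, and the default of 20 characters per page is an arbitrary starting point rather than a value taken from the library.

```php
<?php

/**
 * Hypothetical heuristic: treat a PDF as scanned when it yields very little
 * text per page. The default threshold is an arbitrary starting point.
 */
function looksScanned(string $text, int $pageCount, int $minCharsPerPage = 20): bool
{
    if ($pageCount < 1) {
        return false; // No pages, so nothing to OCR.
    }

    // Count non-whitespace bytes only (multibyte characters count more than once).
    $chars = strlen(preg_replace('/\s+/', '', $text));

    return ($chars / $pageCount) < $minCharsPerPage;
}

// In practice: looksScanned($pdf->getText(), count($pdf->getPages())).
var_dump(looksScanned("  \n ", 3));
var_dump(looksScanned(str_repeat('word ', 100), 2));
```

The threshold is worth tuning against real documents, since cover pages and image-heavy reports can legitimately contain very little text.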
      