smalot/pdfparser
Standalone PHP PDF parsing library to extract text, pages, and metadata from PDFs. Supports compressed PDFs and various encodings, with configurable parsing options. Note: secured PDFs and form data extraction are not supported.
Installation:
composer require smalot/pdfparser
Ensure PHP 7.1+ is used.
Basic Parsing:
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('path/to/document.pdf');
$text = $pdf->getText();
First Use Case: Extract text from a PDF file and store it in a database or process it for further use (e.g., search, analysis).
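Before storing or indexing the extracted text, a light cleanup pass is often useful. The following is a minimal sketch, assuming a hypothetical helper named normalizeExtractedText() that is not part of smalot/pdfparser; it works on any string returned by getText().

```php
/**
 * Normalize text returned by getText() before storing or indexing it.
 * Hypothetical helper for illustration; not part of smalot/pdfparser.
 */
function normalizeExtractedText(string $text): string
{
    // Unify Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;
    // Drop a single space left on either side of a newline.
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;
    // Allow at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    return trim($text);
}
```

Typical use would be normalizeExtractedText($pdf->getText()) right before the database write or search indexing step.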
Text Extraction:
// Extract text from all pages
$text = $pdf->getText();
// Extract text from the first N pages only (here, the first 5)
$text = $pdf->getText(5);
// Extract text from a specific page
$text = $pdf->getPages()[0]->getText();
Metadata Extraction:
$metadata = $pdf->getDetails();
// Example: $metadata['Author'], $metadata['Title']
Page-Specific Operations:
foreach ($pdf->getPages() as $page) {
    $pageText = $page->getText();
    $pageDetails = $page->getDetails();
}
Base64-Encoded PDFs:
$pdf = $parser->parseContent(base64_decode($base64PdfString));
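Since base64_decode() in its default mode silently discards invalid characters, it can help to validate the input before handing it to parseContent(). A sketch under stated assumptions: decodePdfBase64() is a hypothetical helper name, and the "%PDF-" check relies on the standard PDF file header.

```php
/**
 * Decode a base64 string and verify that the result looks like a PDF.
 * Hypothetical helper for illustration; not part of smalot/pdfparser.
 */
function decodePdfBase64(string $base64): string
{
    // Strict mode returns false when the input contains characters
    // outside the base64 alphabet.
    $binary = base64_decode($base64, true);
    if ($binary === false) {
        throw new InvalidArgumentException('Input is not valid base64.');
    }

    // PDF files begin with the "%PDF-" marker.
    if (strncmp($binary, '%PDF-', 5) !== 0) {
        throw new InvalidArgumentException('Decoded data does not look like a PDF.');
    }

    return $binary;
}
```

The result can then be passed on as $parser->parseContent(decodePdfBase64($base64PdfString)).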
Text Positioning and Font Info:
$config = new \Smalot\PdfParser\Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');
$data = $pdf->getPages()[0]->getDataTm();
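Each entry returned by getDataTm() pairs a transformation matrix with a text fragment, where matrix indexes 4 and 5 hold the X and Y position. One possible use is rebuilding visual lines from those positions. The sketch below assumes that item shape; groupDataTmIntoLines() and its $tolerance parameter are hypothetical names chosen for illustration.

```php
/**
 * Group getDataTm() items into lines of text, top to bottom, left to right.
 * Assumes each item is [matrix, text, ...] with matrix[4] = X and matrix[5] = Y.
 * Hypothetical helper for illustration; not part of smalot/pdfparser.
 */
function groupDataTmIntoLines(array $items, float $tolerance = 2.0): array
{
    // Sort top to bottom: PDF coordinates grow upward, so larger Y comes first.
    usort($items, function (array $a, array $b): int {
        return (float) $b[0][5] <=> (float) $a[0][5];
    });

    // Bucket items whose Y is within $tolerance of the line's first item.
    $lines = [];
    $lineY = null;
    foreach ($items as $item) {
        $y = (float) $item[0][5];
        if ($lineY === null || abs($lineY - $y) > $tolerance) {
            $lines[] = [];
            $lineY = $y;
        }
        $lines[count($lines) - 1][] = $item;
    }

    // Within each line, order fragments left to right and join their text.
    return array_map(function (array $line): string {
        usort($line, function (array $a, array $b): int {
            return (float) $a[0][4] <=> (float) $b[0][4];
        });
        return implode(' ', array_column($line, 1));
    }, $lines);
}
```

The tolerance value is a guess that depends on font size and layout; documents with tight line spacing may need a smaller one.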
Memory Management:
$config = new \Smalot\PdfParser\Config();
$config->setRetainImageContent(false);
$config->setDecodeMemoryLimit(1000000); // 1MB
$parser = new \Smalot\PdfParser\Parser([], $config);
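Rather than hard-coding the decode limit, it could be derived from PHP's memory_limit setting. This is only a sketch: iniSizeToBytes() and suggestDecodeLimit() are hypothetical helpers, and the 25% fraction and 1,000,000-byte fallback are arbitrary illustrative defaults, not recommendations from the library.

```php
/**
 * Convert a php.ini shorthand size such as "128M" to bytes.
 * Returns -1 for "-1" (no limit). Hypothetical helper for illustration.
 */
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '-1') {
        return -1;
    }

    $number = (int) $value; // the leading digits, e.g. 128 for "128M"
    switch (strtolower(substr($value, -1))) {
        case 'g':
            return $number * 1024 ** 3;
        case 'm':
            return $number * 1024 ** 2;
        case 'k':
            return $number * 1024;
        default:
            return $number;
    }
}

/**
 * Suggest a decode limit as a fraction of memory_limit.
 * Falls back to $fallback when memory is unlimited or unparsable.
 */
function suggestDecodeLimit(string $memoryLimit, float $fraction = 0.25, int $fallback = 1000000): int
{
    $bytes = iniSizeToBytes($memoryLimit);

    return $bytes > 0 ? (int) floor($bytes * $fraction) : $fallback;
}
```

It could be wired in as $config->setDecodeMemoryLimit(suggestDecodeLimit((string) ini_get('memory_limit'))).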
Laravel Service Provider:
Register the parser as a singleton in AppServiceProvider for easy access:
public function register()
{
    $this->app->singleton(\Smalot\PdfParser\Parser::class, function ($app) {
        return new \Smalot\PdfParser\Parser();
    });
}
Command Bus for Batch Processing: Use Laravel's command bus to process multiple PDFs asynchronously:
$paths = Storage::disk('local')->files('pdfs');
foreach ($paths as $path) {
    // Static dispatch is provided by the job's Dispatchable trait
    ParsePdfJob::dispatch($path);
}
Event Listeners:
Trigger events after parsing (e.g., PdfParsed event) to process extracted data:
class PdfParsedListener
{
    public function handle(PdfParsed $event)
    {
        // Process $event->text or $event->metadata
    }
}
Queue Jobs for Large Files: Offload parsing of large PDFs to a queue:
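class ParsePdfJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    /** @var string Path relative to storage/app */
    protected $pdfPath;

    public function __construct(string $pdfPath)
    {
        $this->pdfPath = $pdfPath;
    }

    public function handle()
    {
        $parser = app(\Smalot\PdfParser\Parser::class);
        $pdf = $parser->parseFile(storage_path('app/' . $this->pdfPath));
        // Process $pdf
    }
}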
Encrypted PDFs:
Parsing a secured file throws "Exception: Secured pdf file are currently not supported." Use setIgnoreEncryption(true) cautiously (only for unencrypted PDFs marked as encrypted).
Memory Exhaustion:
Parsing can fail with "Allowed memory size of X bytes exhausted." Use setRetainImageContent(false) and setDecodeMemoryLimit() to mitigate this.
Text Formatting Issues:
Adjust setFontSpaceLimit() and setHorizontalOffset():
$config->setFontSpaceLimit(-60); // Reduce spaces
$config->setHorizontalOffset("\t"); // Preserve structure
XMP Metadata:
// An XMP value such as dc:creator may be a string or an array
if (isset($metadata['dc:creator']) && is_array($metadata['dc:creator'])) {
    // Handle array of creators
}
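To avoid branching on string versus array throughout the codebase, the value can be normalized to a list once. A minimal sketch, assuming a hypothetical helper named metadataValueToList() and flat (non-nested) metadata values:

```php
/**
 * Normalize a metadata value that may be a string, an array, or missing
 * into a clean list of non-empty strings.
 * Hypothetical helper for illustration; not part of smalot/pdfparser.
 */
function metadataValueToList($value): array
{
    $values = is_array($value) ? $value : [$value];

    $clean = [];
    foreach ($values as $item) {
        // Skip null and nested structures; only scalars are kept.
        if (!is_scalar($item)) {
            continue;
        }
        $item = trim((string) $item);
        if ($item !== '') {
            $clean[] = $item;
        }
    }

    return $clean;
}
```

For example, metadataValueToList($metadata['dc:creator'] ?? null) always yields an array, whether the PDF lists zero, one, or several creators.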
Page Details:
getDetails() may not always include MediaBox. Fall back to header details if missing:
if (!isset($details['MediaBox'])) {
    $pages = $pdf->getObjectsByType('Pages');
    $details = reset($pages)->getHeader()->getDetails();
}
Font Width Calculation:
When the font has no width data for some characters, calculateTextWidth() populates the $missing array with them. Handle missing characters gracefully:
$width = $font->calculateTextWidth('Some text', $missing);
if (!empty($missing)) {
    // Log or handle missing characters
}
Log Parsing Errors:
try {
    $pdf = $parser->parseFile($path);
} catch (\Exception $e) {
    Log::error("PDF Parsing Error: " . $e->getMessage());
}
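When parsing runs in a queue, it can be useful to tell unsupported files apart from other failures so that they are skipped instead of retried. The sketch below matches on the "Secured pdf file" message quoted in this document; classifyPdfError() and its return labels are hypothetical, and matching on message text is brittle if the library changes its wording.

```php
/**
 * Classify a parsing failure by its exception message.
 * Returns 'encrypted' for secured PDFs and 'other' for everything else.
 * Hypothetical helper for illustration; not part of smalot/pdfparser.
 */
function classifyPdfError(Throwable $e): string
{
    // Case-insensitive match on the library's secured-PDF message.
    if (stripos($e->getMessage(), 'Secured pdf file') !== false) {
        return 'encrypted';
    }

    return 'other';
}
```

Inside the catch block, a job could log and stop when classifyPdfError($e) returns 'encrypted', and rethrow otherwise so the queue retries it.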
Inspect PDF Structure:
Use getObjectsByType('Pages') to debug page structures:
$pages = $pdf->getObjectsByType('Pages');
foreach ($pages as $page) {
    $header = $page->getHeader();
    // Inspect $header details
}
Verify Configurations:
$config = new \Smalot\PdfParser\Config();
$config->setFontSpaceLimit(-60);
$config->setHorizontalOffset("\t");
$parser = new \Smalot\PdfParser\Parser([], $config);
Test with Sample PDFs:
Test against sample PDFs that exercise the options you rely on (e.g., setIgnoreEncryption).
Custom Configurations:
Extend the Config class to add new parsing behaviors:
class CustomConfig extends \Smalot\PdfParser\Config
{
    public function setCustomOption($value)
    {
        // Add custom logic
    }
}
Post-Processing Hooks:
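Wrap the Document object returned by the parser to add post-processing logic:
class EnhancedPdf
{
    protected $pdf;
    // parseFile() and parseContent() return a \Smalot\PdfParser\Document
    public function __construct(\Smalot\PdfParser\Document $pdf)
    {
        $this->pdf = $pdf;
    }
    public function getProcessedText()
    {
        $text = $this->pdf->getText();
        return $this->cleanText($text); // Custom cleaning logic
    }
    // Placeholder cleaner: collapse runs of whitespace into single spaces
    protected function cleanText($text)
    {
        return trim(preg_replace('/\s+/', ' ', $text));
    }
}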
Event-Driven Extensions:
Dispatch custom events (e.g., PageParsed, TextExtracted) using Laravel's event system:
event(new TextExtracted($text, $pageNumber));
Integration with OCR:
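Fall back to an external tool (e.g., spatie/pdf-to-text or Tesseract) for scanned PDFs:
// The library has no built-in scanned-PDF check; empty extracted text is a common hint
$text = $pdf->getText();
if (trim($text) === '') {
    $text = $this->ocrService->extractText($pdfPath); // your own OCR service
}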
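One way to decide whether a PDF is probably scanned is a text-density heuristic: image-only pages yield little or no extractable text. This is a rough sketch, not a library feature; looksScanned() is a hypothetical helper and the 20-characters-per-page threshold is an arbitrary starting point to tune against real documents.

```php
/**
 * Guess whether a PDF is scanned from how little text was extracted.
 * Hypothetical helper for illustration; not part of smalot/pdfparser.
 */
function looksScanned(string $text, int $pageCount, int $minCharsPerPage = 20): bool
{
    // No pages means no usable text layer.
    if ($pageCount <= 0) {
        return true;
    }

    // Count bytes that are not whitespace.
    $chars = strlen(preg_replace('/\s+/', '', $text) ?? '');

    return ($chars / $pageCount) < $minCharsPerPage;
}
```

A call such as looksScanned($pdf->getText(), count($pdf->getPages())) can then gate the OCR fallback. Mixed documents with some scanned and some text pages would need a per-page check instead.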