smalot/pdfparser
Standalone PHP library to parse PDF files and extract content. Reads objects/headers, metadata, and ordered page text; supports compressed PDFs and various encodings. Configure parsing via custom configs. Note: no support for secured PDFs or form data.
## Getting Started
### Minimal Setup
1. **Installation**:
   ```bash
   composer require smalot/pdfparser
   ```
   Ensure your project uses PHP 7.1+ (Laravel 5.5+ compatible).
2. **First Use Case**: Parse a PDF file and extract raw text:
   ```php
   use Smalot\PdfParser\Parser;

   $parser = new Parser();
   $pdf = $parser->parseFile(storage_path('app/uploads/document.pdf'));
   $text = $pdf->getText();
   ```
Key Entry Points:
- `Parser::parseFile()`: Parse a local file.
- `Parser::parseContent()`: Parse raw PDF content (e.g., from a request).
- `$pdf->getText()`: Get concatenated text from all pages.
- `$pdf->getPages()`: Access individual pages as `Page` objects.

Where to Look First:
```php
$parser = new Parser();
$pdf = $parser->parseFile($filePath);

// Get all text (concatenated)
$fullText = $pdf->getText();

// Get text per page
foreach ($pdf->getPages() as $page) {
    $pageText = $page->getText();
    // Process per-page text (e.g., save to DB, search for keywords)
}
```
```php
$details = $pdf->getDetails(); // Metadata from the PDF's Info dictionary
$author = $details['Author'] ?? 'Unknown';
$title = $details['Title'] ?? 'Untitled';
```
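Key casing in the metadata array follows whatever the PDF's Info dictionary contains, so a lookup for `author` can miss an `Author` entry. A minimal sketch of a case-insensitive lookup, assuming a plain associative array of metadata; the helper name `detail()` is hypothetical and not part of the library:

```php
<?php

/**
 * Look up a metadata entry without depending on key casing.
 * Hypothetical helper; not part of smalot/pdfparser.
 */
function detail(array $details, string $key, ?string $default = null): ?string
{
    foreach ($details as $name => $value) {
        // Skip empty and non-scalar values so the default applies to them too.
        if (strcasecmp((string) $name, $key) === 0 && is_scalar($value) && $value !== '') {
            return (string) $value;
        }
    }

    return $default;
}

// Sample array standing in for the parsed document's metadata.
$details = ['Author' => 'Jane Doe', 'Title' => '', 'Pages' => 3];

$author = detail($details, 'author', 'Unknown');
$title = detail($details, 'title', 'Untitled');
```

An empty title falls back to the default here, which `??` alone would not do.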
```php
foreach ($pdf->getPages() as $index => $page) {
    $text = $page->getText();
    $width = $page->getWidth(); // Added in v2.10.0
    $height = $page->getHeight();

    // Example: Log text with page dimensions
    \Log::info("Page {$index}: Width={$width}px, Height={$height}px. Text length: " . strlen($text));
}
```
Use `Parser::parseContent()` for streamed uploads (e.g., from HTTP requests):
```php
$parser = new Parser();
$pdf = $parser->parseContent($request->getContent());
```
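Chunk Processing: For very large files, process pages incrementally. `parseFile()` still reads the whole file into memory, so this limits only how much extracted text is held at once:
```php
$parser = new Parser();
$pdf = $parser->parseFile($largeFilePath);

foreach ($pdf->getPages() as $page) {
    $text = $page->getText();
    // Save to DB in batches or stream to a file
}
```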
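The "save in batches" step can be sketched with a generator. This is a minimal illustration, assuming the per-page text is available as an iterable of strings; `batchPageText()` is a hypothetical helper, not a library function:

```php
<?php

/**
 * Group per-page text into fixed-size batches, preserving page keys.
 * Hypothetical helper; $pageTexts can be any iterable of strings.
 */
function batchPageText(iterable $pageTexts, int $batchSize): \Generator
{
    if ($batchSize < 1) {
        throw new \InvalidArgumentException('Batch size must be at least 1.');
    }

    $batch = [];
    foreach ($pageTexts as $pageIndex => $text) {
        $batch[$pageIndex] = $text;
        if (count($batch) === $batchSize) {
            yield $batch;
            $batch = [];
        }
    }

    if ($batch !== []) {
        yield $batch; // Final, possibly smaller batch
    }
}

// Stand-in for the text of five pages.
$texts = ['p0', 'p1', 'p2', 'p3', 'p4'];
$batches = iterator_to_array(batchPageText($texts, 2), false);
```

In an application, the input could be a generator that yields `$page->getText()` for each page, so that only one batch of extracted text is held at a time.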
File Upload Handling:
```php
public function upload(UploadRequest $request) {
    $file = $request->file('pdf');
    $parser = new Parser();
    $pdf = $parser->parseFile($file->path());
    $details = $pdf->getDetails();

    // Store metadata in DB
    Metadata::create([
        'title' => $details['Title'] ?? null,
        'author' => $details['Author'] ?? null,
        'user_id' => auth()->id(),
    ]);

    return redirect()->back()->with('success', 'PDF processed!');
}
```
Queueing Heavy Processing:
```php
public function processPdfJob(ProcessPdfJob $job) {
    $filePath = storage_path('app/uploads/' . $job->filename);
    $parser = new Parser();
    $pdf = $parser->parseFile($filePath);

    // Process text (e.g., OCR, search, or analysis)
    $job->updateProgress(100);
}
```
Service Provider Binding:
```php
// app/Providers/AppServiceProvider.php
public function register() {
    $this->app->singleton(Parser::class, function () {
        return new Parser();
    });
}
```
Then inject `Parser` into controllers/services via constructor injection.
Custom Text Processing:
Override `Page::getText()` or use `getTextArray()` for granular control:
```php
$textArray = $page->getTextArray(); // Array of text chunks
$cleanText = implode(' ', array_filter($textArray));
```
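One caveat with `array_filter()` is that it also drops the string `"0"`, which can be legitimate text. A minimal sketch of a stricter join, assuming the chunks are plain strings; `joinTextChunks()` is a hypothetical helper:

```php
<?php

/**
 * Join text chunks with single spaces, dropping only truly empty chunks.
 * Hypothetical helper; not part of smalot/pdfparser.
 */
function joinTextChunks(array $chunks): string
{
    $clean = [];
    foreach ($chunks as $chunk) {
        // Collapse runs of whitespace inside each chunk, then trim the ends.
        $chunk = trim((string) preg_replace('/\s+/', ' ', (string) $chunk));
        if ($chunk !== '') { // Unlike array_filter(), this keeps "0"
            $clean[] = $chunk;
        }
    }

    return implode(' ', $clean);
}

// Stand-in for the chunks extracted from one page.
$chunks = ['  Invoice ', '', "No.\t42", '0', "\n"];
$result = joinTextChunks($chunks);
```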
Metadata Validation:
```php
$details = $pdf->getDetails();
$validated = [
    'author' => $details['Author'] ?? null,
    'title' => $details['Title'] ?? 'Unnamed Document',
    'created' => $details['CreationDate'] ?? now()->format('Y-m-d'),
];
```
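The fallback above is formatted as `Y-m-d`, while a date read from the PDF may arrive in another format. A minimal sketch of normalizing both to one format, assuming the stored value is a string that PHP's `DateTimeImmutable` can parse (for example ISO 8601); `normalizeDate()` is a hypothetical helper:

```php
<?php

/**
 * Convert a raw date string to Y-m-d, or return the fallback.
 * Hypothetical helper; assumes $raw is parseable by DateTimeImmutable.
 */
function normalizeDate(?string $raw, string $fallback): string
{
    if ($raw === null || trim($raw) === '') {
        return $fallback;
    }

    try {
        return (new \DateTimeImmutable($raw))->format('Y-m-d');
    } catch (\Exception $e) {
        return $fallback; // Unparseable value: keep the caller's default
    }
}

$today = date('Y-m-d');
$fromPdf = normalizeDate('2013-02-19T17:04:23+01:00', $today);
$missing = normalizeDate(null, $today);
$broken = normalizeDate('???', $today);
```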
Error Handling: Wrap parsing in try-catch for malformed PDFs:
```php
try {
    $pdf = $parser->parseFile($filePath);
} catch (\Exception $e) {
    // Catch the base class: the parser reports failures through \Exception and its subclasses
    \Log::error("PDF parsing failed: {$e->getMessage()}");
    return back()->with('error', 'Invalid PDF file.');
}
```
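When several call sites need the same handling, the try-catch can be moved into one place. A minimal sketch, assuming the caller passes the parsing step as a callback; `tryParse()` is a hypothetical helper, and the callbacks below are stand-ins for a real `parseFile()` call:

```php
<?php

/**
 * Run a parsing callback and convert any failure into an error message.
 * Hypothetical helper; returns [result, errorMessage].
 */
function tryParse(callable $parse): array
{
    try {
        return [$parse(), null];
    } catch (\Throwable $e) {
        return [null, $e->getMessage()];
    }
}

// Stand-in callbacks; an application would wrap $parser->parseFile($filePath) instead.
[$ok, $okError] = tryParse(function () {
    return 'parsed';
});
[$failed, $failedError] = tryParse(function () {
    throw new \RuntimeException('Invalid PDF data');
});
```

Catching `\Throwable` also covers PHP errors such as `TypeError`, which a plain `\Exception` catch would let through.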
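Malformed PDFs:
Corrupted files cause the parser to throw an exception. A cheap pre-check is to look for the `%PDF-` header near the start of the file before parsing:
```php
$head = (string) file_get_contents($filePath, false, null, 0, 1024);
if (strpos($head, '%PDF-') === false) {
    throw new \InvalidArgumentException('Invalid PDF file.');
}
```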
Encrypted PDFs:
setasign/fpdf for decryption if needed.

Memory Issues:
`memory_limit` in php.ini.

Text Extraction Quirks:
`\n`, `\t`). Use `getTextArray()` for structured data.

Unicode/Encoding:
```php
\Log::debug('Extracted text: ' . bin2hex($text));
```
Zero-Length Streams:
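Log Raw Objects: Dump the parsed PDF object structure to debug:
```php
\Log::debug('PDF details', $pdf->getDetails());                // Metadata from the Info dictionary
\Log::debug('PDF object IDs', array_keys($pdf->getObjects())); // IDs of all parsed objects
```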
Inspect Pages:
```php
foreach ($pdf->getPages() as $index => $page) {
    \Log::debug([
        'Page ' . ($index + 1), // Page number derived from the array index
        'Text preview: ' . substr($page->getText(), 0, 200),
        'Width/Height: ' . $page->getWidth() . 'x' . $page->getHeight(),
    ]);
}
```
Check for Hidden Characters:
Use `trim()` or `preg_replace()` to clean text:
```php
// Removes all ASCII control characters, including \n and \t
$cleanText = preg_replace('/[\x00-\x1F\x7F]/u', '', $text);
```
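If line and column structure matters, the character class can leave tabs and line breaks alone. A minimal sketch of that variant; `stripControlChars()` is a hypothetical helper:

```php
<?php

/**
 * Remove control characters but keep tab, line feed, and carriage return.
 * Hypothetical helper; not part of smalot/pdfparser.
 */
function stripControlChars(string $text): string
{
    // Byte-wise match without the /u flag, so invalid UTF-8 input does not make preg_replace() fail.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text) ?? $text;
}

$raw = "Line 1\x00\nLine\x07 2\tend\x7F";
$clean = stripControlChars($raw);
```

Multibyte UTF-8 sequences use only bytes of `0x80` and above, so they pass through unchanged.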
Custom Configurations:
Override default parsing behavior via a `Config` object, passed as the second constructor argument:
```php
use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$config = new Config();
$config->setRetainImageContent(false); // Do not keep image data in memory
$parser = new Parser([], $config);
$pdf = $parser->parseFile($filePath);
```
See CustomConfig.md for all options.
Extending Page: Create a subclass to add custom methods:
```php
class EnhancedPdfPage extends \Smalot\PdfParser\Page {
    public function getWordCount() {
        return str_word_count($this->getText());
    }
}
```
Then replace the parser’s page objects (advanced
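An alternative that avoids swapping page objects is a wrapper built by composition. A minimal sketch, assuming only that the wrapped object exposes `getText()`; the class name `PageTextStats` is hypothetical, and the stub below stands in for a parsed page:

```php
<?php

/**
 * Wraps any object exposing getText() and adds a word count.
 * Hypothetical class; it does not extend the library's page class.
 */
class PageTextStats
{
    private $page;

    public function __construct($page)
    {
        if (!is_object($page) || !method_exists($page, 'getText')) {
            throw new \InvalidArgumentException('Page must expose getText().');
        }
        $this->page = $page;
    }

    public function getText(): string
    {
        return (string) $this->page->getText();
    }

    public function getWordCount(): int
    {
        return str_word_count($this->getText());
    }
}

// Stub standing in for a parsed page.
$stub = new class {
    public function getText()
    {
        return 'Hello PDF world';
    }
};

$stats = new PageTextStats($stub);
$words = $stats->getWordCount();
```

With real pages, the wrapper could be applied after parsing, for example by mapping over the array returned by `$pdf->getPages()`.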