## Getting Started

### Minimal Setup
1. **Installation**:
   ```bash
   composer require smalot/pdfparser
   ```

   Ensure your project runs PHP 7.1 or later (compatible with Laravel 5.5+).

2. **First Use Case**: Parse a PDF file and extract its raw text:
   ```php
   use Smalot\PdfParser\Parser;

   $parser = new Parser();
   $pdf = $parser->parseFile(storage_path('app/uploads/document.pdf'));
   $text = $pdf->getText();
   ```

3. **Key Entry Points**:
   - `Parser::parseFile()`: Parse a local file.
   - `Parser::parseContent()`: Parse raw PDF content held in a string (e.g., a request body).
   - `$pdf->getText()`: Get the concatenated text of all pages.
   - `$pdf->getPages()`: Access individual pages as `Page` objects.

4. **Where to Look First**:
   - `Usage.md` for core methods.
   - `CustomConfig.md` for advanced parsing tweaks.

## Implementation Patterns

### Core Workflows

#### 1. Basic Text Extraction

```php
$parser = new Parser();
$pdf = $parser->parseFile($filePath);

// Get all text (concatenated)
$fullText = $pdf->getText();

// Get text per page
foreach ($pdf->getPages() as $page) {
    $pageText = $page->getText();
    // Process per-page text (e.g., save to DB, search for keywords)
}
```

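#### 2. Metadata Extraction

`$pdf->getDetails()` returns the document's metadata as an array. Keys follow the PDF's own capitalization (e.g., `Author`, `Title`, `CreationDate`), and any of them may be absent.

```php
$details = $pdf->getDetails();
$author = $details['Author'] ?? 'Unknown';
$title = $details['Title'] ?? 'Untitled';
```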
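
Metadata keys and values vary between PDF producers, so a tolerant lookup can make the defaults above more robust. The following is a minimal sketch in plain PHP: `detailValue()` is a hypothetical helper (not part of the library), and the sample array stands in for whatever the parser returns.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: look up a metadata key case-insensitively and
 * fall back to a default when the key is missing or blank.
 */
function detailValue(array $details, string $key, string $default): string
{
    foreach ($details as $name => $value) {
        if (strcasecmp((string) $name, $key) === 0 && is_scalar($value)) {
            $value = trim((string) $value);
            return $value !== '' ? $value : $default;
        }
    }
    return $default;
}

// Sample data standing in for the array returned by the parser.
$details = ['Author' => 'Jane Doe', 'Title' => '  '];

echo detailValue($details, 'author', 'Unknown'), PHP_EOL;  // Jane Doe
echo detailValue($details, 'title', 'Untitled'), PHP_EOL;  // Untitled
echo detailValue($details, 'Subject', 'None'), PHP_EOL;    // None
```

Blank values are treated like missing ones, which avoids storing whitespace-only titles.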

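#### 3. Page-Specific Processing

```php
foreach ($pdf->getPages() as $index => $page) {
    $text = $page->getText();

    // Page size comes from the MediaBox entry: [llx, lly, urx, ury] in PDF points (1/72 inch).
    // Assumption: the MediaBox is present and its origin is (0, 0).
    $mediaBox = $page->getDetails()['MediaBox'] ?? [0, 0, 0, 0];
    $width = $mediaBox[2];
    $height = $mediaBox[3];

    // Example: Log text length together with page dimensions
    \Log::info("Page {$index}: Width={$width}pt, Height={$height}pt. Text length: " . strlen($text));
}
```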
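
PDF page sizes are expressed in points (1/72 inch) rather than pixels, so converting to physical units is often useful for logging or layout checks. The sketch below is plain PHP and assumes the page box is available as a four-element array `[llx, lly, urx, ury]`, the shape of a PDF MediaBox entry. `mediaBoxToMillimetres()` is a hypothetical helper, not a library function.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: convert a MediaBox [llx, lly, urx, ury] given in
 * PDF points (1 pt = 1/72 inch) to width and height in millimetres.
 */
function mediaBoxToMillimetres(array $mediaBox): array
{
    [$llx, $lly, $urx, $ury] = array_map('floatval', $mediaBox);
    $mmPerPoint = 25.4 / 72;

    return [
        'width'  => round(abs($urx - $llx) * $mmPerPoint, 1),
        'height' => round(abs($ury - $lly) * $mmPerPoint, 1),
    ];
}

// An A4 page is commonly stored as [0, 0, 595.28, 841.89].
$size = mediaBoxToMillimetres([0, 0, 595.28, 841.89]);
printf("%.1f x %.1f mm\n", $size['width'], $size['height']); // 210.0 x 297.0 mm
```

Subtracting the lower-left corner keeps the result correct for boxes whose origin is not `(0, 0)`.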

#### 4. Handling Large PDFs

`Parser::parseContent()` accepts the PDF as a string, which is convenient when the file arrives in an HTTP request body. The whole payload is still held in memory:

```php
$parser = new Parser();
$pdf = $parser->parseContent($request->getContent());
```

**Chunk Processing**: The document is parsed in full either way, but reading text page by page avoids building one very large string:

```php
$parser = new Parser();
$pdf = $parser->parseFile($largeFilePath);

foreach ($pdf->getPages() as $page) {
    $text = $page->getText();
    // Save to DB in batches or stream to a file
}
```
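
Saving "in batches" can be sketched with a small generator that groups per-page text into fixed-size chunks. This is plain PHP under stated assumptions: `batchPages()` is a hypothetical helper, and the sample strings stand in for the text of each page.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: group an iterable of per-page strings into
 * fixed-size batches, yielding each batch as soon as it is full.
 */
function batchPages(iterable $pageTexts, int $batchSize): Generator
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }

    $batch = [];
    foreach ($pageTexts as $text) {
        $batch[] = $text;
        if (count($batch) === $batchSize) {
            yield $batch;
            $batch = [];
        }
    }
    if ($batch !== []) {
        yield $batch; // final, possibly smaller batch
    }
}

// Stand-in for per-page text; in practice each item would come from $page->getText().
$pages = ['page 1', 'page 2', 'page 3', 'page 4', 'page 5'];

foreach (batchPages($pages, 2) as $i => $batch) {
    echo "Batch {$i}: ", count($batch), " page(s)", PHP_EOL;
}
```

Each yielded batch could then be written with a single bulk insert instead of one query per page.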

## Integration Tips

### Laravel-Specific Patterns

**File Upload Handling**: In this example, `UploadRequest` and `Metadata` are application classes (a form request and an Eloquent model), not part of the library.

```php
public function upload(UploadRequest $request) {
    $file = $request->file('pdf');
    $parser = new Parser();
    $pdf = $parser->parseFile($file->path());
    $details = $pdf->getDetails();

    // Store metadata in DB
    Metadata::create([
        'title' => $details['Title'] ?? null,
        'author' => $details['Author'] ?? null,
        'user_id' => auth()->id(),
    ]);

    return redirect()->back()->with('success', 'PDF processed!');
}
```

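**Queueing Heavy Processing**: Move parsing into a queued job so the request returns quickly. The method below is the `handle()` method of an application job class (here called `ProcessPdfJob`) that keeps the uploaded file name in `$this->filename`:

```php
public function handle(Parser $parser) {
    $filePath = storage_path('app/uploads/' . $this->filename);
    $pdf = $parser->parseFile($filePath);

    // Process text (e.g., search indexing or analysis)
    $text = $pdf->getText();
}
```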

**Service Provider Binding**:

```php
// app/Providers/AppServiceProvider.php
public function register() {
    $this->app->singleton(Parser::class, function () {
        return new Parser();
    });
}
```

Then inject `Parser` into controllers and services through their constructors.

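### Advanced Patterns

**Custom Text Processing**: Use `getTextArray()` when you need the individual text chunks of a page rather than one concatenated string:

```php
$textArray = $page->getTextArray(); // Array of text chunks
// Drop blank chunks; a bare array_filter() would also drop the string "0"
$cleanText = implode(' ', array_filter($textArray, function ($chunk) {
    return trim($chunk) !== '';
}));
```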
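
Chunks often carry stray tabs, line breaks, or padding. The sketch below shows one way to normalize them before joining. It is plain PHP: `joinChunks()` is a hypothetical helper, and the sample array stands in for the result of `getTextArray()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: join text chunks into one line, collapsing runs of
 * whitespace and dropping chunks that are empty after trimming.
 */
function joinChunks(array $chunks): string
{
    $cleaned = [];
    foreach ($chunks as $chunk) {
        // Collapse any run of whitespace to a single space.
        $chunk = trim((string) preg_replace('/\s+/u', ' ', (string) $chunk));
        if ($chunk !== '') {
            $cleaned[] = $chunk;
        }
    }
    return implode(' ', $cleaned);
}

// Sample chunks standing in for the result of getTextArray().
echo joinChunks(["Invoice ", "", "No.\t 0", "0", "  Total:\n42 "]), PHP_EOL;
// Invoice No. 0 0 Total: 42
```

Comparing against the empty string, rather than relying on truthiness, keeps chunks such as `"0"` that a bare `array_filter()` would discard.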

**Metadata Validation**:

```php
$details = $pdf->getDetails();
$validated = [
    'author' => $details['Author'] ?? null,
    'title'  => $details['Title']  ?? 'Unnamed Document',
    // Falls back to today's date when the PDF carries no creation date
    'created' => $details['CreationDate'] ?? now()->format('Y-m-d'),
];
```

**Error Handling**: Wrap parsing in a try-catch block, since malformed PDFs cause the parser to throw:

```php
try {
    $pdf = $parser->parseFile($filePath);
} catch (\Exception $e) {
    // Catch the base class: the exact exception type depends on the failure
    \Log::error("PDF parsing failed: {$e->getMessage()}");
    return back()->with('error', 'Invalid PDF file.');
}
```

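## Gotchas and Tips

### Pitfalls

**Malformed PDFs**:
- The library throws an exception for corrupted files. The class depends on the failure, so catching `\Exception` is the safest option.
- Fix: Validate files before processing (e.g., check the MIME type or file signature).
- Workaround: Use a try-catch block, or pre-validate the file signature yourself:
```php
// A PDF carries the "%PDF-" signature near the start of the file
$head = (string) file_get_contents($filePath, false, null, 0, 1024);
if (strpos($head, '%PDF-') === false) {
    throw new \InvalidArgumentException('Invalid PDF file.');
}
```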
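
**Encrypted PDFs**:
- The library cannot read password-protected files and throws an exception when it meets one. For files that are only protected against editing or printing, `Config::setIgnoreEncryption(true)` lets parsing proceed.
- Tip: If you need the content of a password-protected file, decrypt it beforehand with a dedicated tool such as qpdf.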
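
**Memory Issues**:
- Large PDFs (e.g., >100MB) may hit memory limits, because the whole document is parsed in memory.
- Fix: Increase `memory_limit` in `php.ini`, skip image data with `Config::setRetainImageContent(false)`, or lower the decoding limit with `Config::setDecodeMemoryLimit()`. Reading text page by page also avoids building one very large string.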
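
A rough pre-flight check can reject files that are unlikely to fit before parsing starts. The sketch below is plain PHP under loud assumptions: both functions are hypothetical helpers, and the multiplier of 4 is a guess at parsing overhead, not a measured figure for this library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: convert a php.ini shorthand size such as "128M"
 * to bytes. Returns null for "-1" (no limit).
 */
function iniSizeToBytes(string $value): ?int
{
    $value = trim($value);
    if ($value === '-1') {
        return null;
    }
    $number = (int) $value; // the cast keeps the leading digits and ignores the suffix
    switch (strtoupper(substr($value, -1))) {
        case 'G': return $number * 1024 ** 3;
        case 'M': return $number * 1024 ** 2;
        case 'K': return $number * 1024;
        default:  return $number;
    }
}

/**
 * Rough check. Assumption: parsing needs several times the file size in
 * memory; the default factor of 4 is a guess, not a measured value.
 */
function likelyFitsInMemory(int $fileSize, string $memoryLimit, int $factor = 4): bool
{
    $limit = iniSizeToBytes($memoryLimit);
    return $limit === null || $fileSize * $factor <= $limit;
}

var_dump(likelyFitsInMemory(10 * 1024 ** 2, '128M'));  // bool(true)
var_dump(likelyFitsInMemory(100 * 1024 ** 2, '128M')); // bool(false)
```

In an application the arguments would typically be `filesize($filePath)` and `(string) ini_get('memory_limit')`.
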

**Text Extraction Quirks**:
- Forms/Images: By default, forms and images are excluded from text extraction (since v2.12.1).
- Formatting: Raw text may include whitespace control characters (e.g., `\n`, `\t`). Use `getTextArray()` to get the text as separate chunks.

**Unicode/Encoding**:
- Some PDFs use non-UTF-8 encodings (e.g., Mac Roman). The library converts these automatically, but results may vary.
- Tip: Log a hex dump of the extracted text to debug encoding issues:
```php
\Log::debug('Extracted text (hex): ' . bin2hex($text));
```
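
A full hex dump is hard to read for long text. As an alternative sketch in plain PHP, the hypothetical helper `describeText()` reports whether a string is valid UTF-8 and marks only the non-ASCII bytes, so suspicious sequences stand out.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: flag whether the text is valid UTF-8 and replace
 * every non-ASCII byte with a <XX> hex marker.
 */
function describeText(string $text): string
{
    // An empty pattern with the /u modifier matches only if the subject is valid UTF-8.
    $validity = preg_match('//u', $text) === 1 ? 'valid UTF-8' : 'INVALID UTF-8';

    // No /u modifier here, so the pattern works on raw bytes.
    $marked = preg_replace_callback(
        '/[\x80-\xFF]/',
        function (array $match): string {
            return sprintf('<%02X>', ord($match[0]));
        },
        $text
    );

    return "[{$validity}] {$marked}";
}

echo describeText("caf\xC3\xA9"), PHP_EOL; // [valid UTF-8] caf<C3><A9>
echo describeText("caf\xE9"), PHP_EOL;     // [INVALID UTF-8] caf<E9>
```

The second call shows a Latin-1 encoded character, a typical sign that text was not converted to UTF-8.
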

**Zero-Length Streams**:
- Rarely, PDFs may contain empty streams, causing parsing to stall.
- Fix: Upgrade to v2.12.3+ (includes a DoS fix for malformed PDFs).

### Debugging Tips

**Log Raw Objects**: Dump the parsed document details to debug:
```php
\Log::debug(print_r($pdf->getDetails(), true)); // Dumps document metadata
```

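**Inspect Pages**:

```php
foreach ($pdf->getPages() as $index => $page) {
    // MediaBox is [llx, lly, urx, ury] in PDF points; it may be missing
    $mediaBox = $page->getDetails()['MediaBox'] ?? [0, 0, 0, 0];
    \Log::debug([
        'Page ' . ($index + 1),
        'Text preview: ' . mb_substr($page->getText(), 0, 200),
        'Width/Height: ' . $mediaBox[2] . 'x' . $mediaBox[3],
    ]);
}
```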

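**Check for Hidden Characters**: Use `trim()` or `preg_replace()` to clean text. The pattern below strips control characters but keeps tabs and line breaks:
```php
$cleanText = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $text);
```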

### Extension Points

**Custom Configurations**: Override default parsing behavior by passing a `Config` object as the parser's second constructor argument:

```php
use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$config = new Config();
$config->setRetainImageContent(false); // Skip image data to save memory
$config->setDecodeMemoryLimit(1000000); // Limit memory (in bytes) for decoding operations

$parser = new Parser([], $config);
$pdf = $parser->parseFile($filePath);
```

See `CustomConfig.md` for all options.

**Extending `Page`**: Create a subclass to add custom methods:

```php
class EnhancedPdfPage extends \Smalot\PdfParser\Page {
    public function getWordCount() {
        // str_word_count() only recognizes ASCII letters by default
        return str_word_count($this->getText());
    }
}
```

Then replace the parser’s page objects (advanced
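
Because the page objects are created by the parser, wrapping an existing page can be simpler than swapping in a subclass. The following is a minimal sketch of that composition approach, not the library's own extension mechanism. `WordCountingPage` is a hypothetical class that accepts any object exposing `getText()`, so it can be tried without the library installed. The `object` type hint assumes PHP 7.2 or later.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical decorator: wraps any object that exposes getText()
 * (such as a parsed page) and adds a Unicode-aware word count.
 */
final class WordCountingPage
{
    /** @var object */
    private $page;

    public function __construct(object $page)
    {
        if (!method_exists($page, 'getText')) {
            throw new InvalidArgumentException('Wrapped object must provide getText().');
        }
        $this->page = $page;
    }

    public function getText(): string
    {
        return (string) $this->page->getText();
    }

    public function getWordCount(): int
    {
        // Count runs of letters and digits, so non-ASCII words are counted too.
        return (int) preg_match_all('/[\p{L}\p{N}]+/u', $this->getText());
    }
}

// Stand-in page object; a page returned by getPages() would be wrapped the same way.
$fakePage = new class {
    public function getText(): string
    {
        return 'Größe: 42 mm, café au lait';
    }
};

echo (new WordCountingPage($fakePage))->getWordCount(), PHP_EOL; // 6
```

Wrapping leaves the parser untouched: `array_map` over `$pdf->getPages()` would be enough to decorate every page.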
