Getting Started

Minimal Setup in Laravel

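Install via Composer. filter() translates CSS selectors through the CssSelector component, so install it alongside the crawler:

composer require symfony/dom-crawler symfony/css-selector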

First Use Case: Extracting Links from HTML

use Symfony\Component\DomCrawler\Crawler;

// Fetch HTML (e.g., from a Laravel HTTP client or file)
$html = file_get_contents('https://example.com');

// Create a Crawler instance
$crawler = new Crawler($html);

// Extract all links
$links = $crawler->filter('a')->each(function (Crawler $node) {
    return $node->attr('href');
});
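
The href values collected above are raw attribute strings, so relative links stay relative. A minimal sketch of resolving them to absolute URLs with links(), assuming a Composer-managed project with symfony/css-selector installed; the inline markup and the https://example.com/index.html page URL are illustrative stand-ins for a fetched page:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumption: Composer autoloader in the project root

use Symfony\Component\DomCrawler\Crawler;

// Illustrative markup standing in for a fetched page
$html = '<html><body><a href="/about">About</a><a href="https://symfony.com/">Symfony</a></body></html>';

// The second argument is the URI the document came from; relative links resolve against it
$crawler = new Crawler($html, 'https://example.com/index.html');

// links() wraps each matched <a> in a Link object whose getUri() returns an absolute URL
$absolute = array_map(
    fn ($link) => $link->getUri(),
    $crawler->filter('a')->links()
);

print_r($absolute);
```

Without the URI argument the Crawler has nothing to resolve against, which is why it is passed here even though the HTML is already in memory.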

Key Starting Points:

Official Docs: Symfony DomCrawler Documentation
Laravel Integration: Use with Guzzle or Illuminate\Support\Facades\Http for fetching HTML.
Common Selectors: Start with filter() (CSS selectors), filterXPath(), and the helpers selectLink() and selectButton().

Implementation Patterns

1. Chaining and Fluent API

Leverage method chaining for readability and maintainability:

$summaries = $crawler->filter('article')->filter('.summary')->each(function (Crawler $node) {
    return $node->text();
});

2. Form Data Extraction

Extract form fields and values for scraping or testing:

$formData = $crawler->selectButton('Submit')->form()->getValues();
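
A self-contained sketch of the same extraction, assuming a Composer-managed project; the form markup, field names, and https://example.com/signup page URL are made up for illustration. It also reads the form's method and resolved action, which are usually needed alongside the values:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumption: Composer autoloader in the project root

use Symfony\Component\DomCrawler\Crawler;

// Illustrative form; names and values are hypothetical
$html = <<<'HTML'
<html><body>
<form action="/subscribe" method="post">
    <input type="text" name="email" value="jane@example.com">
    <input type="hidden" name="plan" value="free">
    <button type="submit">Submit</button>
</form>
</body></html>
HTML;

// The page URI lets the form resolve its relative action
$crawler = new Crawler($html, 'https://example.com/signup');

// selectButton() locates the button by its label and form() returns the enclosing form
$form = $crawler->selectButton('Submit')->form();

$values = $form->getValues(); // field name => current value
$method = $form->getMethod(); // upper-cased HTTP method
$target = $form->getUri();    // action resolved against the page URI

print_r($values);
```

The unnamed submit button does not appear in the values, since only fields with a name are submitted.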

3. Integration with Laravel HTTP Client

Fetch and parse HTML in a single flow:

use Illuminate\Support\Facades\Http;

$response = Http::get('https://example.com');
$crawler = new Crawler($response->body());
// text() returns only the first match and throws on an empty list unless a default is passed
$title = $crawler->filter('h1')->text('');

4. Batch Processing with Queues

Process large volumes of HTML asynchronously:

// Job: ParseHtmlJob.php
public function handle()
{
    $html = $this->fetchHtml(); // From DB, API, or file
    $crawler = new Crawler($html);
    $data = $crawler->filter('div.product')->each(fn($node) => [
        'name' => $node->filter('h2')->text(),
        'price' => $node->filter('.price')->text(),
    ]);
    // Store or process $data
}
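
The parsing step of such a job can live in a plain function, which keeps it testable outside the queue. A minimal sketch, assuming a Composer-managed project with symfony/css-selector installed; extractProducts() and the sample markup are hypothetical, and missing sub-elements become null rather than throwing:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumption: Composer autoloader in the project root

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: turns product markup into plain arrays.
 * A product without a name or price yields null for that key.
 */
function extractProducts(string $html): array
{
    $crawler = new Crawler($html);

    return $crawler->filter('div.product')->each(function (Crawler $node): array {
        $name = $node->filter('h2');
        $price = $node->filter('.price');

        return [
            // count() guards against text() throwing on an empty node list
            'name' => $name->count() ? trim($name->text()) : null,
            'price' => $price->count() ? trim($price->text()) : null,
        ];
    });
}

// Illustrative input: the second product has no price element
$html = <<<'HTML'
<div class="product"><h2>Lamp</h2><span class="price">19.99</span></div>
<div class="product"><h2>Desk</h2></div>
HTML;

$products = extractProducts($html);
print_r($products);
```

Inside a job, handle() would then only fetch the HTML, call the helper, and store the result.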

5. Testing Workflows

Simulate user interactions in Laravel tests:

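public function test_form_submission()
{
    // get() returns a TestResponse, so pass its body (and the page URL) to the Crawler
    $crawler = new Crawler($this->get('/form-page')->getContent(), url('/form-page'));
    $form = $crawler->selectButton('Submit')->form();
    $form['email'] = 'test@example.com';

    // The Crawler cannot submit forms itself; replay the form through Laravel's test client
    $response = $this->call($form->getMethod(), $form->getUri(), $form->getPhpValues());

    // Assumes the controller redirects to a route named "dashboard"
    $response->assertRedirect(route('dashboard'));
}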

6. XPath for Complex Queries

Use XPath when CSS selectors are insufficient:

$nodes = $crawler->filterXPath('//div[@class="product" and contains(@id, "active")]');
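
Note that @class="product" matches only when the attribute is exactly product, so an element with class="product featured" is skipped. A sketch of the common XPath 1.0 idiom for matching a single class token, assuming a Composer-managed project; the markup and IDs are illustrative:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumption: Composer autoloader in the project root

use Symfony\Component\DomCrawler\Crawler;

// Illustrative markup: only the first div has the "product" class AND an "active" id
$html = <<<'HTML'
<html><body>
<div class="product featured" id="item-active-1">A</div>
<div class="product" id="item-2">B</div>
<div class="byproduct" id="item-active-3">C</div>
</body></html>
HTML;

$crawler = new Crawler($html);

// Padding the class list with spaces makes " product " match a whole token only
$xpath = '//div[contains(concat(" ", normalize-space(@class), " "), " product ") and contains(@id, "active")]';

$ids = $crawler->filterXPath($xpath)->each(fn (Crawler $node) => $node->attr('id'));

print_r($ids);
```

filterXPath() does not depend on symfony/css-selector, so this form also works when only the crawler component is installed.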

7. Handling Malformed HTML

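The Crawler parses HTML leniently, so unclosed or misnested tags are repaired rather than rejected. The second constructor argument is the document URI; it is used to resolve relative links and has no effect on parsing or charset detection:

$crawler = new Crawler($html, 'https://example.com'); // URI is used for link resolution only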

Gotchas and Tips

Common Pitfalls

XXE Vulnerabilities:
- Avoid parsing untrusted XML with addXmlContent(); use addHtmlContent() for HTML.
- Fix: Always validate XML input or disable validateOnParse (as in CVE-2026-45071).
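Case-Sensitive Selectors:
- In HTML documents, tag and attribute names match case-insensitively, but class names, IDs, and attribute values are case-sensitive, so filter('a[href*="example"]') does not match href="/Example". Do not lowercase the whole selector, because that also changes class and ID names.
- Tip: For case-insensitive value matching, lowercase the attribute inside an XPath expression with translate():
```
$crawler->filterXPath('//a[contains(translate(@href, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "example")]');
```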

Orphaned Nodes:

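Malformed HTML can leave elements outside the parent you expect, so a nested selector may match nothing. A selector with no matches returns an empty Crawler rather than null, and calling text() or attr() on it throws. Use filter()->count() to debug: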

if ($crawler->filter('.missing-class')->count() === 0) {
    // Handle missing nodes
}
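
A short sketch of how an empty node list behaves, assuming a Composer-managed project; the markup and the 'n/a' fallback are illustrative. It uses filterXPath() so it runs without symfony/css-selector:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumption: Composer autoloader in the project root

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<html><body><p class="intro">Hello</p></body></html>');

// No element carries this class, so the result is an empty Crawler
$missing = $crawler->filterXPath('//*[@class="missing-class"]');
$count = $missing->count();

// text() without a default throws InvalidArgumentException on an empty list
try {
    $missing->text();
    $threw = false;
} catch (\InvalidArgumentException $e) {
    $threw = true;
}

// Passing a default returns it instead of throwing
$fallback = $missing->text('n/a');

// The default is ignored when a node is found
$intro = $crawler->filterXPath('//p[@class="intro"]')->text('n/a');

var_dump($count, $threw, $fallback, $intro);
```

Passing a default is usually shorter than a count() check when a single value is read; count() remains useful when the absence itself needs handling.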

Attribute Extraction:
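- attr() returns null if the attribute doesn't exist and throws InvalidArgumentException if the node list is empty. Symfony 6.4 and later accept a default as the second argument; on older versions, check count() and fall back with the ?? operator:
```
$href = $crawler->filter('a')->attr('href', 'default-link.com');
```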

Performance with Large Documents:

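The Crawler always parses the whole document into a DOM, so filter() cannot reduce memory use; it only limits how many nodes later calls visit. Prefer one combined selector over chained calls that each build an intermediate node list:

// Slower: two queries and an intermediate Crawler
$crawler->filter('body')->filter('.target');

// Faster: a single query
$crawler->filter('body .target');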

Debugging Tips

Inspect Nodes:

$node = $crawler->filter('.target')->first();
dump($node->html()); // Inner HTML of the first match; use outerHtml() to include the element itself

Log Selectors:

$selector = '.product';
$count = $crawler->filter($selector)->count();
logger()->debug("Selector '$selector' matched $count nodes");

Use each() for Iteration:

Prefer each() over loops for cleaner iteration:

$data = $crawler->filter('div.item')->each(fn($node) => [
    'id' => $node->attr('id'),
    'text' => $node->text(),
]);

Extension Points

Custom Node Filters: Create reusable filter logic:

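// filter() accepts only a CSS selector string; use reduce() for callback-based filtering
$active = $crawler->filter('[data-role]')->reduce(function (Crawler $node) {
    return $node->attr('data-role') === 'active';
});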

Integrate with Laravel Collections: Convert results to collections for Laravel-friendly processing:

use Illuminate\Support\Collection;

$collection = collect($crawler->filter('li')->each(fn($node) => $node->text()));

Custom Crawler Subclass: Extend Crawler for project-specific logic:

class CustomCrawler extends Crawler {
    public function extractProductData() {
        return $this->filter('.product')->each(fn($node) => [
            'name' => $node->filter('h3')->text(),
            'price' => $node->filter('.price')->text(),
        ]);
    }
}

Configuration Quirks

Charset Handling:
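- The URI argument does not influence charset detection. The Crawler takes the charset from the content type passed to addContent() or from a charset declaration in the markup. When you know the encoding, state it explicitly:
```
$crawler = new Crawler(null, 'https://example.com'); // URI only resolves relative links
$crawler->addHtmlContent($html, 'ISO-8859-1');
```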
HTML5 vs. Legacy Parsing:
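- Symfony 8+ uses PHP’s native HTML5 parser by default. The fourth constructor argument is a boolean ($useHtml5Parser), not an options array; on releases that still accept the flag, pass false for legacy behavior, and check the changelog of your installed version:
```
$crawler = new Crawler($html, null, null, false);
```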
Memory Limits:
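- Large documents may hit memory limits because the full DOM is built up front; filter() narrows the node list but does not free the parsed tree. Extract what you need, then release the Crawler:
```
$targets = $crawler->filter('body .target')->each(fn ($node) => $node->text());
unset($crawler); // Let PHP reclaim the DOM once the data is extracted
```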
