Dom Crawler Laravel Package

symfony/dom-crawler

Symfony DomCrawler makes it easy to parse and navigate HTML/XML documents. It provides a fluent API to filter elements, extract text/attributes, follow links and forms, and integrates well with HttpClient and BrowserKit for web scraping and testing.


Getting Started

Minimal Setup

  1. Installation:

    composer require symfony/dom-crawler
    

    No additional configuration is required—it’s a standalone component.

  2. First Use Case: Parse a simple HTML string or URL response:

    use Symfony\Component\DomCrawler\Crawler;
    
    $html = '<html><body><h1>Title</h1><p>Content</p></body></html>';
    $crawler = new Crawler($html);
    
    // Extract the title text
    $title = $crawler->filter('h1')->text();
    // Output: "Title"
    
  3. Where to Look First:

    • Official Documentation (API reference, selectors, traversal).
    • Laravel Integration: Use with Laravel’s Http client for scraping live pages:
      use Illuminate\Support\Facades\Http;
      use Symfony\Component\DomCrawler\Crawler;
      
      $response = Http::get('https://example.com');
      $crawler = new Crawler($response->body());
      

Implementation Patterns

Core Workflows

1. Scraping Structured Data

Extract tables, lists, or forms from HTML:

// Extract all table rows as arrays of cell text
$rows = $crawler->filter('table tr')->each(function (Crawler $row) {
    // each() on the cells collects every <td>; text() alone would return only the first
    return $row->filter('td')->each(function (Crawler $cell) {
        return $cell->text();
    });
});

// Example output: [['Data1', 'Data2'], ...] (rows containing only <th> cells yield empty arrays)

2. Form Submission Simulation

Automate interactions with legacy forms:

use Symfony\Component\BrowserKit\HttpBrowser;

// Submitting a form requires a BrowserKit client (composer require symfony/browser-kit symfony/http-client)
$client = new HttpBrowser();
$crawler = $client->request('GET', 'https://example.com/login');

$form = $crawler->selectButton('Submit')->form();
$form['username'] = 'test';
$form['password'] = 'secret';
$crawler = $client->submit($form);

3. XPath for Complex Queries

Leverage XPath when CSS selectors fall short:

$nodes = $crawler->filterXPath('//div[@class="product" and @data-id]');

4. Laravel Service Wrapper

Encapsulate logic in a service for reusability:

namespace App\Services;

use Symfony\Component\DomCrawler\Crawler;
use Illuminate\Support\Facades\Http;

class ScraperService {
    public function scrapeProductData(string $url): array {
        $response = Http::get($url);
        $crawler = new Crawler($response->body());

        return $crawler->filter('.product')->each(function (Crawler $node) {
            return [
                'name' => $node->filter('h2')->text(),
                'price' => $node->filter('.price')->text(),
            ];
        });
    }
}
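One way to consume the service is through Laravel's automatic constructor/method injection. The controller and URL below are illustrative, not part of the package:

```php
namespace App\Http\Controllers;

use App\Services\ScraperService;

class ProductController extends Controller
{
    // Hypothetical endpoint: Laravel resolves ScraperService from the container
    public function index(ScraperService $scraper)
    {
        // Example target URL; returns the scraped array as JSON
        $products = $scraper->scrapeProductData('https://example.com/products');

        return response()->json($products);
    }
}
```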

5. Testing Rendered Views

Assert DOM structure in Laravel tests:

use Illuminate\Foundation\Testing\TestCase;
use Symfony\Component\DomCrawler\Crawler;

public function testHomepageStructure() {
    $response = $this->get('/');
    $crawler = new Crawler($response->getContent());

    $this->assertEquals('Welcome', $crawler->filter('h1')->text());
    $this->assertCount(3, $crawler->filter('.feature'));
}

6. Handling Pagination

Loop through paginated results:

$baseUrl = 'https://example.com/page/';
$data = [];

for ($i = 1; $i <= 5; $i++) {
    $response = Http::get($baseUrl . $i);
    $crawler = new Crawler($response->body());
    // Merge rather than append: extract() already returns an array per page
    $data = array_merge($data, $crawler->filter('.item')->extract(['href', 'title']));
}
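When the page count is unknown, a sketch like the following can follow next-page links instead of a fixed loop. It assumes each page exposes an `a[rel="next"]` link and stops when none is found:

```php
use Illuminate\Support\Facades\Http;
use Symfony\Component\DomCrawler\Crawler;

$url = 'https://example.com/page/1'; // example starting URL
$data = [];

while ($url !== null) {
    // Pass the current URL as base URI so link() can resolve relative hrefs
    $crawler = new Crawler(Http::get($url)->body(), $url);
    $data = array_merge($data, $crawler->filter('.item')->extract(['href', 'title']));

    // Follow the next-page link, if any
    $next = $crawler->filter('a[rel="next"]');
    $url = $next->count() > 0 ? $next->link()->getUri() : null;
}
```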

7. XML Parsing

Parse XML feeds (e.g., RSS):

$xml = '<root><item><title>Test</title></item></root>';
$crawler = new Crawler($xml);
// each() collects every matching title; text() alone would return only the first match
$titles = $crawler->filter('item title')->each(fn (Crawler $node) => $node->text());

Integration Tips

With Laravel HTTP Client

Combine with Laravel’s Http facade for seamless requests:

use Illuminate\Support\Facades\Http;
use Symfony\Component\DomCrawler\Crawler;

// 'verify' => false disables TLS verification; use only for local debugging
$response = Http::withOptions(['verify' => false])->get('https://example.com');
$crawler = new Crawler($response->body());

With Guzzle

For advanced HTTP features (e.g., cookies, headers):

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$response = $client->request('GET', 'https://example.com', [
    'headers' => ['User-Agent' => 'Mozilla/5.0'],
]);
$crawler = new Crawler((string) $response->getBody()); // cast: Guzzle returns a PSR-7 stream, not a string

With Queue Jobs

Offload scraping to background jobs:

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\Http;
use Symfony\Component\DomCrawler\Crawler;

class ScrapeJob implements ShouldQueue {
    use Queueable; // Queueable is a trait, not an interface

    public function handle(): void {
        $response = Http::get('https://example.com');
        $crawler = new Crawler($response->body());
        // Process data...
    }
}

// Dispatch from anywhere: ScrapeJob::dispatch();

With Laravel Blade

Pre-render HTML for testing:

// render() returns the compiled HTML for the view
$html = view('partials.product-card', ['product' => $product])->render();
$crawler = new Crawler($html);

Gotchas and Tips

Pitfalls

1. Malformed HTML

  • Issue: DomCrawler may throw errors or behave unpredictably with broken HTML.
  • Fix: Suppress libxml warnings with libxml_use_internal_errors(true), or pre-process with Tidy. (Recent DomCrawler versions also use an HTML5 parser automatically when masterminds/html5 is installed.)
    libxml_use_internal_errors(true);
    $crawler = new Crawler($html, 'https://example.com');

    // Or pre-process with Tidy (requires the tidy extension):
    $tidy = new \tidy;
    $tidy->parseString($html, [], 'utf8');
    $tidy->cleanRepair();
    $crawler = new Crawler((string) $tidy);
    

2. Case Sensitivity in Selectors

  • Issue: Class and attribute values are case-sensitive in both CSS selectors and XPath, so 'ClassName' will not match 'classname'.
  • Fix: Normalize the document, or use XPath's translate() to lowercase the attribute before comparing:
    // CSS (matches the exact value only)
    $crawler->filter('div.classname');
    
    // XPath (case-insensitive match via translate())
    $crawler->filterXPath('//div[contains(translate(@class, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "classname")]');
    

3. Relative URLs

  • Issue: Links extracted with link() or attr('href') may be relative.
  • Fix: Pass the page's base URI as the Crawler constructor's second argument so links() can resolve them:
    $crawler = new Crawler($html, 'https://example.com');
    $uris = array_map(
        fn ($link) => $link->getUri(), // getUri() resolves against the base URI
        $crawler->filter('a')->links()
    );
    // Example: ['https://example.com/absolute-link', ...]
    
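If no base URI was captured at parse time, a plain-PHP fallback can resolve the common cases by hand. This helper is illustrative and handles only absolute, root-relative, and path-relative hrefs (no `..` segments, query strings, or fragments):

```php
// Hypothetical helper for resolving scraped hrefs without a Crawler base URI
function resolveUrl(string $base, string $href): string
{
    // Already absolute? Return unchanged.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    if (str_starts_with($href, '/')) {
        // Root-relative path: append to the origin
        return $origin . $href;
    }

    // Path-relative: append to the base URL's directory
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $href;
}
```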

4. Memory Usage

  • Issue: Large HTML documents consume significant memory; the whole document is loaded into a DOM tree.
  • Fix: Process matched nodes one at a time with each() and release the Crawler between documents (DomCrawler itself cannot stream):
    $crawler = new Crawler($html);
    $crawler->filter('div.item')->each(function (Crawler $node) {
        // Process one item at a time
    });
    unset($crawler); // free the DOM before parsing the next document
    

5. Dynamic Content

  • Issue: DomCrawler cannot parse JavaScript-rendered content.
  • Fix: Use headless browsers (e.g., Symfony Panther or Puppeteer):
    use Symfony\Component\Panther\Client;
    
    $client = Client::createChromeClient();
    $crawler = $client->request('GET', 'https://example.com');
    

6. Encoding Issues

  • Issue: Non-UTF-8 content may cause parsing errors.
  • Fix: Specify encoding or pre-convert:
    $html = mb_convert_encoding($html, 'UTF-8', 'ISO-8859-1');
    $crawler = new Crawler($html);
    
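The conversion above assumes the source encoding is known. When it is not, a small helper can detect it among likely candidates first; the function name and candidate list here are illustrative (requires the mbstring extension):

```php
// Hypothetical helper: normalize input to UTF-8 before handing it to Crawler
function toUtf8(string $html): string
{
    // Strict detection among a few common web encodings
    $encoding = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);

    if ($encoding === false) {
        // Fall back to ISO-8859-1, which maps every byte to a code point
        $encoding = 'ISO-8859-1';
    }

    return $encoding === 'UTF-8' ? $html : mb_convert_encoding($html, 'UTF-8', $encoding);
}

// A Latin-1 byte string: "café" with é encoded as the single byte 0xE9
$latin1 = "caf\xE9";
$utf8 = toUtf8($latin1); // now valid UTF-8
```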

Debugging Tips

1. Inspect the DOM

Use html() to debug the parsed structure:

$crawler = new Crawler($html);
dd($crawler->html()); // Dump raw HTML for inspection

2. Validate Selectors

Test selectors in browser DevTools first, then adapt for DomCrawler:

// Test if the selector matches anything before calling text()/attr()
if ($crawler->filter('.my-selector')->count() > 0) {
    // Safe to extract
}