Dom Crawler Laravel Package

symfony/dom-crawler

Symfony DomCrawler makes it easy to navigate and query HTML/XML DOMs using CSS selectors and XPath. Extract links, forms, and text, filter nodes, and chain queries for robust scraping, testing, and content parsing in PHP.


Getting Started

Minimal Steps

  1. Installation:

    composer require symfony/dom-crawler
    

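    No additional configuration is required; it is a standalone component. CSS-selector methods such as filter() also need the symfony/css-selector package (composer require symfony/css-selector).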

  2. First Use Case: Parse a simple HTML string and extract data:

    use Symfony\Component\DomCrawler\Crawler;
    
    $html = '<html><body><h1>Title</h1><p>Content</p></body></html>';
    $crawler = new Crawler($html);
    
    // Extract the text of the first <h1>
    $title = $crawler->filter('h1')->text();
    // $title now holds "Title"
    
  3. Where to Look First:

    • Official Documentation (API reference, examples).
    • GitHub Issues for edge cases (e.g., malformed HTML).
    • Laravel-specific integrations (e.g., pairing with Guzzle for HTTP requests).

Implementation Patterns

Core Workflows

1. Scraping Data from Web Pages

  • Fetch HTML (e.g., with Guzzle):
    $client = new \GuzzleHttp\Client();
    $response = $client->request('GET', 'https://example.com');
    $html = $response->getBody()->getContents();
    
  • Parse and Extract:
    // Pass the page URL as the base URI so links() can resolve relative hrefs
    $crawler = new Crawler($html, 'https://example.com');
    $links = $crawler->filter('a')->links();
    foreach ($links as $link) {
        echo $link->getUri() . "\n";
    }
    
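To illustrate why the base URI matters when extracting links, here is a minimal sketch. It assumes a Composer autoloader at vendor/autoload.php with symfony/dom-crawler and symfony/css-selector installed; the inline HTML and the example.com base URI are made-up stand-ins for a fetched page.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Inline HTML stands in for a downloaded page.
$html = '<html><body>'
    . '<a href="/about">About</a>'
    . '<a href="https://example.org/x">External</a>'
    . '</body></html>';

// The second argument is the base URI used to resolve relative links.
$crawler = new Crawler($html, 'https://example.com/blog/');

// links() returns Link objects; getUri() yields absolute URLs.
$uris = array_map(
    static fn ($link) => $link->getUri(),
    $crawler->filter('a')->links()
);

// extract() returns the raw text and attribute values without resolving them.
$raw = $crawler->filter('a')->extract(['_text', 'href']);

print_r($uris);
print_r($raw);
```

Without the base URI, links() throws as soon as it meets a relative href, while extract() keeps working because it never resolves anything.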

2. Form Interaction Automation

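  • Fill in form fields and submit the data. Guzzle has no submit() method, so send the form's method, URI, and values yourself:
    // The base URI is needed to resolve the form's action attribute
    $crawler = new Crawler($html, 'https://example.com');
    $form = $crawler->selectButton('Submit')->form();
    $form['username'] = 'test';
    $form['password'] = 'secret';
    // Assumes a POST form; for GET forms, getUri() already carries the values as a query string
    $client->request($form->getMethod(), $form->getUri(), [
        'form_params' => $form->getPhpValues(),
    ]);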
    
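The pieces a form object exposes can be inspected without sending any request. The sketch below assumes the same Composer setup as the rest of this guide; the login form markup and the example.com base URI are invented for illustration.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// A made-up login form used only for this example.
$html = <<<'HTML'
<html><body>
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <button type="submit">Submit</button>
</form>
</body></html>
HTML;

// The base URI lets the crawler turn action="/login" into an absolute URL.
$crawler = new Crawler($html, 'https://example.com/');

// form() accepts an array of field values to set in one call.
$form = $crawler->selectButton('Submit')->form([
    'username' => 'test',
    'password' => 'secret',
]);

$method = $form->getMethod();    // HTTP method, upper-cased
$uri    = $form->getUri();       // absolute submission URL
$values = $form->getPhpValues(); // field values as a PHP array

echo $method, ' ', $uri, "\n";
print_r($values);
```

These three values are what any HTTP client needs to replay the submission, which is why the Guzzle call passes them through unchanged.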

3. XPath for Complex Queries

  • Use XPath for precise node selection:
    $nodes = $crawler->filterXPath('//div[@class="product"]//span[@itemprop="price"]');
    foreach ($nodes as $node) {
        echo $node->textContent . "\n";
    }
    

4. Iterating Over Node Collections

  • Chain methods for fluent traversal:
    $crawler->filter('table tr')
        ->each(function (Crawler $row) {
            // text() reads only the first <td>; the '' default avoids an exception on rows without one
            echo $row->filter('td')->text('') . "\n";
        });
    
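Because each() returns an array of the callback's return values, nested calls can turn a table into rows of cells. This is a small sketch under the same Composer assumptions; the table content is invented.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up table: one header row followed by two data rows.
$html = '<table>'
    . '<tr><th>Name</th><th>Price</th></tr>'
    . '<tr><td>Widget</td><td>9.99</td></tr>'
    . '<tr><td>Gadget</td><td>19.50</td></tr>'
    . '</table>';

$crawler = new Crawler($html);

// Outer each(): one entry per <tr>. Inner each(): one string per <td>.
$rows = $crawler->filter('table tr')->each(
    static fn (Crawler $row): array => $row->filter('td')->each(
        static fn (Crawler $cell): string => $cell->text()
    )
);

// The header row has no <td> cells and yields an empty array, so drop it.
$rows = array_values(array_filter($rows));

print_r($rows);
```

Calling each() on an empty match simply returns an empty array, so the header row needs no special guard here.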

5. Laravel Integration

  • Service Provider Binding:
    // app/Providers/AppServiceProvider.php
    public function register()
    {
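        // bind() rather than singleton(): a Crawler holds parsed nodes, so each consumer should get a fresh one
        $this->app->bind(Crawler::class, function () {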
            return new Crawler();
        });
    }
    
  • HTTP Client Default Headers (for scraping):
    $client = new \GuzzleHttp\Client([
        // Sent with every request made through this client
        'headers' => ['User-Agent' => 'Mozilla/5.0'],
    ]);
    

Integration Tips

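  1. Handle Malformed HTML: The parser repairs broken markup as it loads the document, and the native HTML5 parser (PHP 8.4+) is the more robust option where available. The second constructor argument shown here is the base URI for resolving relative links; it does not select a parser: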

    $crawler = new Crawler($html, 'https://example.com');
    
  2. Rate Limiting: Add delays between requests to avoid IP bans. Guzzle's delay option is given in milliseconds:

    $client = new \GuzzleHttp\Client(['delay' => 2000]); // 2-second delay before each request
    
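  3. Store Parsed Data: Use Laravel’s Eloquent or collections to persist results. Crawler has no map() method; each() returns a plain array of the callback's return values:

    $products = $crawler->filter('.product')->each(function (Crawler $node) {
        return [
            'name' => $node->filter('h2')->text(),
            'price' => $node->filter('.price')->text(),
        ];
    });
    // Product is your own Eloquent model; wrap with collect($products) if you need a collection
    Product::insert($products);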
    
  4. Testing: Mock HTML responses in PHPUnit:

    $crawler = new Crawler('<html><body><div>Test</div></body></html>');
    $this->assertEquals('Test', $crawler->filter('div')->text());
    

Gotchas and Tips

Pitfalls

  1. XXE Vulnerabilities:

    • Avoid parsing untrusted XML with addXmlContent() (fixed in v8.0.12+):
      // Risky with untrusted input; newer versions disable validateOnParse by default
      $crawler = new Crawler();
      $crawler->addXmlContent('<root><![CDATA[<script>alert(1)</script>]]></root>');
      
    • Fix: Sanitize input or use addHtmlContent() for HTML.
  2. Case-Sensitive Selectors:

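    • In HTML documents, CSS selectors match element names case-insensitively, but class names and attribute values stay case-sensitive. XPath is case-sensitive throughout:
      // Works (CSS): the class must still be spelled exactly "className"
      $crawler->filter('div.className');
      
      // Matches nothing (XPath): HTML element names are lowercase in the parsed DOM
      $crawler->filterXPath('//DIV[@class="className"]');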
      
  3. Orphaned Nodes:

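    • Parsers repair malformed HTML by re-parenting or closing elements, so the parsed tree can differ from the source. Selectors that depend on exact nesting may then match nothing:
      // Returns an empty crawler if the span did not end up as a direct child of a div
      $crawler->filter('div > span')->each(...);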
      
    • Fix: Validate HTML structure or use filterXPath('//span') for broader matching.
  4. Memory Leaks:

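    • Large crawls (e.g., 10K+ pages) can bloat memory. Wrap the loop in a generator function (crawlPages is an illustrative name) so only one Crawler is alive at a time:
      function crawlPages(\GuzzleHttp\Client $client, iterable $urls): \Generator
      {
          foreach ($urls as $url) {
              $html = (string) $client->get($url)->getBody();
              yield $url => new Crawler($html, $url);
          }
      }
      $client = new \GuzzleHttp\Client();
      $urls = ['url1', 'url2', ...];
      foreach (crawlPages($client, $urls) as $url => $crawler) {
          // process $crawler, then let it go out of scope
      }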
      
  5. Attribute Selection:

    • Some attributes (e.g., value on <button>) require explicit handling:
      $value = $crawler->filter('button')->attr('value'); // Works in v7.3.1+
      
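The case-sensitivity pitfall is easy to check on a tiny document. The following sketch assumes the same Composer setup with symfony/css-selector installed; the markup is invented, and the counts are only printed rather than asserted in comments.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(
    '<html><body><div class="className"><span>Hi</span></div></body></html>'
);

// CSS: the element name is normalised for HTML documents, so DIV still matches.
$cssUpperTag = $crawler->filter('DIV')->count();

// XPath: element names are compared exactly, and the parsed DOM uses "div".
$xpathUpperTag = $crawler->filterXPath('//DIV')->count();

// CSS class names are compared exactly, so a lower-cased class does not match.
$cssWrongClass = $crawler->filter('div.classname')->count();
$cssExactClass = $crawler->filter('div.className')->count();

printf(
    "css DIV: %d, xpath //DIV: %d, .classname: %d, .className: %d\n",
    $cssUpperTag,
    $xpathUpperTag,
    $cssWrongClass,
    $cssExactClass
);
```

Counting matches like this before reading text or attributes is a cheap way to confirm that a selector behaves as expected on real pages.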

Debugging Tips

  1. Inspect Nodes: Use html() to debug rendered output:

    echo $crawler->filter('div')->html();
    
  2. Log Queries: Add logging for failed selectors:

    if ($crawler->filter('selector')->count() === 0) {
        Log::warning('Selector not found: selector', ['crawler' => $crawler->html()]);
    }
    
  3. Validate HTML: Use tools like W3C Validator to pre-check input.

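The count check from the logging tip can be wrapped in a small helper so that a missing selector yields null instead of an exception. This is only a sketch: firstTextOrNull is a hypothetical function name, and PHP's error_log() stands in for Laravel's Log facade to keep the example free of framework dependencies.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: returns the text of the first match, or null
 * (after logging a warning) when the selector matches nothing.
 */
function firstTextOrNull(Crawler $crawler, string $selector): ?string
{
    $matches = $crawler->filter($selector);

    if ($matches->count() === 0) {
        // In a Laravel app this could be Log::warning() instead.
        error_log(sprintf('Selector "%s" matched nothing', $selector));

        return null;
    }

    return $matches->first()->text();
}

$crawler = new Crawler('<html><body><div>Test</div></body></html>');

$found   = firstTextOrNull($crawler, 'div');
$missing = firstTextOrNull($crawler, 'p.missing');

var_dump($found, $missing);
```

Returning null keeps the calling code explicit about missing data, whereas text() and html() without a default throw when the node list is empty.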

Extension Points

  1. Custom Crawler Classes: Extend Crawler for reusable logic:

    class ProductCrawler extends Crawler {
        public function extractProducts() {
            return $this->filter('.product')->each(...); // each() returns an array; Crawler has no map()
        }
    }
    
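  2. Event Listeners: DomCrawler does not dispatch events itself. If using Symfony, hook into the surrounding layer (for example, HTTP client or kernel events) to pre-process HTML before it reaches the crawler.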

  3. Laravel Service Providers: Bind custom crawler instances:

    $this->app->bind(ProductCrawler::class, function () {
        return new ProductCrawler();
    });
    
  4. Testing Helpers: Create a base test case:

    abstract class CrawlerTestCase extends TestCase {
        protected function assertCrawlerHasText(Crawler $crawler, string $selector, string $text) {
            $this->assertEquals($text, $crawler->filter($selector)->text());
        }
    }
    
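A filled-in version of the custom crawler idea might look like the sketch below. The markup, the h2 and .price selectors, and the returned array shape are assumptions made for illustration, not part of the package.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Illustrative subclass; the selectors below are assumed page structure.
class ProductCrawler extends Crawler
{
    /** @return list<array{name: string, price: string}> */
    public function extractProducts(): array
    {
        return $this->filter('.product')->each(
            static fn (Crawler $node): array => [
                // The '' default keeps a product with a missing field from throwing.
                'name'  => $node->filter('h2')->text(''),
                'price' => $node->filter('.price')->text(''),
            ]
        );
    }
}

$html = '<html><body>'
    . '<div class="product"><h2>Widget</h2><span class="price">9.99</span></div>'
    . '<div class="product"><h2>Gadget</h2></div>'
    . '</body></html>';

$products = (new ProductCrawler($html))->extractProducts();

print_r($products);
```

The subclass keeps the parent constructor unchanged, which matters because the crawler creates sub-crawlers of the same class when filtering.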

Performance Quirks

  1. XPath vs. CSS:

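    • CSS selectors are translated to XPath internally, so XPath is not inherently slower; it is more powerful but harder to read. Prefer CSS for simple queries:
      // Simpler
      $crawler->filter('div.product');
      
      // More flexible (also matches classes such as "product-card")
      $crawler->filterXPath('//div[contains(@class, "product")]');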
      
  2. Reuse Crawlers: Avoid recreating crawlers for small changes:

    // Bad: Creates new crawler per iteration
    foreach ($urls as $url) {
        $crawler = new Crawler($html);
        // ...
    }
    
    // Good: Reuse crawler (call clear() first; adding nodes from a second document throws)
    $crawler = new Crawler();
    foreach ($urls as $url) {
        $crawler->clear();
        $crawler->addHtmlContent($html);
        // ...
    }
    
  3. Parallel Processing: Use Laravel’s queues or Symfony’s Messenger for concurrent scraping:

    foreach ($urls as $url) {
        ScrapeJob::dispatch($url)->onQueue('scraping');
    }
    