symfony/dom-crawler
Symfony DomCrawler makes it easy to navigate and query HTML/XML DOMs using CSS selectors and XPath. Extract links, forms, and text, filter nodes, and chain queries for robust scraping, testing, and content parsing in PHP.
Installation:
composer require symfony/dom-crawler
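No additional configuration is required; it is a standalone component. CSS selectors via filter() additionally need symfony/css-selector (composer require symfony/css-selector).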
First Use Case: Parse a simple HTML string and extract data:
use Symfony\Component\DomCrawler\Crawler;
$html = '<html><body><h1>Title</h1><p>Content</p></body></html>';
$crawler = new Crawler($html);
// Extract the title
$title = $crawler->filter('h1')->text();
// Output: "Title"
Where to Look First:
$client = new \GuzzleHttp\Client();
$response = $client->request('GET', 'https://example.com');
$html = $response->getBody()->getContents();
$crawler = new Crawler($html, 'https://example.com'); // base URI lets links() resolve relative hrefs
$links = $crawler->filter('a')->links();
foreach ($links as $link) {
echo $link->getUri() . "\n";
}
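To illustrate why the base URI matters when collecting links, here is a minimal sketch. The markup and URLs are hypothetical, and it assumes symfony/dom-crawler and symfony/css-selector are installed through Composer.

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup with one relative and one absolute link.
$html = '<html><body>'
    . '<a href="/about">About</a>'
    . '<a href="https://other.test/docs">Docs</a>'
    . '</body></html>';

// The second argument is the URI the document was fetched from.
$crawler = new Crawler($html, 'https://example.com/blog/post');

$uris = [];
foreach ($crawler->filter('a')->links() as $link) {
    // getUri() resolves relative hrefs against the base URI.
    $uris[] = $link->getUri();
}

print_r($uris);
```

Without a base URI, links() cannot turn a relative href such as /about into an absolute URL.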
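$crawler = new Crawler($html, 'https://example.com'); // base URI is needed to resolve the form action
$form = $crawler->selectButton('Submit')->form();
$form['username'] = 'test';
$form['password'] = 'secret';
// Guzzle has no submit(); send the form's method, URI, and values yourself (assumes a POST form)
$client->request($form->getMethod(), $form->getUri(), ['form_params' => $form->getPhpValues()]);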
$nodes = $crawler->filterXPath('//div[@class="product"]//span[@itemprop="price"]');
foreach ($nodes as $node) {
echo $node->textContent . "\n";
}
$crawler->filter('table tr')
->each(function (Crawler $row) {
echo $row->filter('td')->text() . "\n";
});
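The row iteration above prints only the first cell of each row and fails on rows that contain no td cells, such as header rows. One possible way to collect every cell, sketched with a hypothetical table and assuming Composer-installed symfony/dom-crawler and symfony/css-selector:

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical table used only for illustration.
$html = <<<'HTML'
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
HTML;

$crawler = new Crawler($html);

// Outer each() walks the rows, inner each() walks the cells of one row.
$rows = $crawler->filter('table tr')->each(function (Crawler $row): array {
    return $row->filter('td')->each(fn (Crawler $cell): string => $cell->text());
});

// The header row has no <td> cells and yields an empty array; drop it.
$rows = array_values(array_filter($rows));

print_r($rows);
```

Calling each() on an empty selection returns an empty array instead of throwing, which is why the header row is safe here.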
// app/Providers/AppServiceProvider.php
public function register()
{
$this->app->singleton(Crawler::class, function () {
return new Crawler();
});
}
$client = new \GuzzleHttp\Client([
    // Default headers sent with every request made by this client
    'headers' => ['User-Agent' => 'Mozilla/5.0'],
]);
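Handle Malformed HTML: Recent versions of the component can use PHP's native HTML5 parser (PHP 8.4+), which tolerates broken markup. The constructor's second argument below is the document URI used to resolve relative links; it does not select the parser:
$crawler = new Crawler($html, 'https://example.com');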
Rate Limiting: Add delays between requests to avoid IP bans:
$client->request('GET', 'https://example.com', ['delay' => 2000]); // Guzzle's delay option is in milliseconds (2 seconds here)
Store Parsed Data: Use Laravel’s Eloquent or collections to persist results:
// Crawler has no map() method; each() returns a plain PHP array
$products = $crawler->filter('.product')->each(function (Crawler $node) {
    return [
        'name' => $node->filter('h2')->text(),
        'price' => $node->filter('.price')->text(),
    ];
});
Product::insert($products);
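When each record is only a handful of attributes, extract() can build the rows in a single call. The sketch below uses hypothetical markup and key names and assumes Composer-installed symfony/dom-crawler and symfony/css-selector; persisting the rows is left out.

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical product listing.
$html = '<html><body>'
    . '<div class="product"><a href="/p/1">Widget</a></div>'
    . '<div class="product"><a href="/p/2">Gadget</a></div>'
    . '</body></html>';

$crawler = new Crawler($html);

// "_text" is a special name for the node's text; "href" is a regular attribute.
$pairs = $crawler->filter('.product a')->extract(['_text', 'href']);

// Turn the positional pairs into associative rows ready for a bulk insert.
$rows = array_map(
    fn (array $pair): array => ['name' => $pair[0], 'url' => $pair[1]],
    $pairs
);

print_r($rows);
```

The resulting array has the same shape that a bulk insert expects, so it could be passed on without further mapping.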
Testing: Mock HTML responses in PHPUnit:
$crawler = new Crawler('<html><body><div>Test</div></body></html>');
$this->assertEquals('Test', $crawler->filter('div')->text());
XXE Vulnerabilities:
Be careful when passing untrusted XML to addXmlContent() (fixed in v8.0.12+):
// UNSAFE on older versions; newer versions disable validateOnParse by default
$crawler = new Crawler();
$crawler->addXmlContent('<root><![CDATA[<script>alert(1)</script>]]></root>');
Use addHtmlContent() for HTML.
Case-Sensitive Selectors:
// Works (CSS)
$crawler->filter('div.className');
// May fail (XPath)
$crawler->filterXPath('//DIV[@class="className"]');
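The difference can be checked directly. In this small sketch (hypothetical markup, assuming Composer-installed symfony/dom-crawler), the HTML parser lowercases element names, so an uppercase XPath name test finds nothing:

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// The source markup uses an uppercase tag name on purpose.
$crawler = new Crawler('<html><body><DIV class="className">Hello</DIV></body></html>');

// XPath name tests are case-sensitive; the parsed element is named "div".
$upper = $crawler->filterXPath('//DIV[@class="className"]')->count();
$lower = $crawler->filterXPath('//div[@class="className"]')->count();

printf("uppercase: %d, lowercase: %d\n", $upper, $lower);
```

Writing element names in lowercase in XPath expressions avoids the mismatch for HTML input.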
Orphaned Nodes:
Use filter() carefully:
// May throw if parent is null
$crawler->filter('div > span')->each(...);
Use filterXPath('//span') for broader matching.
Memory Leaks:
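// yield is only valid inside a function, so wrap the loop in a generator
function crawlPages(\GuzzleHttp\Client $client, array $urls): \Generator {
    foreach ($urls as $url) {
        $html = (string) $client->get($url)->getBody(); // cast the response stream to a string
        yield new Crawler($html, $url); // pages are produced lazily, one at a time
    }
}
$urls = ['url1', 'url2', ...];
foreach (crawlPages(new \GuzzleHttp\Client(), $urls) as $crawler) { /* process one page */ }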
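The generator idea can be tried without any network access. The following sketch uses a hypothetical helper, crawlersFor(), and in-memory HTML strings in place of HTTP responses; it assumes Composer-installed symfony/dom-crawler and symfony/css-selector.

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: lazily wraps each HTML string in a Crawler,
 * keyed by the URI it is assumed to come from.
 */
function crawlersFor(iterable $pages): \Generator
{
    foreach ($pages as $uri => $html) {
        yield $uri => new Crawler($html, $uri);
    }
}

// Stand-ins for fetched pages.
$pages = [
    'https://example.com/a' => '<html><body><h1>First</h1></body></html>',
    'https://example.com/b' => '<html><body><h1>Second</h1></body></html>',
];

$titles = [];
foreach (crawlersFor($pages) as $uri => $crawler) {
    // Keep only the extracted value, not the crawler and its DOM.
    $titles[$uri] = $crawler->filter('h1')->text();
    unset($crawler);
}

print_r($titles);
```

Only the small extracted values accumulate; each crawler becomes collectable once the loop moves on.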
Attribute Selection:
Some attributes (e.g., value on <button>) require explicit handling:
$value = $crawler->filter('button')->attr('value'); // Works in v7.3.1+
Inspect Nodes:
Use html() to debug rendered output:
echo $crawler->filter('div')->html();
Log Queries: Add logging for failed selectors:
if ($crawler->filter('selector')->count() === 0) {
Log::warning('Selector not found: selector', ['crawler' => $crawler->html()]);
}
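A related defensive pattern is to give text() a default value, since calling it on an empty selection throws. A minimal sketch with hypothetical markup, assuming Composer-installed symfony/dom-crawler and symfony/css-selector:

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup: a name is present, a price is not.
$crawler = new Crawler('<html><body><div class="name">Widget</div></body></html>');

// With a default argument, text() returns the default for an empty selection.
$name = $crawler->filter('.name')->text('n/a');
$price = $crawler->filter('.price')->text('n/a');

// Without a default, an empty selection raises an exception.
$threw = false;
try {
    $crawler->filter('.price')->text();
} catch (\InvalidArgumentException $e) {
    $threw = true;
}

var_dump($name, $price, $threw);
```

A default keeps batch scrapes running, while the exception path remains useful when a missing node should be treated as an error.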
Validate HTML: Use tools like W3C Validator to pre-check input.
Custom Crawler Classes:
Extend Crawler for reusable logic:
class ProductCrawler extends Crawler {
public function extractProducts() {
return $this->filter('.product')->each(...); // Crawler provides each(), not map()
}
}
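A fuller version of such a subclass might look like the sketch below. The class body, selectors, and array keys are illustrative assumptions rather than part of the component, and it assumes Composer-installed symfony/dom-crawler and symfony/css-selector.

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical subclass that bundles product extraction logic.
class ProductCrawler extends Crawler
{
    /** @return array<int, array{name: string, price: string}> */
    public function extractProducts(): array
    {
        // each() passes a crawler for every matched node and collects the return values.
        return $this->filter('.product')->each(static function (Crawler $node): array {
            return [
                'name' => $node->filter('h2')->text(),
                'price' => $node->filter('.price')->text(),
            ];
        });
    }
}

// Hypothetical listing page.
$html = '<html><body>'
    . '<div class="product"><h2>Widget</h2><span class="price">9.99</span></div>'
    . '<div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>'
    . '</body></html>';

$products = (new ProductCrawler($html))->extractProducts();

print_r($products);
```

Keeping the selectors inside one method means a markup change on the target site only has to be handled in one place.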
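Event Listeners: DomCrawler does not dispatch events itself. If you are using Symfony, hook into the events of the surrounding code (for example, your own scraping service) to run or adjust crawler logic.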
Laravel Service Providers: Bind custom crawler instances:
$this->app->bind(ProductCrawler::class, function () {
return new ProductCrawler();
});
Testing Helpers: Create a base test case:
abstract class CrawlerTestCase extends TestCase {
protected function assertCrawlerHasText(Crawler $crawler, string $selector, string $text) {
$this->assertEquals($text, $crawler->filter($selector)->text());
}
}
XPath vs. CSS:
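// CSS: concise; translated to XPath internally via symfony/css-selector
$crawler->filter('div.product');
// XPath: skips the translation step and is more flexible, but this contains() test also matches classes such as "product-list"
$crawler->filterXPath('//div[contains(@class, "product")]');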
Reuse Crawlers: Avoid recreating crawlers for small changes:
// Bad: Creates new crawler per iteration
foreach ($urls as $url) {
$crawler = new Crawler($html);
// ...
}
// Good: Reuse crawler
$crawler = new Crawler();
foreach ($urls as $url) {
    $crawler->clear(); // drop the previous document's nodes before loading the next page
    $crawler->addHtmlContent($html);
    // ...
}
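Reusing one instance is only safe if the crawler is emptied between documents. A minimal sketch of that pattern, using in-memory HTML strings as stand-ins for fetched pages and assuming Composer-installed symfony/dom-crawler and symfony/css-selector:

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Stand-ins for the HTML of two fetched pages.
$pages = [
    '<html><body><h1>First</h1></body></html>',
    '<html><body><h1>Second</h1></body></html>',
];

$crawler = new Crawler();
$titles = [];

foreach ($pages as $html) {
    // clear() removes the nodes of the previous document.
    $crawler->clear();
    $crawler->addHtmlContent($html);
    $titles[] = $crawler->filter('h1')->text();
}

// After the loop the crawler holds only the root node of the last page.
$remaining = count($crawler);

print_r($titles);
```

Whether reuse is measurably cheaper than creating a new instance per page depends on the workload, so it is worth profiling before relying on it.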
Parallel Processing: Use Laravel’s queues or Symfony’s Messenger for concurrent scraping:
foreach ($urls as $url) {
ScrapeJob::dispatch($url)->onQueue('scraping');
}