symfony/dom-crawler
Symfony DomCrawler makes it easy to parse and navigate HTML/XML documents. It provides a fluent API to filter elements, extract text/attributes, follow links and forms, and integrates well with HttpClient and BrowserKit for web scraping and testing.
Installation:
composer require symfony/dom-crawler
No additional configuration is required—it’s a standalone component.
First Use Case: Parse a simple HTML string or URL response:
use Symfony\Component\DomCrawler\Crawler;
$html = '<html><body><h1>Title</h1><p>Content</p></body></html>';
$crawler = new Crawler($html);
// Extract the title text
$title = $crawler->filter('h1')->text();
// Output: "Title"
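For comparison, here is a dependency-free sketch of the same extraction using PHP's built-in DOM extension, which is roughly what the Crawler wraps under the hood:

```php
<?php
// Plain ext-dom equivalent of the Crawler example above (illustration only)
$html = '<html><body><h1>Title</h1><p>Content</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$title = $xpath->query('//h1')->item(0)->textContent;

echo $title; // prints "Title"
```

The component adds fluent filtering, link/form handling, and parse-error suppression on top of this low-level API.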
Where to Look First:
Fetch live pages with an HTTP client (Laravel's Http facade shown below):
use Illuminate\Support\Facades\Http;
use Symfony\Component\DomCrawler\Crawler;
$response = Http::get('https://example.com');
$crawler = new Crawler($response->body());
Extract tables, lists, or forms from HTML:
// Extract all table rows as arrays of cell text
$rows = $crawler->filter('table tr')->each(function (Crawler $row) {
    return $row->filter('th, td')->each(fn (Crawler $cell) => $cell->text());
});
// Example output: [['Col1', 'Col2'], ['Data1', 'Data2'], ...]
Automate interactions with legacy forms:
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// HttpBrowser is the concrete BrowserKit client (requires symfony/browser-kit + symfony/http-client)
$client = new HttpBrowser(HttpClient::create());
$crawler = $client->request('GET', 'https://example.com/login');
$form = $crawler->selectButton('Submit')->form();
$form['username'] = 'test';
$form['password'] = 'secret';
$crawler = $client->submit($form);
Leverage XPath when CSS selectors fall short:
$nodes = $crawler->filterXPath('//div[@class="product" and @data-id]');
Encapsulate logic in a service for reusability:
namespace App\Services;

use Symfony\Component\DomCrawler\Crawler;
use Illuminate\Support\Facades\Http;

class ScraperService
{
    public function scrapeProductData(string $url): array
    {
        $response = Http::get($url);
        $crawler = new Crawler($response->body());

        return $crawler->filter('.product')->each(function (Crawler $node) {
            return [
                'name' => $node->filter('h2')->text(),
                'price' => $node->filter('.price')->text(),
            ];
        });
    }
}
Assert DOM structure in Laravel tests:
use Tests\TestCase;
use Symfony\Component\DomCrawler\Crawler;

class HomepageTest extends TestCase
{
    public function test_homepage_structure(): void
    {
        $response = $this->get('/');
        $crawler = new Crawler($response->getContent());
        $this->assertSame('Welcome', $crawler->filter('h1')->text());
        $this->assertCount(3, $crawler->filter('.feature'));
    }
}
Loop through paginated results:
$baseUrl = 'https://example.com/page/';
$data = [];
for ($i = 1; $i <= 5; $i++) {
    $response = Http::get($baseUrl . $i);
    $crawler = new Crawler($response->body());
    // extract() reads attributes from the matched nodes, so target the <a> tags
    $data[] = $crawler->filter('.item a')->extract(['href', 'title']);
}
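Each pass through the loop above appends one page's results, leaving $data nested per page; flattening it afterwards is a one-liner. A plain-PHP sketch, with sample data standing in for the per-page extract() output:

```php
<?php
// Sample per-page results (each inner array mimics one page's extract() output)
$data = [
    [['/a', 'First'], ['/b', 'Second']], // page 1
    [['/c', 'Third']],                   // page 2
];

// Merge the page arrays into one flat list of [href, title] pairs
$flat = array_merge(...$data);

echo count($flat); // prints 3
```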
Parse XML feeds (e.g., RSS):
$xml = '<?xml version="1.0"?><root><item><title>Test</title></item></root>';
$crawler = new Crawler();
$crawler->addXmlContent($xml); // addXmlContent() forces XML parsing
$titles = $crawler->filter('item title')->each(fn (Crawler $node) => $node->text());
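For simple feeds where pulling in the component is overkill, PHP's built-in SimpleXML does the job; a minimal sketch:

```php
<?php
// Built-in SimpleXML alternative for basic feeds (no Symfony dependency)
$xml = '<root><item><title>Test</title></item><item><title>Other</title></item></root>';

$feed = simplexml_load_string($xml);

$titles = [];
foreach ($feed->item as $item) {
    $titles[] = (string) $item->title;
}
// $titles === ['Test', 'Other']
```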
Combine with Laravel’s Http facade for seamless requests:
use Illuminate\Support\Facades\Http;
use Symfony\Component\DomCrawler\Crawler;
$response = Http::withOptions(['verify' => false])->get('https://example.com');
$crawler = new Crawler($response->body());
For advanced HTTP features (e.g., cookies, headers):
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client();
$response = $client->request('GET', 'https://example.com', [
'headers' => ['User-Agent' => 'Mozilla/5.0'],
]);
$crawler = new Crawler($response->getBody());
Offload scraping to background jobs:
namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\Http;
use Symfony\Component\DomCrawler\Crawler;

class ScrapeJob implements ShouldQueue
{
    use Queueable; // Queueable is a trait; the queue contract is ShouldQueue

    public function handle(): void
    {
        $response = Http::get('https://example.com');
        $crawler = new Crawler($response->body());
        // Process data...
    }
}
Pre-render HTML for testing:
$html = view('partials.product-card', ['product' => $product])->render();
$crawler = new Crawler($html);
Handle malformed HTML gracefully. The Crawler suppresses libxml parse errors internally, and when the masterminds/html5 package is installed it parses documents with a spec-compliant HTML5 parser:
// Silence libxml warnings if you parse messy markup yourself
libxml_use_internal_errors(true);
$crawler = new Crawler($html, 'https://example.com');
// Or pre-process with the Tidy extension:
$tidy = new \tidy();
$tidy->parseString($html, [], 'utf8');
$tidy->cleanRepair(); // repairs in place; cast the object to string for the output
$crawler = new Crawler((string) $tidy);
Beware of case sensitivity in selectors:
// CSS class matching is case-sensitive
$crawler->filter('div.classname');
// XPath can be made case-insensitive with translate()
$crawler->filterXPath('//div[contains(translate(@class, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "classname")]');
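The translate() trick can be verified outside the component with PHP's built-in DOMXPath, which DomCrawler delegates to internally:

```php
<?php
// Check the case-insensitive class match with plain ext-dom
$html = '<div class="PRODUCT featured">A</div><div class="other">B</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    '//div[contains(translate(@class, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "product")]'
);

echo $nodes->length; // prints 1 (only the PRODUCT div matches)
```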
Links returned by link() or attr('href') may be relative. Pass a base URI to the Crawler constructor so they resolve to absolute URLs:
$crawler = new Crawler($html, 'https://example.com');
$links = $crawler->filter('a')->each(fn (Crawler $node) => $node->link()->getUri());
// Output: ['https://example.com/absolute-link', ...]
Iterate node by node to keep memory usage low on large documents:
$crawler = new Crawler($html, 'https://example.com');
$crawler->filter('div.item')->each(function (Crawler $node) {
    // Process one item at a time
});
Use Symfony Panther (composer require symfony/panther) when pages need JavaScript to render:
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://example.com');
Convert legacy encodings to UTF-8 before parsing:
$html = mb_convert_encoding($html, 'UTF-8', 'ISO-8859-1');
$crawler = new Crawler($html);
Use html() to debug the parsed structure:
$crawler = new Crawler($html);
dd($crawler->html()); // Dump raw HTML for inspection
Test selectors in browser DevTools first, then adapt for DomCrawler:
// Guard against selectors that match nothing before calling text() or attr()
if ($crawler->filter('.product')->count() === 0) {
    // Handle the missing element (log, skip, or fall back)
}