fabpot/goutte
Goutte is a PHP web scraping and web testing library built on Symfony components. It provides a simple API to crawl pages, submit forms, click links, and extract content with CSS selectors—handy for quick crawlers, monitors, and functional checks.
Installation (for legacy projects or quick prototyping):
composer require fabpot/goutte
For new projects, use Symfony’s HttpBrowser directly via:
composer require symfony/browser-kit symfony/http-client
Basic Scraper Setup:
use Goutte\Client; // or Symfony\Component\BrowserKit\HttpBrowser
$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
First Use Case: Extract Links
$links = $crawler->filter('a')->each(function ($node) {
return $node->attr('href');
});
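The raw list returned by `->each()` usually needs tidying before use, since `attr('href')` can yield `null` for anchors without an `href`. A minimal sketch in plain PHP follows; `cleanLinks()` is a hypothetical helper, not part of Goutte, and the set of skipped schemes is an assumption.

```php
<?php
// Hypothetical helper (not part of Goutte): tidy the href list produced by ->each().
function cleanLinks(array $hrefs): array
{
    $links = [];
    foreach ($hrefs as $href) {
        // attr('href') can be null when an <a> tag has no href attribute.
        if (!is_string($href)) {
            continue;
        }
        $href = trim($href);
        // Skip empty values, in-page anchors, and javascript:/mailto: pseudo-links.
        if ($href === '' || $href[0] === '#' || preg_match('/^(javascript|mailto):/i', $href)) {
            continue;
        }
        $links[] = $href;
    }
    // De-duplicate while keeping first-seen order, then reindex from 0.
    return array_values(array_unique($links));
}
```

With the snippet above this could be called as `cleanLinks($links)`.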
Laravel Integration (Service Provider):
// app/Providers/AppServiceProvider.php
public function register()
{
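$this->app->singleton(\Goutte\Client::class, function ($app) {
// Leading backslash: inside the App\Providers namespace, an unqualified Goutte\Client would resolve to App\Providers\Goutte\Client
return new \Goutte\Client();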
});
}
Usage in Controller:
use Goutte\Client;
public function scrape(Client $client)
{
$crawler = $client->request('GET', 'https://example.com');
// Process $crawler...
}
Queueable Scraping (for long-running tasks):
php artisan make:job ScrapeJob
// app/Jobs/ScrapeJob.php
public function handle()
{
$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
// Store results or dispatch events...
}
Dispatch via:
ScrapeJob::dispatch()->onQueue('scraping');
// Extract text from elements
// text() alone returns only the first match, so map over every node
$titles = $crawler->filter('h1, h2')->each(fn($node) => $node->text());
// Extract attributes
$images = $crawler->filter('img')->each(function ($node) {
return $node->attr('src');
});
// Parse HTML tables
$tableData = $crawler->filter('table tr')->each(function ($row) {
return $row->filter('td')->extract(['_text']); // '_text' is the special key for node text
});
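Rows extracted this way are plain lists of cell strings. A minimal sketch of turning them into keyed records follows, assuming the column names are known in advance; `rowsToRecords()` and the sample headers are hypothetical, not part of Goutte.

```php
<?php
// Hypothetical helper: map each row's cells onto known column names.
function rowsToRecords(array $headers, array $rows): array
{
    $records = [];
    foreach ($rows as $cells) {
        // Header rows contain only <th> cells, so filter('td') yields an empty array
        // for them; skip those and any row whose cell count does not match.
        if ($cells === [] || count($cells) !== count($headers)) {
            continue;
        }
        $records[] = array_combine($headers, array_map('trim', $cells));
    }
    return $records;
}
```

For example, `rowsToRecords(['name', 'price'], $tableData)` would give one associative array per data row.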
// Parse nested lists
$menu = $crawler->filter('ul li')->each(function ($item) {
return [
'text' => $item->text(),
'children' => $item->filter('ul')->count() > 0
? $item->filter('ul')->each(fn($child) => $child->text())
: []
];
});
$page = 1;
$items = [];
$client = new Client();
while ($page <= 5) { // Example: 5 pages
$crawler = $client->request('GET', "https://example.com/page/$page");
// Append this page's results instead of overwriting the previous page's
$items = array_merge($items, $crawler->filter('.product')->each(fn($node) => $node->text()));
$page++;
}
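A fixed page count is rarely known up front. One possible variation, sketched in plain PHP, stops at the first empty page. `collectPages()` and its `$fetchPage` callable are hypothetical names, and treating an empty page as the end of the listing is an assumption about the target site.

```php
<?php
// Hypothetical helper: $fetchPage takes a page number and returns that page's items.
function collectPages(callable $fetchPage, int $maxPages): array
{
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $items = $fetchPage($page);
        if ($items === []) {
            break; // assumption: an empty page marks the end of the listing
        }
        $all = array_merge($all, $items);
    }
    return $all;
}

// With Goutte, the callable might look like this (left commented out, as it needs the library):
// $items = collectPages(
//     fn (int $page) => $client->request('GET', "https://example.com/page/$page")
//         ->filter('.product')->each(fn ($node) => $node->text()),
//     5
// );
```

Passing the fetch step in as a callable keeps the loop testable without network access.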
$crawler = $client->request('POST', '/login', [
'email' => 'user@example.com',
'password' => 'secret',
]);
$crawler = $client->request('GET', '/form-page');
$form = $crawler->selectButton('Submit')->form();
$form['csrf_token'] = $crawler->filter('input[name="csrf_token"]')->attr('value');
$client->submit($form);
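use Symfony\Component\BrowserKit\Cookie;
$client = new Client();
// The cookie jar lives on the browser itself and expects a Cookie object
$client->getCookieJar()->set(new Cookie('session_id', 'abc123')); // If needed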
$crawler = $client->request('GET', '/api/data', [], [], [
'HTTP_X_REQUESTED_WITH' => 'XMLHttpRequest',
]);
$client = new Client();
$client->setServerParameter('HTTP_USER_AGENT', 'Mozilla/5.0'); // Sent as the User-Agent header on every request
$crawler = $client->request('GET', 'https://example.com');
// app/Console/Commands/ScrapeCommand.php
public function handle()
{
$client = new Client();
$crawler = $client->request('GET', $this->argument('url'));
$this->info('Scraped data: ' . $crawler->filter('h1')->text());
}
Usage:
php artisan scrape:run https://example.com
// Dispatch events after scraping
event(new ScrapedDataEvent($crawler->filter('.product')->each(...)));
use Illuminate\Support\Facades\Cache;
$cachedData = Cache::remember('scraped_data', now()->addHours(1), function () {
$client = new Client();
return $client->request('GET', 'https://example.com')->html();
});
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\HttpClient\RetryableHttpClient;
// Decorate the HTTP client so failed requests are retried (up to 3 times by default)
$client = new Client(new RetryableHttpClient(HttpClient::create()));
try {
$crawler = $client->request('GET', 'https://example.com');
} catch (\Symfony\Contracts\HttpClient\Exception\TransportExceptionInterface $e) { // Network-level failures (DNS, timeout, refused connection)
Log::error('Scraping failed: ' . $e->getMessage());
// Fallback logic...
}
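Retrying can also be done around any single call with a generic wrapper. The following is a minimal sketch in plain PHP, not a Goutte or Symfony API: `withRetries()`, its parameters, and the exponential backoff schedule are all assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: run $operation, retrying on exceptions with exponential backoff.
function withRetries(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 200)
{
    $attempt = 0;
    while (true) {
        $attempt++;
        try {
            return $operation();
        } catch (\Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // out of attempts: surface the last error to the caller
            }
            // Wait base, then 2x base, then 4x base, ... before the next attempt.
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}
```

A scrape could then be wrapped as `withRetries(fn () => $client->request('GET', 'https://example.com'))`. In real use it is safer to catch only the exception types that indicate a transient failure.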
Deprecation Warning:
Goutte is deprecated; since v4 it is a thin proxy for Symfony’s HttpBrowser. Avoid adding it as a new dependency; migrate to:
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\DomCrawler\Crawler;
Use symfony/browser-kit directly for future-proofing.
JavaScript-Rendered Content:
Goutte does not execute JavaScript. Options for rendered pages:
php-puppeteer/php-puppeteer.
symfony/panther (headless Chrome).
Fetch the data with HttpClient, then parse.
Rate Limiting & IP Bans:
HttpClient has no built-in delay option. Pause between requests yourself:
foreach ($urls as $url) { // $urls: your list of pages to fetch
$crawler = $client->request('GET', $url);
sleep(2); // 2-second delay between requests
}
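Spacing between requests can also be kept in one place with a small throttle object. This is a sketch under stated assumptions rather than a library feature: the `Throttle` class is hypothetical and needs PHP 8.0 or later for constructor promotion.

```php
<?php
// Hypothetical helper: enforce a minimum interval between consecutive calls to wait().
final class Throttle
{
    private ?float $lastCall = null;

    public function __construct(private float $minIntervalSeconds)
    {
    }

    public function wait(): void
    {
        if ($this->lastCall !== null) {
            $remaining = $this->minIntervalSeconds - (microtime(true) - $this->lastCall);
            if ($remaining > 0) {
                // Sleep only for the part of the interval that has not elapsed yet.
                usleep((int) round($remaining * 1_000_000));
            }
        }
        $this->lastCall = microtime(true);
    }
}
```

Calling `$throttle->wait()` immediately before each `$client->request(...)` means time already spent parsing counts toward the interval, unlike a fixed `sleep()`.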
Consider rotating proxies (e.g., via HttpClient’s proxy option) for high-volume scraping.
Memory Leaks:
DomCrawler filtering: Narrow queries early:
$crawler->filter('.product')->each(...); // Instead of full page
CSRF Tokens & Dynamic Forms:
$token = $crawler->filter('input[name="csrf_token"]')->attr('value');
$form = $crawler->selectButton('Submit')->form();
$form['csrf_token'] = $token;
Laravel Service Container Conflicts:
If you use Symfony’s HttpClient elsewhere, Goutte’s client may conflict. Bind explicitly:
$this->app->bind(\Goutte\Client::class, function ($app) {
// HttpClient is a static factory; create() returns an HttpClientInterface instance
$httpClient = \Symfony\Component\HttpClient\HttpClient::create();
return new \Goutte\Client($httpClient);
});
Inspect Raw Responses:
$response = $client->getResponse();
file_put_contents('debug.html', $response->getContent());
Log Headers & Cookies:
$client->setServerParameter('HTTP_USER_AGENT', 'MyScraper/1.0');
$client->getCookieJar()->set(new \Symfony\Component\BrowserKit\Cookie('session', 'abc123'));
Validate Selectors:
Use ->each() to debug:
$crawler->filter('.nonexistent')->each(fn($node) => dd($node->