Goutte Laravel Package

fabpot/goutte

Goutte is a PHP web scraping and web testing library built on Symfony components. It provides a simple API to crawl pages, submit forms, click links, and extract content with CSS selectors—handy for quick crawlers, monitors, and functional checks.

Getting Started

Minimal Steps

  1. Installation (for legacy projects or quick prototyping):

    composer require fabpot/goutte
    

    For new projects, use Symfony’s HttpBrowser directly via:

    composer require symfony/browser-kit symfony/http-client
    
  2. Basic Scraper Setup:

    use Goutte\Client; // or Symfony\Component\BrowserKit\HttpBrowser
    
    $client = new Client();
    $crawler = $client->request('GET', 'https://example.com');
    
  3. First Use Case: Extract Links

    $links = $crawler->filter('a')->each(function ($node) {
        return $node->attr('href');
    });
    
  4. Laravel Integration (Service Provider):

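    // app/Providers/AppServiceProvider.php
    public function register()
    {
        // Leading backslash: this file is namespaced, so a bare Goutte\Client
        // would resolve relative to App\Providers.
        $this->app->singleton(\Goutte\Client::class, function ($app) {
            return new \Goutte\Client();
        });
    }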
    

    Usage in Controller:

    use Goutte\Client;
    
    public function scrape(Client $client)
    {
        $crawler = $client->request('GET', 'https://example.com');
        // Process $crawler...
    }
    
  5. Queueable Scraping (for long-running tasks):

    php artisan make:job ScrapeJob
    
    // app/Jobs/ScrapeJob.php (add `use Goutte\Client;` at the top of the file)
    public function handle()
    {
        $client = new Client();
        $crawler = $client->request('GET', 'https://example.com');
        // Store results or dispatch events...
    }
    

    Dispatch via:

    ScrapeJob::dispatch()->onQueue('scraping');
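
For new projects, steps 2 and 3 translate almost one-to-one to Symfony's HttpBrowser. The following is a minimal sketch, not the package's documented migration path: it assumes symfony/browser-kit, symfony/http-client, and symfony/css-selector are installed via Composer, and it uses a mock HTTP client with made-up HTML so it runs offline. Swap in HttpClient::create() to hit a real site.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;

// Canned page (an assumption for illustration) so the sketch runs offline.
$html = '<html><body><a href="/docs">Docs</a><a>No href</a><a href="/blog">Blog</a></body></html>';
$http = new MockHttpClient(new MockResponse($html, [
    'response_headers' => ['content-type' => 'text/html'],
]));

// HttpBrowser plays the role that Goutte\Client plays in the steps above.
$browser = new HttpBrowser($http);
$crawler = $browser->request('GET', 'https://example.com');

// Collect every href, dropping anchors that have no href attribute.
$links = array_values(array_filter(
    $crawler->filter('a')->each(fn ($node) => $node->attr('href'))
));

print_r($links);
```

Because Goutte v4's Client extends HttpBrowser, the same calls (request(), filter(), each()) should carry over unchanged when migrating.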
    

Implementation Patterns

1. Data Extraction Patterns

Basic CSS Selectors

// Extract text from every matching element (text() on its own returns only the first match)
$titles = $crawler->filter('h1, h2')->each(fn ($node) => $node->text());

// Extract attributes
$images = $crawler->filter('img')->each(function ($node) {
    return $node->attr('src');
});

Nested Data (Tables, Lists)

// Parse HTML tables: one array of cell texts per row
$tableData = $crawler->filter('table tr')->each(function ($row) {
    return $row->filter('td')->extract(['_text']); // '_text' is the special key for node text
});

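// Parse nested lists ('ul li' also matches nested items, so each sub-item gets its own entry too)
$menu = $crawler->filter('ul li')->each(function ($item) {
    return [
        'text' => $item->text(), // Includes the text of any nested list
        // each() returns [] when nothing matches, so no count() check is needed
        'children' => $item->filter('ul > li')->each(fn ($child) => $child->text()),
    ];
});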
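
Table rows are often more useful keyed by their column headers. The sketch below shows one way to do that with DomCrawler alone, assuming symfony/dom-crawler and symfony/css-selector are installed; the HTML and column names are invented for illustration.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical table markup standing in for a scraped page.
$html = <<<'HTML'
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
HTML;

$crawler = new Crawler($html);

// Header texts become the keys of each row.
$headers = $crawler->filter('table th')->extract(['_text']);

$rows = [];
$crawler->filter('table tr')->each(function (Crawler $row) use ($headers, &$rows) {
    $cells = $row->filter('td')->extract(['_text']);
    // Skip the header row and any row whose cell count does not match.
    if (count($cells) === count($headers)) {
        $rows[] = array_combine($headers, $cells);
    }
});

print_r($rows);
```

Rows with merged or missing cells are silently skipped here; a real scraper may prefer to log them instead.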

Pagination Handling

$client = new Client();
$items = [];
for ($page = 1; $page <= 5; $page++) { // Example: 5 pages
    $crawler = $client->request('GET', "https://example.com/page/$page");
    // Append this page's products instead of overwriting the previous page's
    $items = array_merge($items, $crawler->filter('.product')->each(fn ($node) => $node->text()));
}
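
When the number of pages is not known up front, an alternative is to follow the page's own "Next" link until it disappears. This is a sketch under assumptions: the link text "Next", the .product selector, and the two mock pages are all hypothetical, and a mock HTTP client replaces the real site so the loop can run offline.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;

// Hypothetical two-page catalogue served by the mock client.
$pages = [
    'https://example.com/page/1' => '<div class="product">A</div><div class="product">B</div><a href="/page/2">Next</a>',
    'https://example.com/page/2' => '<div class="product">C</div>',
];
$http = new MockHttpClient(function (string $method, string $url) use ($pages) {
    return new MockResponse($pages[$url] ?? '', [
        'response_headers' => ['content-type' => 'text/html'],
    ]);
});

$browser = new HttpBrowser($http);
$crawler = $browser->request('GET', 'https://example.com/page/1');

$items = [];
$maxPages = 5; // Safety cap so a misbehaving site cannot loop forever.
for ($page = 1; $page <= $maxPages; $page++) {
    $items = array_merge($items, $crawler->filter('.product')->each(fn ($node) => $node->text()));

    $next = $crawler->selectLink('Next');
    if ($next->count() === 0) {
        break; // No further pages.
    }
    // click() resolves the relative href against the current page URL.
    $crawler = $browser->click($next->link());
}

print_r($items);
```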

2. Form Interaction

Submitting Forms

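// Relative URIs resolve against the previously requested page, so start with an absolute URL
$crawler = $client->request('POST', 'https://example.com/login', [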
    'email' => 'user@example.com',
    'password' => 'secret',
]);

Handling CSRF Tokens

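// Assumes an earlier request to the same site, so the relative URI resolves correctly
$crawler = $client->request('GET', '/form-page');
$form = $crawler->selectButton('Submit')->form();
// form() already carries hidden inputs such as csrf_token; set it explicitly only to override the value
$form['csrf_token'] = $crawler->filter('input[name="csrf_token"]')->attr('value');
$client->submit($form);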
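
The two steps above can often be collapsed into a single submitForm() call, which picks up hidden fields from the fetched form on its own. The sketch below illustrates that under assumptions: the form markup, field names, and token value are invented, and a mock HTTP client records what would have been posted.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;

// Hypothetical login form containing a hidden CSRF token.
$formPage = '<form action="/login" method="post">'
    . '<input type="hidden" name="csrf_token" value="tok-123">'
    . '<input type="text" name="email">'
    . '<input type="password" name="password">'
    . '<button type="submit">Submit</button></form>';

$submitted = [];
$http = new MockHttpClient(function (string $method, string $url, array $options) use ($formPage, &$submitted) {
    $headers = ['response_headers' => ['content-type' => 'text/html']];
    if ($method === 'POST') {
        // Record the urlencoded body the browser sent.
        parse_str($options['body'], $submitted);
        return new MockResponse('<p>ok</p>', $headers);
    }
    return new MockResponse($formPage, $headers);
});

$browser = new HttpBrowser($http);
$browser->request('GET', 'https://example.com/form-page');

// Only the visible fields are filled in; the hidden token travels with the form.
$browser->submitForm('Submit', [
    'email' => 'user@example.com',
    'password' => 'secret',
]);

print_r($submitted);
```

Tokens that are injected by JavaScript after page load will not be present in the fetched HTML, so this only helps for server-rendered forms.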

3. Dynamic Content (AJAX/SPA Workarounds)

Fetching AJAX Responses

use Symfony\Component\BrowserKit\Cookie;

$client = new Client();
$client->getCookieJar()->set(new Cookie('session_id', 'abc123')); // If needed
$crawler = $client->request('GET', 'https://example.com/api/data', [], [], [
    'HTTP_X_REQUESTED_WITH' => 'XMLHttpRequest',
]);

Simulating User Agents

$client = new Client();
$client->setServerParameter('HTTP_USER_AGENT', 'Mozilla/5.0'); // Sent as the User-Agent header
$crawler = $client->request('GET', 'https://example.com');

4. Laravel-Specific Patterns

Artisan Commands for CLI Scraping

// app/Console/Commands/ScrapeCommand.php
public function handle()
{
    $client = new Client();
    $crawler = $client->request('GET', $this->argument('url'));

    $this->info('Scraped data: ' . $crawler->filter('h1')->text());
}

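Usage (assuming the command's signature is 'scrape:run {url}', since handle() reads an argument, not an option):

php artisan scrape:run https://example.com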

Event-Based Scraping

// Dispatch events after scraping (ScrapedDataEvent is your own event class)
event(new ScrapedDataEvent($crawler->filter('.product')->each(fn ($node) => $node->text())));

Caching Responses

use Illuminate\Support\Facades\Cache;

$cachedData = Cache::remember('scraped_data', now()->addHours(1), function () {
    $client = new Client();
    return $client->request('GET', 'https://example.com')->html();
});

5. Error Handling & Retries

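Retry Mechanism with RetryableHttpClient

use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\HttpClient\RetryableHttpClient;

// Goutte v4's Client accepts a Symfony HTTP client in its constructor.
// RetryableHttpClient retries transient failures (up to 3 times by default).
$client = new Client(new RetryableHttpClient(HttpClient::create()));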

Custom Exception Handling

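use Illuminate\Support\Facades\Log;
use Symfony\Contracts\HttpClient\Exception\TransportExceptionInterface;

try {
    $crawler = $client->request('GET', 'https://example.com');
} catch (TransportExceptionInterface $e) { // Network-level failures: DNS, timeouts, refused connections
    Log::error('Scraping failed: ' . $e->getMessage());
    // Fallback logic...
}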
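
Catching exceptions is only half of the picture: with Symfony's HttpBrowser (and therefore Goutte v4), a 4xx or 5xx response does not raise an exception, so the status code has to be checked by hand. The sketch below shows one way to do that; the 503 response comes from a mock client and is purely illustrative.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;
use Symfony\Contracts\HttpClient\Exception\TransportExceptionInterface;

// Mock server that always answers 503 (an assumption for the demo).
$http = new MockHttpClient(new MockResponse('Service Unavailable', ['http_code' => 503]));
$browser = new HttpBrowser($http);

$error = null;
try {
    $crawler = $browser->request('GET', 'https://example.com');

    // The request "succeeded" at the transport level; inspect the status ourselves.
    $status = $browser->getResponse()->getStatusCode();
    if ($status >= 400) {
        throw new \RuntimeException("Unexpected HTTP status $status");
    }
} catch (TransportExceptionInterface $e) {
    // Must come first: Symfony's transport exceptions also extend RuntimeException.
    $error = 'Network error: ' . $e->getMessage();
} catch (\RuntimeException $e) {
    $error = $e->getMessage();
}

echo $error, PHP_EOL;
```

In a Laravel app the two catch blocks would typically call Log::error() and decide whether the job should be released back onto the queue.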

Gotchas and Tips

Pitfalls

  1. Deprecation Warning:

    • Goutte v4+ is a thin proxy to Symfony’s HttpBrowser and is deprecated. Do not add it as a new dependency; migrate to:
      use Symfony\Component\BrowserKit\HttpBrowser;
      use Symfony\Component\DomCrawler\Crawler;
      
    • Tip: Use symfony/browser-kit directly for future-proofing.
  2. JavaScript-Rendered Content:

    • Goutte cannot execute JS. For SPAs (React/Angular), use:
      • PuPHPeteer: nesk/puphpeteer (a PHP bridge to Puppeteer).
      • Symfony Panther: symfony/panther (drives a real browser such as headless Chrome).
    • Workaround: If the page loads its data from a JSON endpoint, request that endpoint directly instead of parsing the empty HTML shell.
  3. Rate Limiting & IP Bans:

    • Symfony’s HttpClient adds no delay between requests. Pause explicitly inside your loop:
      $crawler = $client->request('GET', $url);
      usleep(2000000); // 2-second pause before the next request
      
    • Tip: For high-volume scraping, route requests through a proxy with HttpClient’s proxy option, e.g. HttpClient::create(['proxy' => 'http://proxy.example:8080']).
  4. Memory Leaks:

    • Large DOMs (e.g., e-commerce sites) can exhaust memory. Mitigate with:
      • Chunked processing: Parse pages in batches.
      • DomCrawler filtering: Narrow queries early:
        $crawler->filter('.product')->each(...); // Instead of full page
        
  5. CSRF Tokens & Dynamic Forms:

    • Many sites use dynamic tokens. Extract them first:
      $token = $crawler->filter('input[name="csrf_token"]')->attr('value');
      $form = $crawler->selectButton('Submit')->form();
      $form['csrf_token'] = $token;
      
  6. Laravel Service Container Conflicts:

    • If Symfony’s HttpClient is configured elsewhere in the app, pass a client to Goutte explicitly rather than relying on its default:
      $this->app->bind(\Goutte\Client::class, function ($app) {
          // HttpClient is a static factory, so build the client with create() instead of $app->make()
          $httpClient = \Symfony\Component\HttpClient\HttpClient::create();
          return new \Goutte\Client($httpClient);
      });
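
For the rate-limiting pitfall above, a fixed sleep after every request wastes time when parsing is already slow. A small throttle that only waits for the remainder of the interval is one alternative. This is a plain-PHP sketch; makeThrottle() is a hypothetical helper, not part of Goutte or Symfony, and the request itself is left as a comment.

```php
<?php
/**
 * Returns a closure that blocks until at least $minIntervalSeconds have
 * elapsed since its previous call. The first call never waits.
 */
function makeThrottle(float $minIntervalSeconds): Closure
{
    $last = null;

    return function () use (&$last, $minIntervalSeconds): void {
        if ($last !== null) {
            $remaining = $minIntervalSeconds - (microtime(true) - $last);
            if ($remaining > 0) {
                usleep((int) ceil($remaining * 1000000));
            }
        }
        $last = microtime(true);
    };
}

// Short interval so the demo finishes quickly; use 1-2 seconds for real sites.
$throttle = makeThrottle(0.2);

$start = microtime(true);
foreach (['https://example.com/page/1', 'https://example.com/page/2', 'https://example.com/page/3'] as $url) {
    $throttle();
    // $crawler = $client->request('GET', $url);
}
$elapsed = microtime(true) - $start;

printf("Three throttled iterations took %.2f seconds\n", $elapsed);
```

In a queued job, the same effect can be approached with Laravel's rate-limiting job middleware, which avoids blocking a worker while waiting.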
      

Debugging Tips

  1. Inspect Raw Responses:

    $response = $client->getResponse();
    file_put_contents('debug.html', $response->getContent());
    
  2. Log Headers & Cookies:

    logger()->debug('Response headers', $client->getResponse()->getHeaders());
    logger()->debug('Cookies', $client->getCookieJar()->allValues('https://example.com'));
    
  3. Validate Selectors:

    • Use browser dev tools to test CSS selectors before coding.
    • Tip: Use ->each() to debug:
      $crawler->filter('.nonexistent')->each(fn($node) => dd($node->
      