Code Weaver
Helps Laravel developers discover, compare, and choose open-source packages. See popularity, security, maintainers, and scores at a glance to make better decisions.

Crawler Laravel Package

spatie/crawler

Fast, concurrent web crawler for PHP. Crawl sites, restrict crawls to internal URLs with depth limits, and hook into the crawl lifecycle through observers. Can execute JavaScript via Chrome/Puppeteer for rendered pages. Crawl profiles and queues are pluggable, which also makes crawl logic straightforward to test.


Getting Started

Minimal Setup

  1. Installation:
    composer require spatie/crawler
    
  2. First Crawl:
    use GuzzleHttp\Exception\RequestException;
    use Psr\Http\Message\ResponseInterface;
    use Psr\Http\Message\UriInterface;
    use Spatie\Crawler\Crawler;
    use Spatie\Crawler\CrawlObservers\CrawlObserver;
    
    Crawler::create()
        ->setCrawlObserver(new class extends CrawlObserver {
            public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
            {
                echo "Crawled: {$url}\n";
            }
    
            public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
            {
                // Ignore failures in this minimal example
            }
        })
        ->startCrawling('https://example.com');
    

First Use Case: Crawl Internal Links Only

use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

Crawler::create()
    ->setCrawlProfile(new CrawlInternalUrls('https://example.com'))
    ->setMaximumDepth(2)
    ->setCrawlObserver(new UrlCollector()) // a CrawlObserver you define to record each crawled URL
    ->startCrawling('https://example.com');
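Restricting a crawl to internal links comes down to comparing each discovered URL's host with the base URL's host. A minimal sketch of that check in plain PHP (the helper name is hypothetical, not a package API):

```php
// Hypothetical helper: a URL is "internal" when its host matches the base URL's host.
function isInternalUrl(string $baseUrl, string $url): bool
{
    return parse_url($baseUrl, PHP_URL_HOST) === parse_url($url, PHP_URL_HOST);
}
```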

Key Starting Points

  • Documentation: the README on GitHub is the primary API reference.
  • Crawler: fluent entry point; build a crawl with Crawler::create() and its set* methods.
  • CrawlObserver: receives the willCrawl / crawled / crawlFailed / finishedCrawling callbacks.
  • CrawlProfile: decides which discovered URLs are crawled (shouldCrawl()).

Implementation Patterns

1. Observer-Based Workflows

Use observers for structured, reusable crawl logic:

// app/CrawlObservers/UrlExtractor.php
namespace App\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class UrlExtractor extends CrawlObserver
{
    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        $this->extractData((string) $response->getBody());
    }

    public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        // Log or ignore failed requests
    }
}

Usage:

Crawler::create()
    ->setCrawlObserver(new UrlExtractor())
    ->startCrawling('https://example.com');

2. Concurrent Crawling with Limits

Leverage concurrency for performance:

Crawler::create()
    ->setConcurrency(10)      // 10 concurrent requests
    ->setTotalCrawlLimit(100) // Stop after 100 URLs
    ->setCrawlObserver(new UrlExtractor())
    ->startCrawling('https://example.com');

3. Dynamic URL Filtering

Filter URLs during the crawl with a CrawlProfile:

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class ProductsAndBlogProfile extends CrawlProfile
{
    public function shouldCrawl(UriInterface $url): bool
    {
        return str_contains($url->getPath(), 'products')
            || str_contains($url->getPath(), 'blog');
    }
}

Crawler::create()
    ->setCrawlProfile(new ProductsAndBlogProfile())
    ->setCrawlObserver(new UrlExtractor())
    ->startCrawling('https://example.com');
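The URL filter itself is plain string logic, so it can be unit-tested without running any crawl (the helper name is illustrative):

```php
// Illustrative pure predicate: keep only product and blog paths.
function shouldCrawlPath(string $path): bool
{
    return str_contains($path, 'products') || str_contains($path, 'blog');
}
```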

4. JavaScript-Rendered Pages

Enable JavaScript execution for SPAs (requires spatie/browsershot and Puppeteer):

Crawler::create()
    ->executeJavaScript() // Renders pages with Browsershot under the hood
    ->setCrawlObserver(new UrlExtractor())
    ->startCrawling('https://example.com');

5. Data Extraction Pipeline

Chain observers for multi-step processing:

Crawler::create()
    ->addCrawlObserver(new UrlExtractor())
    ->addCrawlObserver(new DataParser())
    ->addCrawlObserver(new DatabaseSaver())
    ->startCrawling('https://example.com');
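A parsing step in such a pipeline often reduces to pulling links or fields out of the HTML body. A self-contained sketch using PHP's built-in DOM extension (the helper name is illustrative, not a package API):

```php
// Illustrative helper: collect every <a href> value from an HTML body.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // Suppress warnings from imperfect real-world markup

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    return $links;
}
```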

6. Testing with Mocked Responses

The package ships no fake driver, but Crawler::create() accepts Guzzle client options, so you can queue canned responses with Guzzle's MockHandler and crawl without real HTTP calls:

use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

$mock = new MockHandler([
    new Response(200, [], '<html><a href="/about">About</a></html>'),
    new Response(200, [], '<html>About Page</html>'),
]);

Crawler::create(['handler' => HandlerStack::create($mock)])
    ->setCrawlObserver(new UrlExtractor())
    ->startCrawling('https://example.com');

Gotchas and Tips

Pitfalls

  1. Rate Limiting:

    • The default concurrency (10) may trigger 429 Too Many Requests. Lower it with ->setConcurrency(N).
    • Fix: Add a delay between requests:
      ->setDelayBetweenRequests(1000) // 1-second pause between requests
      
  2. Infinite Loops:

    • Unbounded crawls can spiral on large or cyclic sites. Always set:
      ->setTotalCrawlLimit(1000) // Safety net
      ->setMaximumDepth(3)       // Practical depth limit
      
  3. JavaScript Overhead:

    • ->executeJavaScript() renders every page in headless Chrome, adding substantial latency per request. There is no built-in per-URL toggle, so enable it only for crawls that actually need rendered HTML.

  4. Robots.txt Respected by Default:

    • The crawler honors robots.txt (plus robots meta tags and headers) unless you disable it:
      ->ignoreRobots()
      
  5. Memory Exhaustion:

    • The default ArrayCrawlQueue keeps every discovered URL in memory, and large response bodies add up. Cap body size and consider a persistent CrawlQueue implementation for big crawls:
      ->setMaximumResponseSize(1024 * 1024 * 3) // 3 MB maximum response size
      

Debugging Tips

  1. Log Failed Requests: implement crawlFailed() in your observer:

    public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        Log::error("Failed: {$url}", ['exception' => $requestException]);
    }
    
  2. Track Progress: there is no built-in progress object; increment counters in crawled() and report the totals from finishedCrawling().

  3. Validate URLs:

    • Use a CrawlProfile to skip malformed URLs:
      public function shouldCrawl(UriInterface $url): bool
      {
          return filter_var((string) $url, FILTER_VALIDATE_URL) !== false;
      }
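FILTER_VALIDATE_URL accepts absolute URLs with a scheme and host and rejects relative paths, so it is worth checking against the URL shapes your crawl actually produces (the helper name is illustrative):

```php
// Illustrative wrapper around the validation shown above.
function isWellFormedUrl(string $url): bool
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}
```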
      

Extension Points

  1. Custom Response Handling: crawled responses are PSR-7 ResponseInterface objects, so parse them directly in your observer:

    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        $productData = json_decode((string) $response->getBody(), true);
    }
  2. Headers and Cookies: pass Guzzle client options to Crawler::create() to apply them to every request:

    use GuzzleHttp\Cookie\CookieJar;
    use Spatie\Crawler\Crawler;
    
    Crawler::create([
        'headers' => ['User-Agent' => 'MyCrawler/1.0'],
        'cookies' => CookieJar::fromArray(['session_id' => 'abc123'], 'example.com'),
    ])
        ->setCrawlObserver(new UrlExtractor())
        ->startCrawling('https://example.com');
  3. Queue Integration: Offload crawls to Laravel queues:

    // app/Jobs/CrawlJob.php
    public function handle(): void
    {
        Crawler::create()
            ->setCrawlObserver(new PageDispatcher()) // an observer you define that dispatches ProcessPage jobs from crawled()
            ->startCrawling($this->url);
    }
    

Performance Quirks

  • Concurrency vs. Speed: Higher concurrency (->setConcurrency(20)) speeds up crawls but increases load on the target server.
  • Depth vs. Breadth: Shallow crawls (->setMaximumDepth(1)) finish faster but miss nested content.
  • Mocked Handlers: Guzzle's MockHandler is convenient for tests, but it returns queued responses in order and doesn't replicate:
    • Real redirect chains (unless mocked explicitly).
    • Dynamic content (e.g., API calls made by JavaScript).

Configuration

  • No Config File: spatie/crawler is a framework-agnostic PHP package; it registers no Laravel service provider and publishes no config file. Configure each crawl fluently, passing Guzzle client options to Crawler::create():

    Crawler::create(['timeout' => 30])
        ->setConcurrency(10)
        ->setUserAgent('MyAppCrawler/1.0')
        ->startCrawling('https://example.com');
    