Crawler Laravel Package

spatie/crawler

PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.


Getting Started

Minimal Setup

  1. Installation:

    composer require spatie/crawler
    
  2. First Crawl:

    use Spatie\Crawler\Crawler;
    use Spatie\Crawler\CrawlResponse;
    
    Crawler::create('https://example.com')
        ->onCrawled(function (string $url, CrawlResponse $response) {
            // One line per crawled page
            echo "Crawled: {$url} (Status: {$response->status()})\n";
        })
        ->start();
    
  3. Common First Use Cases:

    • Collect all internal URLs:
      $urls = Crawler::create('https://example.com')
          ->internalOnly()
          ->depth(2)
          ->foundUrls();
      
    • Test locally with fake responses:
      Crawler::create('https://example.com')
          ->fake([
              'https://example.com' => '<html><a href="/about">About</a></html>',
              'https://example.com/about' => '<html>About page</html>',
          ])
          ->foundUrls();
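
Once a fake site grows beyond a couple of pages, building the map from relative paths keeps test fixtures readable. The sketch below is plain PHP and relies only on the URL => HTML array shape shown in the fake example; `fakeSite()` is a hypothetical helper name, not part of spatie/crawler.

```php
<?php

declare(strict_types=1);

/**
 * Build an absolute URL => HTML map from relative paths.
 * Hypothetical helper for illustration; not part of spatie/crawler.
 *
 * @param array<string, string> $pages relative path => HTML body
 * @return array<string, string> absolute URL => HTML body
 */
function fakeSite(string $baseUrl, array $pages): array
{
    $base = rtrim($baseUrl, '/');
    $map = [];

    foreach ($pages as $path => $html) {
        $path = (string) $path; // numeric-looking keys arrive as int
        // '/' maps to the bare base URL, matching the fake example
        $url = $path === '/' ? $base : $base . '/' . ltrim($path, '/');
        $map[$url] = $html;
    }

    return $map;
}

$responses = fakeSite('https://example.com', [
    '/' => '<html><a href="/about">About</a></html>',
    '/about' => '<html>About page</html>',
]);
```

The resulting `$responses` array has the same shape as the literal passed to `->fake()`, so it could presumably be passed in its place.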
      

Implementation Patterns

Core Workflows

  1. Structured Crawling with Observers:

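    use Illuminate\Support\Facades\Log;
    use Spatie\Crawler\Crawler;
    use Spatie\Crawler\CrawlObservers\CrawlObserver;
    use Spatie\Crawler\CrawlResponse;

    class PageLogger extends CrawlObserver {
        // Called once for every successfully crawled URL
        public function crawled(string $url, CrawlResponse $response) {
            Log::info("Crawled {$url} with status {$response->status()}");
        }
    }
    
    Crawler::create('https://example.com')
        ->addObserver(new PageLogger())
        ->start();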
    
  2. Progress Tracking:

    Crawler::create('https://example.com')
        ->onCrawled(function (string $url, CrawlResponse $response, CrawlProgress $progress) {
            echo "Progress: {$progress->urlsProcessed}/{$progress->urlsFound}\n";
        })
        ->start();
    
  3. Dynamic Crawl Control:

    $shouldStop = false;
    Crawler::create('https://example.com')
        ->shouldStopCallback(function (Crawler $crawler) use (&$shouldStop) {
            return $shouldStop;
        })
        ->onCrawled(function (string $url) use (&$shouldStop) {
            if ($url === 'https://example.com/stop') {
                $shouldStop = true;
            }
        })
        ->start();
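
The stop flag in the last workflow generalizes to a budget object, which is easier to test than a captured variable. This is a minimal sketch in plain PHP (8.0 or later); `CrawlBudget` and its methods are hypothetical names, and the only thing borrowed from the example is the idea of a callback that returns a boolean.

```php
<?php

declare(strict_types=1);

/**
 * Tracks a page budget and a wall-clock budget for one crawl.
 * Hypothetical helper; not part of spatie/crawler.
 */
final class CrawlBudget
{
    private int $pages = 0;
    private float $startedAt;

    public function __construct(
        private int $maxPages,
        private float $maxSeconds,
    ) {
        $this->startedAt = microtime(true);
    }

    /** Call once per crawled page, e.g. from an onCrawled callback. */
    public function recordPage(): void
    {
        $this->pages++;
    }

    /** True once either the page budget or the time budget is spent. */
    public function exhausted(): bool
    {
        return $this->pages >= $this->maxPages
            || (microtime(true) - $this->startedAt) >= $this->maxSeconds;
    }
}

$budget = new CrawlBudget(maxPages: 3, maxSeconds: 60.0);
$budget->recordPage();
$budget->recordPage();
$stillRunning = !$budget->exhausted(); // two of three pages used

$budget->recordPage();
$done = $budget->exhausted(); // page budget is now spent
```

Under the assumptions of the example, `recordPage()` would be called inside the `onCrawled` closure and `exhausted()` returned from the stop callback.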
    

Integration Tips

  • Laravel Service Provider:
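    public function register() {
        // Resolve through the create() factory used elsewhere in this guide.
        // bind() rather than singleton(), so each resolution starts a fresh crawl.
        $this->app->bind(Crawler::class, function () {
            return Crawler::create('https://example.com');
        });
    }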
    
  • Queue Jobs for Large Crawls:
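    use Illuminate\Bus\Queueable;
    use Illuminate\Contracts\Queue\ShouldQueue;
    use Illuminate\Foundation\Bus\Dispatchable;
    use Spatie\Crawler\Crawler;

    class CrawlJob implements ShouldQueue {
        use Dispatchable, Queueable;

        public function handle(): void {
            // Cap the crawl so the job stays bounded
            Crawler::create('https://example.com')
                ->limit(1000)
                ->start();
        }
    }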
    
  • Store Results in Database:
    Crawler::create('https://example.com')
        ->onCrawled(function (string $url, CrawlResponse $response) {
            Page::updateOrCreate(['url' => $url], [
                'status_code' => $response->status(),
                'content' => $response->body(),
            ]);
        })
        ->start();
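
Writing one row per crawled page means one query per page. For larger crawls, a small buffer that flushes in batches can reduce database round trips. The sketch below is plain PHP (8.0 or later); `BatchBuffer` is a hypothetical helper, and in a Laravel app the writer closure might wrap something like Eloquent's `upsert()`.

```php
<?php

declare(strict_types=1);

/**
 * Collects rows and hands them to a writer in fixed-size batches.
 * Hypothetical helper; not part of spatie/crawler or Laravel.
 */
final class BatchBuffer
{
    /** @var list<array<string, mixed>> */
    private array $rows = [];

    public function __construct(
        private \Closure $writer,
        private int $size = 100,
    ) {
    }

    /** Queue one row and flush automatically when the batch is full. */
    public function add(array $row): void
    {
        $this->rows[] = $row;
        if (count($this->rows) >= $this->size) {
            $this->flush();
        }
    }

    /** Write whatever is pending; call once more after the crawl ends. */
    public function flush(): void
    {
        if ($this->rows === []) {
            return;
        }
        ($this->writer)($this->rows);
        $this->rows = [];
    }
}

// Demo writer that only records batch sizes; a real one would hit the database.
$batches = [];
$buffer = new BatchBuffer(function (array $rows) use (&$batches): void {
    $batches[] = count($rows);
}, size: 2);

foreach (['/a', '/b', '/c'] as $path) {
    $buffer->add(['url' => "https://example.com{$path}", 'status_code' => 200]);
}
$buffer->flush(); // write the final partial batch
```

The final `flush()` matters: without it, up to `size - 1` rows would be lost when the crawl ends.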
    

Gotchas and Tips

Common Pitfalls

  1. JavaScript Rendering Overhead:

    • Enabling JavaScript (->withBrowser()) significantly increases crawl time. Use only when necessary.
    • Tip: Test with ->fake() first to validate logic before enabling JS.
  2. Rate Limiting:

    • Default concurrency (10) may trigger rate limits. Adjust with:
      ->concurrency(2) // Lower for sensitive sites
      
  3. Robots.txt Respected by Default:

    • The crawler honors robots.txt rules unless told otherwise. To skip the check (for example, on a site you own), use:
      ->ignoreRobots()
      
  4. Duplicate URLs:

    • Use ->uniqueUrls() to avoid processing the same URL multiple times.
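  5. Memory Exhaustion:

    • Large crawls can exhaust memory if every response is kept in PHP arrays. Cap the crawl with ->limit() and persist or discard each result inside the callback rather than accumulating them.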
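
Duplicate handling often comes down to what counts as "the same URL". A minimal sketch of one common baseline follows, in plain PHP: lowercase the scheme and host, drop the fragment, and ignore a trailing slash. `normalizeUrl()` is a hypothetical helper, and these rules are an assumption for illustration, not a description of how the crawler itself compares URLs.

```php
<?php

declare(strict_types=1);

/**
 * Reduce a URL to a canonical form for duplicate detection.
 * Hypothetical helper; the rules are a common baseline, not the crawler's own.
 */
function normalizeUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return $url; // leave unparseable or relative URLs untouched
    }

    // Assumption: URLs without a scheme are treated as https
    $scheme = strtolower($parts['scheme'] ?? 'https');
    $host = strtolower($parts['host']);
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    // Treat "/about/" and "/about" as the same page; keep "/" for the root
    $path = rtrim($parts['path'] ?? '', '/');
    $path = $path === '' ? '/' : $path;

    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    // The fragment is dropped because it is never sent to the server
    return "{$scheme}://{$host}{$port}{$path}{$query}";
}

$seen = [];
foreach ([
    'https://Example.com/about/',
    'https://example.com/about#team',
    'https://example.com/about?page=2',
] as $url) {
    $seen[normalizeUrl($url)] = true; // array keys deduplicate for free
}
$unique = array_keys($seen);
```

The query string is kept on purpose, since `?page=2` usually identifies different content. Sites where trailing slashes are significant would need a looser rule.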

Debugging Tips

  • Log Failed Requests:
    ->onFailed(function (string $url, RequestException $exception) {
        Log::error("Failed to crawl {$url}: {$exception->getMessage()}");
    })
    
  • Inspect Transfer Stats:
    ->onFailed(function (string $url, RequestException $exception, TransferStatistics $stats) {
        if ($stats->transferTimeInMs() > 10000) {
            Log::warning("Slow crawl: {$url} took {$stats->transferTimeInMs()}ms");
        }
    })
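
Individual log lines can hide a pattern, such as one host producing most of the failures because it is throttling the crawler. The sketch below groups failed URLs by host in plain PHP; `failuresByHost()` is a hypothetical helper, and it assumes the failed URLs were collected into an array by a failure callback like the ones above.

```php
<?php

declare(strict_types=1);

/**
 * Count failed URLs per host to spot hosts that are throttling or down.
 * Hypothetical helper; not part of spatie/crawler.
 *
 * @param list<string> $failedUrls
 * @return array<string, int> host => failure count, highest first
 */
function failuresByHost(array $failedUrls): array
{
    $counts = [];
    foreach ($failedUrls as $url) {
        // parse_url() yields null or false when no host can be found
        $host = parse_url($url, PHP_URL_HOST) ?: 'unknown';
        $counts[$host] = ($counts[$host] ?? 0) + 1;
    }
    arsort($counts); // sort by count, descending, keeping host keys

    return $counts;
}

$report = failuresByHost([
    'https://example.com/a',
    'https://cdn.example.com/app.js',
    'https://example.com/b',
]);
```

A host that dominates the report is a candidate for a lower concurrency setting.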
    

Extension Points

  1. Custom URL Filtering:

    ->filterUrls(function (string $url, string $foundOnUrl) {
        return str_contains($url, 'important-section');
    })
    
  2. Modify Request Headers:

    ->withOptions([
        'headers' => [
            'User-Agent' => 'MyCrawler/1.0',
            'Accept-Language' => 'en-US',
        ],
    ])
    
  3. Handle Redirects:

    ->followRedirects(false) // Turn off automatic redirect following
    ->onRedirect(function (string $from, string $to) {
        Log::info("Redirect: {$from} -> {$to}");
    })
    
  4. Custom Response Processing:

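    ->onCrawled(function (string $url, CrawlResponse $response) {
        // Buffer libxml warnings from malformed HTML rather than suppressing them with @
        libxml_use_internal_errors(true);
        $dom = new \DOMDocument();
        $dom->loadHTML($response->body());
        libxml_clear_errors();
        // Process DOM here
    })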
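
The DOM processing step in the last item can be factored into a standalone function, which makes it testable without running a crawl. This is a minimal sketch using PHP's bundled DOM extension; `extractPageData()` is a hypothetical helper, and it takes a raw HTML string so it makes no assumptions about the response object.

```php
<?php

declare(strict_types=1);

/**
 * Extract the <title> and all link targets from an HTML string.
 * Hypothetical helper built on PHP's bundled DOM extension.
 *
 * @return array{title: ?string, links: list<string>}
 */
function extractPageData(string $html): array
{
    if (trim($html) === '') {
        // DOMDocument::loadHTML() rejects an empty string on PHP 8
        return ['title' => null, 'links' => []];
    }

    // Collect parser warnings internally so malformed HTML stays quiet
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $xpath = new DOMXPath($dom);

    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode !== null ? trim($titleNode->textContent) : null;

    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    return ['title' => $title, 'links' => $links];
}

$data = extractPageData(
    '<html><head><title> About </title></head>'
    . '<body><a href="/team">Team</a><a href="/jobs">Jobs</a></body></html>'
);
```

Inside a crawl callback, the function would presumably be called with `$response->body()`, as in the example above.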
    

Performance Optimization

  • Cache Responses:
    ->withCache(new FileCache(storage_path('crawler_cache')))
    
  • Prioritize URLs:
    ->prioritizeUrls(['/high-priority-page'])
    
  • Disable Unnecessary Features:
    ->withoutBrowser() // Skip JS rendering
    ->withoutCurl()    // Use Guzzle only
    