spatie/crawler
PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.
Strengths:
Event::dispatch()) or queue workers (e.g., shouldQueue).
spatie/browsershot (Puppeteer/Chrome) for SPAs or dynamic content, critical for modern web scraping.
CrawlProgress and FinishReason align with Laravel’s logging (e.g., Log::info()) and monitoring needs.
Weaknesses:
shouldStopCallback).
Crawler::create()->onCrawled(fn() => dispatch(new ProcessUrlJob($url)))).
Cache::remember() for foundUrls()).
urls table with url, status, last_crawled_at).
max_execution_time in PHP).
CrawlResponse).
queue:work with sleep() in callbacks to throttle requests.
robots.txt, terms of service)?
UrlCrawled) or notifications?
$this->app->singleton(Crawler::class, fn() => new Crawler());
config/crawler.php) for:
'default_depth' => 3,
'max_concurrency' => 10,
'use_javascript' => env('CRAWLER_USE_JS', false),
crawl:run command for CLI execution:
php artisan crawl:run https://example.com --depth=2
CrawlerJob::dispatch('https://example.com')->onQueue('crawls');
class CrawledUrl extends Model {
    protected $fillable = ['url', 'status_code', 'last_crawled_at', 'depth'];
}
public function crawled(string $url, CrawlResponse $response) {
    CrawledUrl::updateOrCreate(
        ['url' => $url],
        ['status_code' => $response->status(), 'last_crawled_at' => now()]
    );
}
foundUrls() and onCrawled callbacks.
spatie/laravel-activitylog).
Laravel Horizon for queue stats).
dom, fileinfo, and mbstring (standard in Laravel).
guzzlehttp/guzzle (Laravel dependency).
spatie/browsershot (only if using JS rendering; ~200MB overhead).
composer require spatie/crawler.
php artisan vendor:publish --provider="Spatie\Crawler\CrawlerServiceProvider".
fake() for testing.
CRAWLER_CONCURRENCY=5).
$schedule->command('crawl:run {url}')->daily();
LoggingObserver, DatabaseObserver).
spatie/crawler and spatie/browsershot may introduce breaking changes.
PackageServiceProvider to isolate crawler logic.
interface CrawlObserverContract).
CrawlProgress for real-time monitoring (e.g., log to Laravel Log).
FinishReason to handle crawl termination gracefully.
guzzle timeout settings or implement circuit breakers.
stderr).
->uniqueUrls() or database unique constraints.
->limit() to batch requests (e.g., 100 URLs per job).
max_concurrency (default: 10) for faster crawls (but monitor server load).
CrawledUrl (e.g., DB::table('urls')->insert($batch)).

| Failure Type | Impact | Mitigation |
|---|---|---|
| HTTP Rate Limiting | Crawl halts or gets blocked. | Implement exponential backoff in callbacks. |
| Puppeteer Crashes | JS rendering fails silently. | Use health checks (e.g., |
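
The exponential backoff mitigation listed for HTTP rate limiting can be sketched in plain PHP. This is a minimal illustration, not part of spatie/crawler: the function names `backoffDelayMs` and `jitteredDelayMs`, the 500 ms base, and the 30 s cap are all assumptions chosen for the example.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not a spatie/crawler API): delay in milliseconds
 * before retry number $attempt (0-based), doubling each time up to a cap.
 */
function backoffDelayMs(int $attempt, int $baseMs = 500, int $maxMs = 30000): int
{
    // Clamp the exponent so the multiplication cannot overflow on large attempt counts.
    $exponent = min(max($attempt, 0), 20);

    return min($maxMs, $baseMs * (2 ** $exponent));
}

/**
 * Hypothetical helper: pick a random delay between 0 and the capped backoff
 * ("full jitter") so parallel workers do not all retry at the same moment.
 */
function jitteredDelayMs(int $attempt, int $baseMs = 500, int $maxMs = 30000): int
{
    return random_int(0, backoffDelayMs($attempt, $baseMs, $maxMs));
}
```

Inside a crawl callback or queued job, one option is to call `usleep(jitteredDelayMs($attempt) * 1000)` after a 429 response before retrying; in a queued job, releasing the job back onto the queue with the computed delay avoids blocking the worker.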