spatie/crawler
PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.
Installation:
composer require spatie/crawler
First Crawl:
use Spatie\Crawler\Crawler;
Crawler::create('https://example.com')
->onCrawled(function (string $url, CrawlResponse $response) {
echo "Crawled: {$url} (Status: {$response->status()})";
})
->start();
Key First Use Cases:
$urls = Crawler::create('https://example.com')
->internalOnly()
->depth(2)
->foundUrls();
Crawler::create('https://example.com')
->fake([
'https://example.com' => '<html><a href="/about">About</a></html>',
'https://example.com/about' => '<html>About page</html>',
])
->foundUrls();
Structured Crawling with Observers:
class PageLogger extends CrawlObserver {
public function crawled(string $url, CrawlResponse $response) {
Log::info("Crawled {$url} with status {$response->status()}");
}
}
Crawler::create('https://example.com')
->addObserver(new PageLogger())
->start();
Progress Tracking:
Crawler::create('https://example.com')
->onCrawled(function (string $url, CrawlResponse $response, CrawlProgress $progress) {
echo "Progress: {$progress->urlsProcessed}/{$progress->urlsFound}\n";
})
->start();
Dynamic Crawl Control:
$shouldStop = false;
Crawler::create('https://example.com')
->shouldStopCallback(function (Crawler $crawler) use (&$shouldStop) {
return $shouldStop;
})
->onCrawled(function (string $url) use (&$shouldStop) {
if ($url === 'https://example.com/stop') {
$shouldStop = true;
}
})
->start();
public function register() {
$this->app->singleton(Crawler::class, function () {
return new Crawler();
});
}
class CrawlJob implements ShouldQueue {
public function handle() {
Crawler::create('https://example.com')
->limit(1000)
->start();
}
}
Crawler::create('https://example.com')
->onCrawled(function (string $url, CrawlResponse $response) {
Page::updateOrCreate(['url' => $url], [
'status_code' => $response->status(),
'content' => $response->body(),
]);
})
->start();
JavaScript Rendering Overhead:
->withBrowser()) significantly increases crawl time. Use only when necessary.->fake() first to validate logic before enabling JS.Rate Limiting:
->concurrency(2) // Lower for sensitive sites
Robots.txt Ignored by Default:
robots.txt, use:
->respectRobotsTxt()
Duplicate URLs:
->uniqueUrls() to avoid processing the same URL multiple times.Memory Leaks:
->limit() and process results incrementally.->onFailed(function (string $url, RequestException $exception) {
Log::error("Failed to crawl {$url}: {$exception->getMessage()}");
})
->onFailed(function (string $url, RequestException $exception, TransferStatistics $stats) {
if ($stats->transferTimeInMs() > 10000) {
Log::warning("Slow crawl: {$url} took {$stats->transferTimeInMs()}ms");
}
})
Custom URL Filtering:
->filterUrls(function (string $url, string $foundOnUrl) {
return str_contains($url, 'important-section');
})
Modify Request Headers:
->withOptions([
'headers' => [
'User-Agent' => 'MyCrawler/1.0',
'Accept-Language' => 'en-US',
],
])
Handle Redirects:
->followRedirects(false) // Disable by default
->onRedirect(function (string $from, string $to) {
Log::info("Redirect: {$from} -> {$to}");
})
Custom Response Processing:
->onCrawled(function (string $url, CrawlResponse $response) {
$dom = new \DomDocument();
@$dom->loadHTML($response->body());
// Process DOM here
})
->withCache(new FileCache(storage_path('crawler_cache')))
->prioritizeUrls(['/high-priority-page'])
->withoutBrowser() // Skip JS rendering
->withoutCurl() // Use Guzzle only
How can I help you explore Laravel packages today?