spatie/crawler
Fast, concurrent web crawler for PHP, built on Guzzle. Crawl sites, restrict crawls to internal URLs, limit depth and page count, and hook into crawl events through observers. Can execute JavaScript via headless Chrome/Puppeteer (Browsershot) to crawl rendered pages. Crawl logic can be tested without real HTTP requests by injecting a mocked Guzzle handler.
composer require spatie/crawler
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

// Results are reported to a CrawlObserver (see below); the crawler does
// not return the found URLs directly.
Crawler::create()
    ->setCrawlProfile(new CrawlInternalUrls('https://example.com')) // internal URLs only
    ->setMaximumDepth(2)
    ->setCrawlObserver(new UrlExtractor()) // any CrawlObserver subclass
    ->startCrawling('https://example.com');
Key classes: Crawler is the central entry point with a fluent configuration API. CrawlObserver receives crawl events (willCrawl, crawled, crawlFailed, finishedCrawling). CrawlProfile decides which discovered URLs should be crawled. Use observers for structured, reusable crawl logic:
// app/CrawlObservers/UrlExtractor.php
namespace App\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class UrlExtractor extends CrawlObserver
{
    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        $this->extractData((string) $response->getBody());
    }

    public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        // handle or log the failure
    }
}
Usage:
Crawler::create()
    ->addCrawlObserver(new UrlExtractor())
    ->startCrawling('https://example.com');
Leverage concurrency for performance:
Crawler::create()
    ->setConcurrency(10) // 10 concurrent requests
    ->setTotalCrawlLimit(100) // stop after 100 crawled URLs
    ->addCrawlObserver(new UrlExtractor())
    ->startCrawling('https://example.com');
Filter URLs during the crawl with a custom CrawlProfile:
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class ProductsAndBlogProfile extends CrawlProfile
{
    public function shouldCrawl(UriInterface $url): bool
    {
        return str_contains($url->getPath(), 'products') || str_contains($url->getPath(), 'blog');
    }
}

Crawler::create()->setCrawlProfile(new ProductsAndBlogProfile())->startCrawling('https://example.com');
Enable JavaScript execution for SPAs:
Crawler::create()
    ->executeJavaScript() // renders pages with spatie/browsershot (headless Chrome/Puppeteer)
    ->addCrawlObserver(new UrlExtractor()) // observers receive the rendered HTML body
    ->startCrawling('https://example.com');
Chain observers for multi-step processing:
Crawler::create()
    ->addCrawlObserver(new UrlExtractor())
    ->addCrawlObserver(new DataParser())
    ->addCrawlObserver(new DatabaseSaver())
    ->startCrawling('https://example.com');
Unit test crawlers without real HTTP calls by injecting a mocked Guzzle handler:
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

$mock = new MockHandler([
    new Response(200, ['Content-Type' => 'text/html'], '<html><a href="/about">About</a></html>'),
    new Response(200, ['Content-Type' => 'text/html'], '<html>About Page</html>'),
]);

Crawler::create(['handler' => HandlerStack::create($mock)])
    ->addCrawlObserver(new UrlExtractor())
    ->startCrawling('https://example.com');
Rate Limiting:
Aggressive crawls can trigger 429 Too Many Requests. Lower ->setConcurrency(N) and add ->setDelayBetweenRequests(1000) // 1-second pause between requests
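To pick a delay for a target request rate, a quick back-of-envelope helper may be useful (purely illustrative arithmetic; delayForRate is not part of the package):

```php
<?php
// Illustrative helper: milliseconds to pass to setDelayBetweenRequests()
// for a desired sustained request rate. Not a spatie/crawler API.
function delayForRate(float $requestsPerSecond): int
{
    // Round up so the real rate never exceeds the target.
    return (int) ceil(1000 / $requestsPerSecond);
}

echo delayForRate(2.0); // 500 ms between requests for ~2 req/s
```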
Infinite Loops:
A crawl with no limits can spiral on large or cyclic sites (calendars, faceted navigation). Always set:
->setTotalCrawlLimit(1000) // safety net
->setMaximumDepth(3) // practical depth limit
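Loop risk also comes from the same page appearing under cosmetically different URLs. A sketch of a normalizer you could apply inside a CrawlProfile before deduplicating (normalizeUrl is an illustrative name, not a package API):

```php
<?php
// Sketch: canonicalize URLs so fragments, host casing, and trailing
// slashes don't make one page look like many. Illustrative only.
function normalizeUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return $url; // leave anything unparseable untouched
    }
    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host   = strtolower($parts['host']);
    $path   = rtrim($parts['path'] ?? '/', '/') ?: '/';
    $query  = isset($parts['query']) ? '?' . $parts['query'] : '';
    return "{$scheme}://{$host}{$path}{$query}"; // fragment dropped
}

echo normalizeUrl('HTTPS://Example.COM/a/#top'); // https://example.com/a
```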
JavaScript Overhead:
->executeJavaScript() launches headless Chrome per page and adds substantial latency. It applies to the whole crawl (there is no per-URL toggle), so enable it only for crawls that actually need rendering.
Robots.txt:
robots.txt (plus robots meta tags and headers) is respected by default. Disable with:
->ignoreRobots()
Memory Exhaustion:
Deep crawls queue many URLs, and large responses can exhaust memory. Use:
->setMaximumResponseSize(1024 * 1024 * 2) // skip response bodies over 2 MB
->setCurrentCrawlLimit(100) // crawl in smaller, resumable batches
Log Failed Requests:
Implement crawlFailed() in your observer:
public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
{
    Log::error("Failed: {$url}", ['exception' => $requestException]);
}
Track Progress:
The package exposes no progress object; count pages in your observer and report in finishedCrawling():
public function finishedCrawling(): void
{
    Log::info("Crawled {$this->crawledCount} urls"); // increment $this->crawledCount inside crawled()
}
Validate URLs:
Filter malformed URLs in your CrawlProfile:
public function shouldCrawl(UriInterface $url): bool
{
    return filter_var((string) $url, FILTER_VALIDATE_URL) !== false;
}
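Extending the idea, a stricter check that also whitelists schemes (a sketch; isCrawlableUrl and the scheme list are assumptions, not package behavior):

```php
<?php
// Sketch: reject malformed URLs and any scheme other than http/https.
// isCrawlableUrl is an illustrative helper, not a spatie/crawler API.
function isCrawlableUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}
```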
Custom Response Handling:
Crawled responses are PSR-7 ResponseInterface instances; wrap them in your own value object for domain-specific logic:
class ProductPage
{
    public function __construct(private ResponseInterface $response)
    {
    }

    public function extractProductData(): array
    {
        return json_decode((string) $this->response->getBody(), true) ?? [];
    }
}

Usage (inside an observer's crawled() method):
$data = (new ProductPage($response))->extractProductData();
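The body parsing itself can live in a small pure function, which keeps it unit-testable without any crawler in the loop (decodeJsonBody is an illustrative name):

```php
<?php
// Sketch: decode a JSON response body defensively, returning [] for
// anything that isn't a JSON object/array. Illustrative helper only.
function decodeJsonBody(string $body): array
{
    $data = json_decode($body, true);
    return is_array($data) ? $data : [];
}
```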
Custom Client Options: Add headers/cookies globally by passing Guzzle options to create():
use GuzzleHttp\Cookie\CookieJar;
use Spatie\Crawler\Crawler;

Crawler::create([
    'headers' => ['User-Agent' => 'MyCrawler/1.0'],
    'cookies' => CookieJar::fromArray(['session_id' => 'abc123'], 'example.com'),
])->startCrawling('https://example.com');
Queue Integration: Offload crawls to Laravel queues:
// app/Jobs/CrawlJob.php
public function handle(): void
{
    Crawler::create()
        ->addCrawlObserver(new UrlExtractor())
        ->startCrawling($this->url);
}

Dispatch per-page jobs (e.g. ProcessPage) from within the observer's crawled() method.
Trade-offs:
Higher concurrency (e.g. ->setConcurrency(20)) speeds up crawls but increases load on the target server.
Shallow crawls (->setMaximumDepth(1)) are faster but miss nested content.
A mocked Guzzle handler keeps tests fast, but it doesn't replicate real-world latency, redirects, or JavaScript rendering.
Configuration:
spatie/crawler is framework-agnostic and ships no service provider or config file to publish; set defaults through the fluent API and Guzzle client options, or wrap them in your own factory:
Crawler::create(['timeout' => 30])
    ->setConcurrency(10)
    ->setUserAgent('MyAppCrawler/1.0')
    ->startCrawling('https://example.com');