spatie/crawler
PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.
By default, the crawler continues until it has crawled every page it can find. This can cause problems in constrained environments, such as a serverless platform with a limited execution time.
You can limit how deep the crawler will go using the depth() method.
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->depth(2)
    ->start();
A depth of 0 means only the start URL will be crawled. A depth of 1 means the start URL and any pages it links to, and so on.
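The depth rule above can be illustrated without any HTTP requests. The following is a minimal, self-contained sketch, not the crawler's actual implementation: `pagesWithinDepth()` and the `$links` graph are hypothetical names invented for this example, and the walk is a plain breadth-first traversal over an in-memory array.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of spatie/crawler): breadth-first walk over
 * an in-memory link graph that stops following links at a maximum depth.
 *
 * @param array<string, list<string>> $links page => pages it links to
 * @return list<string> pages visited, in discovery order
 */
function pagesWithinDepth(array $links, string $start, int $maxDepth): array
{
    $depthOf = [$start => 0]; // depth at which each page was first discovered
    $pending = [$start];      // FIFO queue of pages whose links are unexplored

    while ($pending !== []) {
        $page = array_shift($pending);

        // Pages at the maximum depth are visited, but their links are ignored.
        if ($depthOf[$page] >= $maxDepth) {
            continue;
        }

        foreach ($links[$page] ?? [] as $target) {
            if (!isset($depthOf[$target])) {
                $depthOf[$target] = $depthOf[$page] + 1;
                $pending[] = $target;
            }
        }
    }

    return array_keys($depthOf);
}

// Assumed sample site: the home page links to two pages, one of which links deeper.
$links = [
    '/' => ['/about', '/blog'],
    '/blog' => ['/blog/post-1'],
    '/blog/post-1' => ['/'],
];

print_r(pagesWithinDepth($links, '/', 0)); // the start page only
print_r(pagesWithinDepth($links, '/', 1)); // the start page plus the pages it links to
```

With a depth of 2, the same graph would also include `/blog/post-1`, since it is two links away from the start page.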
The crawl behavior can be controlled with these options:
- limit(): the maximum number of URLs to crawl across all executions
- limitPerExecution(): how many URLs to process during the current crawl
- timeLimit(): the maximum execution time in seconds across all executions
- timeLimitPerExecution(): the maximum execution time in seconds for the current crawl

When any of these limits are reached, the crawler stops and returns a FinishReason from start(). See tracking progress for details.
The limit() method allows you to limit the total number of URLs to crawl, no matter how often you call the crawler.
use Spatie\Crawler\Crawler;
use Spatie\Crawler\Enums\FinishReason;

$queue = <your queue implementation>;

// Crawls 5 URLs and ends.
$reason = Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(5)
    ->start();
// $reason will be FinishReason::CrawlLimitReached

// Doesn't crawl further as the total limit is reached.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(5)
    ->start();
The limitPerExecution() method limits how many URLs will be crawled in a single execution. This is especially useful when a crawl is spread across multiple requests. The following code processes 5 pages per execution, with no total limit on the number of pages crawled.
use Spatie\Crawler\Crawler;

$queue = <your queue implementation>;

// Crawls 5 URLs and ends.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limitPerExecution(5)
    ->start();

// Crawls the next 5 URLs and ends.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limitPerExecution(5)
    ->start();
The timeLimit() method sets the maximum execution time across all executions. The timeLimitPerExecution() method sets the maximum execution time for a single crawl. Both accept a value in seconds.
use Spatie\Crawler\Crawler;

// Stop crawling after 60 seconds total
$reason = Crawler::create('https://example.com')
    ->timeLimit(60)
    ->start();
// $reason will be FinishReason::TimeLimitReached if time ran out

$queue = <your queue implementation>;

// Stop each execution after 30 seconds, but allow resuming
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->timeLimitPerExecution(30)
    ->start();
All limits can be combined to control the crawler:
use Spatie\Crawler\Crawler;

$queue = <your queue implementation>;

// Crawls 5 URLs and ends.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(10)
    ->limitPerExecution(5)
    ->start();

// Crawls the next 5 URLs and ends.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(10)
    ->limitPerExecution(5)
    ->start();

// Doesn't crawl further as the total limit is reached.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(10)
    ->limitPerExecution(5)
    ->start();
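To make the interaction between the total limit and the per-execution limit concrete, here is a small simulation that runs without the package. It is only a sketch under stated assumptions: `FakeQueue` and `runExecution()` are hypothetical names made up for this example, and the package's real queue and bookkeeping may work differently. The key idea it models is that the processed count lives in the shared queue, so it survives between executions.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a persisted crawl queue (not a spatie/crawler class).
 * It remembers the pending URLs and how many were processed across executions.
 */
final class FakeQueue
{
    public int $processed = 0;

    /** @param list<string> $pending */
    public function __construct(public array $pending)
    {
    }
}

/**
 * Simulates one execution and returns how many URLs it processed.
 * The loop stops when the queue is empty or either limit is reached.
 */
function runExecution(FakeQueue $queue, int $totalLimit, int $perExecutionLimit): int
{
    $processedNow = 0;

    while (
        $queue->pending !== []
        && $queue->processed < $totalLimit      // limit across all executions
        && $processedNow < $perExecutionLimit   // limit for this execution only
    ) {
        array_shift($queue->pending); // "crawl" the next URL
        $queue->processed++;
        $processedNow++;
    }

    return $processedNow;
}

// Assumed sample data: 25 pending URLs, a total limit of 10, and 5 per execution.
$queue = new FakeQueue(array_map(fn (int $i): string => "/page-{$i}", range(1, 25)));

echo runExecution($queue, 10, 5), PHP_EOL; // first execution
echo runExecution($queue, 10, 5), PHP_EOL; // second execution
echo runExecution($queue, 10, 5), PHP_EOL; // third execution: total limit already reached
```

The three calls mirror the three executions in the example above: two executions of 5 URLs each, followed by one that processes nothing because the total of 10 has been reached.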