spatie/crawler
PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.
To speed up the crawl, the package crawls 10 URLs concurrently by default. You can change this number using the concurrency method.
use Spatie\Crawler\Crawler;
Crawler::create('https://example.com')
    ->concurrency(1) // crawl URLs one by one
    ->start();
By default, there is no delay between requests. In some cases you might get rate limited when crawling too aggressively. You can add a pause between every request using the delay method. The value is expressed in milliseconds.
use Spatie\Crawler\Crawler;
Crawler::create('https://example.com')
    ->delay(150) // wait 150ms after every page
    ->start();
For more control over request pacing, you can use a throttle. A throttle is a class that implements Spatie\Crawler\Throttlers\Throttle. When a throttle is set, it takes precedence over the delay method.
The FixedDelayThrottle works like delay(), but as a class you can pass around and configure independently.
use Spatie\Crawler\Crawler;
use Spatie\Crawler\Throttlers\FixedDelayThrottle;
Crawler::create('https://example.com')
    ->throttle(new FixedDelayThrottle(delayMs: 150))
    ->start();
The AdaptiveThrottle adjusts the delay based on how fast the server responds. When the server is slow, the crawler backs off. When it speeds up, the delay decreases. You can configure minimum and maximum bounds.
use Spatie\Crawler\Crawler;
use Spatie\Crawler\Throttlers\AdaptiveThrottle;
Crawler::create('https://example.com')
    ->throttle(new AdaptiveThrottle(
        minDelayMs: 50,
        maxDelayMs: 5000,
    ))
    ->start();
The delay is calculated as an exponential moving average: (currentDelay + latency) / 2, clamped to the configured bounds.
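The formula above can be traced by hand with a small standalone sketch. The function nextDelayMs is a hypothetical helper written for illustration only; it is not part of spatie/crawler, and it assumes the latency has already been converted to milliseconds so that it shares a unit with the delay bounds.

```php
<?php

declare(strict_types=1);

// Hypothetical helper illustrating the documented formula.
// Not part of spatie/crawler; all values are assumed to be in milliseconds.
function nextDelayMs(
    float $currentDelayMs,
    float $latencyMs,
    float $minDelayMs,
    float $maxDelayMs
): float {
    // Average the previous delay with the latest observed latency.
    $averaged = ($currentDelayMs + $latencyMs) / 2;

    // Clamp the result to the configured bounds.
    return max($minDelayMs, min($maxDelayMs, $averaged));
}

// Start at the minimum delay and feed in three sample latencies.
$delay = 50.0;
foreach ([400.0, 400.0, 20.0] as $latency) {
    $delay = nextDelayMs($delay, $latency, 50.0, 5000.0);
    echo $delay, PHP_EOL; // prints 225, then 312.5, then 166.25
}
```

Because each step weights the previous delay and the new latency equally, a single slow response moves the delay only halfway toward that latency, and a fast response pulls it back down just as gradually.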
You can create your own throttle by implementing the Throttle interface:
use Spatie\Crawler\Throttlers\Throttle;
class MyThrottle implements Throttle
{
    public function sleep(): void
    {
        // Called after each response. Pause here.
    }

    public function recordResponseTime(float $seconds): void
    {
        // Called with the transfer time of each response.
    }
}
By default, URLs without a scheme are prefixed with https. You can change this using the defaultScheme method.
use Spatie\Crawler\Crawler;
Crawler::create('example.com')
    ->defaultScheme('http')
    ->start();