spatie/crawler
PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.
When crawling a site, the crawler stores URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in ArrayCrawlQueue.
The built-in ArrayCrawlQueue normalizes URLs before using them as deduplication keys. This means that https://Example.com/page and https://example.com/page/ are treated as the same URL, preventing redundant requests.
The following normalizations are applied (per RFC 3986):
- The host is lowercased (Example.com becomes example.com)
- Default ports are removed (:80 for http, :443 for https)
- Trailing slashes are removed (/)

The original URL is preserved on the CrawlUrl object and used for HTTP requests and observer notifications. Only the internal deduplication key uses the normalized form.
If you implement a custom crawl queue, consider applying similar normalizations to avoid crawling duplicate URLs.
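As a rough illustration of what such normalization could look like in a custom queue, here is a minimal sketch built on PHP's parse_url(). The function name normalizeUrlForDedup is hypothetical and not part of spatie/crawler, and the package's own normalization may differ in its details. This sketch only covers host and scheme casing, default ports, and trailing slashes. Dropping fragments and user info is a simplification made here, not documented package behavior.

```php
<?php

/**
 * Hypothetical helper (not part of spatie/crawler): builds a deduplication
 * key from a URL. Keep the original URL around for the actual request.
 */
function normalizeUrlForDedup(string $url): string
{
    $parts = parse_url($url);

    // Leave anything that cannot be parsed, or has no scheme/host, untouched.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url;
    }

    $scheme = strtolower($parts['scheme']);
    $host = strtolower($parts['host']);

    // Drop the port only when it is the default for the scheme.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = $parts['port'] ?? null;
    if ($port !== null && ($defaultPorts[$scheme] ?? null) === $port) {
        $port = null;
    }

    // Treat "/page/" and "/page" as the same path.
    $path = rtrim($parts['path'] ?? '', '/');

    $key = $scheme . '://' . $host;
    if ($port !== null) {
        $key .= ':' . $port;
    }
    $key .= $path;

    // The query string is kept as-is; fragments and user info are
    // ignored (an assumption of this sketch).
    if (isset($parts['query'])) {
        $key .= '?' . $parts['query'];
    }

    return $key;
}

// Both spellings from the example above map to the same key.
var_dump(
    normalizeUrlForDedup('https://Example.com/page')
        === normalizeUrlForDedup('https://example.com/page/')
); // bool(true)
```

A custom queue would use the returned string only as its lookup key and keep the untouched URL on the stored object.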
When a site is very large, you may want to store the queue elsewhere, for example in a database. To do so, write your own crawl queue by implementing the Spatie\Crawler\CrawlQueues\CrawlQueue interface and pass an instance of it to crawlQueue():
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->crawlQueue(new MyCustomQueue())
    ->start();
The CrawlQueue interface requires the following methods:
interface CrawlQueue
{
    public function add(CrawlUrl $url): self;

    public function has(string $url): bool;

    public function hasPendingUrls(): bool;

    public function getUrlById(mixed $id): CrawlUrl;

    public function getPendingUrl(): ?CrawlUrl;

    public function hasAlreadyBeenProcessed(CrawlUrl $url): bool;

    public function markAsProcessed(CrawlUrl $crawlUrl): void;

    public function getProcessedUrlCount(): int;

    public function getUrlCount(): int; // total URLs added to the queue

    public function getPendingUrlCount(): int; // URLs not yet processed
}
The getUrlCount() and getPendingUrlCount() methods are used by the CrawlProgress object to report queue statistics. See tracking progress for details.
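To make the bookkeeping behind these methods more concrete, here is a minimal in-memory sketch. It is self-contained on purpose: DemoCrawlUrl and DemoArrayQueue are hypothetical stand-ins that only mirror the method names listed above. They do not implement the real interface, and they are not how ArrayCrawlQueue is actually written. A real implementation would type-hint the package's CrawlUrl and declare `implements CrawlQueue`.

```php
<?php

// Stand-in for the package's CrawlUrl, defined only so this sketch runs on its own.
final class DemoCrawlUrl
{
    public mixed $id = null;

    public function __construct(public string $url) {}
}

// Mirrors the CrawlQueue method names; does not implement the real interface.
final class DemoArrayQueue
{
    /** @var array<string, DemoCrawlUrl> every URL ever added, keyed by URL */
    private array $urls = [];

    /** @var array<string, DemoCrawlUrl> URLs still waiting to be crawled */
    private array $pending = [];

    public function add(DemoCrawlUrl $url): self
    {
        $key = $url->url; // a real queue might normalize the URL here

        // Ignore URLs that were added before.
        if (!isset($this->urls[$key])) {
            $url->id = $key;
            $this->urls[$key] = $url;
            $this->pending[$key] = $url;
        }

        return $this;
    }

    public function has(string $url): bool
    {
        return isset($this->urls[$url]);
    }

    public function hasPendingUrls(): bool
    {
        return $this->pending !== [];
    }

    public function getUrlById(mixed $id): DemoCrawlUrl
    {
        return $this->urls[$id] ?? throw new OutOfBoundsException('Unknown URL id');
    }

    public function getPendingUrl(): ?DemoCrawlUrl
    {
        // Returns the oldest pending URL without removing it from the queue.
        foreach ($this->pending as $url) {
            return $url;
        }

        return null;
    }

    public function hasAlreadyBeenProcessed(DemoCrawlUrl $url): bool
    {
        return isset($this->urls[$url->url]) && !isset($this->pending[$url->url]);
    }

    public function markAsProcessed(DemoCrawlUrl $crawlUrl): void
    {
        unset($this->pending[$crawlUrl->url]);
    }

    public function getProcessedUrlCount(): int
    {
        return count($this->urls) - count($this->pending);
    }

    public function getUrlCount(): int
    {
        return count($this->urls);
    }

    public function getPendingUrlCount(): int
    {
        return count($this->pending);
    }
}

$queue = (new DemoArrayQueue())
    ->add(new DemoCrawlUrl('https://example.com'))
    ->add(new DemoCrawlUrl('https://example.com/about'))
    ->add(new DemoCrawlUrl('https://example.com')); // duplicate, ignored

$queue->markAsProcessed($queue->getPendingUrl());
```

A database-backed queue could follow the same shape, with the two arrays replaced by a table and a "processed" flag per row.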
Here are some queue implementations: