Crawler Laravel Package

spatie/crawler

PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.

Custom crawl queue

While crawling a site, the crawler keeps the URLs it still has to visit in a queue. By default, this queue lives in memory, in the built-in ArrayCrawlQueue.

URL normalization

The built-in ArrayCrawlQueue normalizes URLs before using them as deduplication keys. This means that https://Example.com/page and https://example.com/page/ are treated as the same URL, preventing redundant requests.

The following normalizations are applied (per RFC 3986):

  • Lowercasing scheme and host
  • Removing default ports (:80 for http, :443 for https)
  • Stripping trailing slashes (except for the root /)
  • Removing empty query strings
  • Stripping URL fragments

The original URL is preserved on the CrawlUrl object and used for HTTP requests and observer notifications. Only the internal deduplication key uses the normalized form.

If you implement a custom crawl queue, consider applying similar normalizations to avoid crawling duplicate URLs.
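The rules listed above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the code ArrayCrawlQueue uses: the name dedupKey is hypothetical, the sketch ignores user info in the URL, and it treats a missing path as the root path.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: builds a deduplication key from a URL.
 * It mirrors the normalization rules described in the text and is
 * not taken from spatie/crawler itself.
 */
function dedupKey(string $url): string
{
    $parts = parse_url($url);

    // Leave malformed or relative URLs untouched rather than guessing.
    if ($parts === false || ! isset($parts['scheme'], $parts['host'])) {
        return $url;
    }

    // Lowercase scheme and host.
    $scheme = strtolower($parts['scheme']);
    $host = strtolower($parts['host']);

    // Keep the port only when it is not the default for the scheme.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = $parts['port'] ?? null;
    $portPart = ($port !== null && ($defaultPorts[$scheme] ?? null) !== $port)
        ? ":{$port}"
        : '';

    // Strip trailing slashes, but keep the root path as "/".
    $path = rtrim($parts['path'] ?? '/', '/');
    if ($path === '') {
        $path = '/';
    }

    // Drop an empty query string; the fragment is never copied over.
    $query = $parts['query'] ?? '';
    $queryPart = $query !== '' ? "?{$query}" : '';

    return "{$scheme}://{$host}{$portPart}{$path}{$queryPart}";
}
```

With this sketch, dedupKey('https://Example.com/page') and dedupKey('https://example.com/page/') both return https://example.com/page, so a custom queue could store the original URL and use the returned string only as its lookup key.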

When a site is very large, you may want to store the queue elsewhere, for example in a database. To do so, write your own crawl queue by implementing the Spatie\Crawler\CrawlQueues\CrawlQueue interface, then pass an instance to the crawler with crawlQueue():

use Spatie\Crawler\Crawler;

// MyCustomQueue is your own class implementing the CrawlQueue interface.
Crawler::create('https://example.com')
    ->crawlQueue(new MyCustomQueue())
    ->start();

The CrawlQueue interface requires the following methods:

interface CrawlQueue
{
    public function add(CrawlUrl $url): self;
    public function has(string $url): bool;
    public function hasPendingUrls(): bool;
    public function getUrlById(mixed $id): CrawlUrl;
    public function getPendingUrl(): ?CrawlUrl;
    public function hasAlreadyBeenProcessed(CrawlUrl $url): bool;
    public function markAsProcessed(CrawlUrl $crawlUrl): void;
    public function getProcessedUrlCount(): int;
    public function getUrlCount(): int;        // total URLs added to the queue
    public function getPendingUrlCount(): int;  // URLs not yet processed
}

The getUrlCount() and getPendingUrlCount() methods are used by the CrawlProgress object to report queue statistics. See tracking progress for details.
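To give a rough idea of what a database-backed queue has to keep track of, here is a minimal storage sketch using PDO with SQLite. Everything in it is an assumption made for illustration: the SqliteUrlStore class, its method names, and the crawl_urls table are hypothetical and not part of spatie/crawler. It works on plain strings and row ids; a real CrawlQueue implementation would wrap a store like this and convert rows to and from CrawlUrl objects, which is left out here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical storage layer for a database-backed crawl queue.
 * Not part of spatie/crawler; method names loosely mirror the interface.
 */
final class SqliteUrlStore
{
    public function __construct(private PDO $pdo)
    {
        $this->pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $this->pdo->exec(
            'CREATE TABLE IF NOT EXISTS crawl_urls (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                dedup_key TEXT NOT NULL UNIQUE,
                url TEXT NOT NULL,
                processed INTEGER NOT NULL DEFAULT 0
            )'
        );
    }

    /** Stores the original URL under its dedup key; returns false for duplicates. */
    public function add(string $url, string $dedupKey): bool
    {
        $stmt = $this->pdo->prepare(
            'INSERT OR IGNORE INTO crawl_urls (dedup_key, url) VALUES (?, ?)'
        );
        $stmt->execute([$dedupKey, $url]);

        return $stmt->rowCount() === 1;
    }

    public function has(string $dedupKey): bool
    {
        $stmt = $this->pdo->prepare('SELECT 1 FROM crawl_urls WHERE dedup_key = ?');
        $stmt->execute([$dedupKey]);

        return $stmt->fetchColumn() !== false;
    }

    /** Returns the oldest unprocessed row as ['id' => int, 'url' => string], or null. */
    public function nextPending(): ?array
    {
        $row = $this->pdo
            ->query('SELECT id, url FROM crawl_urls WHERE processed = 0 ORDER BY id LIMIT 1')
            ->fetch(PDO::FETCH_ASSOC);

        return $row === false ? null : ['id' => (int) $row['id'], 'url' => $row['url']];
    }

    public function markAsProcessed(int $id): void
    {
        $this->pdo
            ->prepare('UPDATE crawl_urls SET processed = 1 WHERE id = ?')
            ->execute([$id]);
    }

    /** Total number of URLs ever added. */
    public function countUrls(): int
    {
        return (int) $this->pdo->query('SELECT COUNT(*) FROM crawl_urls')->fetchColumn();
    }

    /** Number of URLs not yet processed. */
    public function countPending(): int
    {
        return (int) $this->pdo
            ->query('SELECT COUNT(*) FROM crawl_urls WHERE processed = 0')
            ->fetchColumn();
    }

    /** Number of URLs already processed. */
    public function countProcessed(): int
    {
        return $this->countUrls() - $this->countPending();
    }
}

// Usage with an in-memory database; a file path or another PDO driver would persist the queue.
$store = new SqliteUrlStore(new PDO('sqlite::memory:'));
$store->add('https://Example.com/page', 'https://example.com/page');
$store->add('https://example.com/page/', 'https://example.com/page'); // same key, ignored
```

The UNIQUE constraint on the dedup key lets the database reject duplicates itself, and the three count methods correspond to the figures that getUrlCount(), getPendingUrlCount(), and getProcessedUrlCount() are expected to report.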

Here are some queue implementations:
