Crawler Laravel Package

spatie/crawler

PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.

Filtering URLs

By default, the crawler will crawl every URL it finds, including links to external sites. You can control which URLs are crawled using scope helpers or custom crawl profiles.

Scope helpers

The simplest way to filter URLs is with the built-in scope helpers:

use Spatie\Crawler\Crawler;

// Only crawl URLs on the same host
Crawler::create('https://example.com')
    ->internalOnly()
    ->start();

// Crawl URLs on the same host and its subdomains
Crawler::create('https://example.com')
    ->internalOnly()
    ->includeSubdomains()
    ->start();

Matching www and non-www

By default, internalOnly() treats example.com and www.example.com as different hosts. If you want them to be treated as equivalent, chain the matchWww() method:

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->internalOnly()
    ->matchWww()
    ->start();

This will crawl links on both example.com and www.example.com. It works in both directions: starting from www.example.com will also include example.com links.
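
The www equivalence described above can be pictured as a host comparison that ignores one leading `www.` label. The following is only a rough sketch of that idea in plain PHP (8.0 or later), not the package's actual implementation; `stripWww()` and `sameHostIgnoringWww()` are hypothetical helper names.

```php
<?php

// Hypothetical helper, not part of spatie/crawler: removes one leading
// "www." label so that both host spellings compare as equal.
function stripWww(string $host): string
{
    $host = strtolower($host);

    return str_starts_with($host, 'www.') ? substr($host, 4) : $host;
}

// Hypothetical helper: true when both URLs share a host, ignoring "www.".
function sameHostIgnoringWww(string $baseUrl, string $candidateUrl): bool
{
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    $candidateHost = parse_url($candidateUrl, PHP_URL_HOST);

    // parse_url() returns null (no host) or false (malformed URL) on failure.
    if (!is_string($baseHost) || !is_string($candidateHost)) {
        return false;
    }

    return stripWww($baseHost) === stripWww($candidateHost);
}

var_dump(sameHostIgnoringWww('https://example.com', 'https://www.example.com/about')); // bool(true)
var_dump(sameHostIgnoringWww('https://www.example.com', 'https://example.com/about')); // bool(true)
var_dump(sameHostIgnoringWww('https://example.com', 'https://blog.example.com/'));     // bool(false)
```

Because both hosts are normalized before comparing, the check is symmetric, which mirrors the "works in both directions" behavior.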

Combining matchWww and includeSubdomains

matchWww() and includeSubdomains() can be used together. When includeSubdomains() is enabled, www is stripped from both hosts before the subdomain check. This means blog.example.com will match a base URL of www.example.com.

use Spatie\Crawler\Crawler;

Crawler::create('https://www.example.com')
    ->internalOnly()
    ->matchWww()
    ->includeSubdomains()
    ->start();

This will crawl www.example.com, example.com, blog.example.com, cdn.example.com, and any other subdomain of example.com.
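
To make the combined rule concrete, here is a minimal sketch of a subdomain check that strips `www.` from both hosts first. It is an illustration in plain PHP under that stated assumption, not the crawler's own code; `isSameSiteOrSubdomain()` is a hypothetical name.

```php
<?php

// Hypothetical helper, not part of spatie/crawler: true when $candidateHost
// equals $baseHost or is a subdomain of it, after removing a leading "www."
// from both hosts.
function isSameSiteOrSubdomain(string $baseHost, string $candidateHost): bool
{
    $normalize = static fn (string $host): string
        => (string) preg_replace('/^www\./', '', strtolower($host));

    $base = $normalize($baseHost);
    $candidate = $normalize($candidateHost);

    // The leading dot keeps "notexample.com" from matching "example.com".
    return $candidate === $base || str_ends_with($candidate, '.' . $base);
}

var_dump(isSameSiteOrSubdomain('www.example.com', 'blog.example.com')); // bool(true)
var_dump(isSameSiteOrSubdomain('www.example.com', 'example.com'));      // bool(true)
var_dump(isSameSiteOrSubdomain('www.example.com', 'notexample.com'));   // bool(false)
```

Without the stripping step, `blog.example.com` would be compared against `www.example.com` and would not count as one of its subdomains.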

Inline filtering

For custom filtering logic, use the shouldCrawl method with a closure:

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->shouldCrawl(function (string $url) {
        return !str_contains($url, '/admin');
    })
    ->start();
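
A closure like this can be tried out on its own before it is attached to a crawl. The sketch below runs the same predicate over a few made-up URLs and compares it with a stricter, hypothetical variant that only looks at path segments. It uses plain PHP only and makes no assumptions about the crawler beyond the closure signature shown above.

```php
<?php

// The same predicate as the closure passed to shouldCrawl() above.
$shouldCrawl = function (string $url): bool {
    return !str_contains($url, '/admin');
};

// Hypothetical stricter variant: rejects a URL only when "admin" is a whole
// path segment, and ignores the query string.
$shouldCrawlStrict = function (string $url): bool {
    $path = (string) parse_url($url, PHP_URL_PATH);

    return !in_array('admin', explode('/', $path), true);
};

// Made-up sample URLs for illustration.
$urls = [
    'https://example.com/',
    'https://example.com/admin/users',
    'https://example.com/blog/administration-tips',
    'https://example.com/blog?next=/admin',
];

$kept = array_values(array_filter($urls, $shouldCrawl));
$keptStrict = array_values(array_filter($urls, $shouldCrawlStrict));

print_r($kept);       // only https://example.com/
print_r($keptStrict); // every URL except https://example.com/admin/users
```

The substring check also rejects `/blog/administration-tips` and the URL whose query string contains `/admin`, which may or may not be what you want.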

Custom crawl profiles

For reusable filtering logic, create a class that implements Spatie\Crawler\CrawlProfiles\CrawlProfile:

use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class MyCustomProfile implements CrawlProfile
{
    public function shouldCrawl(string $url): bool
    {
        return parse_url($url, PHP_URL_HOST) === 'example.com'
            && !str_contains($url, '/private');
    }
}

Then pass it to the crawler:

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->crawlProfile(new MyCustomProfile())
    ->start();
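
A profile becomes easier to reuse when the host and the blocked paths are constructor arguments instead of hard-coded strings. The following is a sketch of that idea; `HostAndPathProfile` is a hypothetical class. With the package installed it could declare `implements CrawlProfile`, since it has the same `shouldCrawl(string $url): bool` method, but that is left out here so the example runs on its own.

```php
<?php

// Hypothetical class, not shipped with spatie/crawler. The "implements
// CrawlProfile" clause is omitted so this sketch runs without the package.
final class HostAndPathProfile
{
    /** @param string[] $blockedPrefixes Path prefixes that must not be crawled. */
    public function __construct(
        private string $allowedHost,
        private array $blockedPrefixes = [],
    ) {
    }

    public function shouldCrawl(string $url): bool
    {
        // Compare hosts case-insensitively; parse_url() does not lowercase them.
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host) || strcasecmp($host, $this->allowedHost) !== 0) {
            return false;
        }

        // Check only the path, so query strings cannot trigger a block.
        $path = parse_url($url, PHP_URL_PATH);
        $path = is_string($path) ? $path : '/';

        foreach ($this->blockedPrefixes as $prefix) {
            if (str_starts_with($path, $prefix)) {
                return false;
            }
        }

        return true;
    }
}

$profile = new HostAndPathProfile('example.com', ['/private']);

var_dump($profile->shouldCrawl('https://example.com/docs'));              // bool(true)
var_dump($profile->shouldCrawl('https://EXAMPLE.com/private/report'));    // bool(false)
var_dump($profile->shouldCrawl('https://example.com/search?q=/private')); // bool(true)
var_dump($profile->shouldCrawl('https://other.test/docs'));               // bool(false)
```

Since `shouldCrawl()` is a pure function of the URL, a profile like this can be unit tested without starting a crawl.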

This package comes with three built-in profiles:

  • CrawlAllUrls: crawls all URLs on all pages, including external sites (this is the default)
  • CrawlInternalUrls: only crawls URLs on the same host
  • CrawlSubdomains: crawls URLs on the same host and its subdomains

Always crawl and never crawl

Sometimes you need to override your crawl profile for specific URL patterns. The alwaysCrawl and neverCrawl methods accept arrays of patterns (using fnmatch syntax) that take priority over your crawl profile.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->internalOnly()
    ->alwaysCrawl(['https://cdn.example.com/*'])
    ->neverCrawl(['*/admin/*', '*/tmp/*'])
    ->start();

alwaysCrawl patterns bypass both the crawl profile and robots.txt rules. This is useful for checking external assets (like CDN resources) while keeping the crawl scoped to your own site.

neverCrawl patterns block matching URLs from being added to the crawl queue, regardless of what the crawl profile returns.

When a URL matches both an alwaysCrawl and a neverCrawl pattern, alwaysCrawl wins.
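
The precedence described above (alwaysCrawl first, then neverCrawl, then the crawl profile) can be sketched with PHP's built-in `fnmatch()`. This is an illustration of the documented order only, not the package's internals, and it leaves robots.txt out entirely; `matchesAny()` and `decideCrawl()` are hypothetical helpers, and the callable stands in for a crawl profile.

```php
<?php

// Hypothetical helper: true when $url matches at least one fnmatch() pattern.
function matchesAny(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (fnmatch($pattern, $url)) {
            return true;
        }
    }

    return false;
}

// Hypothetical helper mirroring the documented precedence.
function decideCrawl(string $url, array $always, array $never, callable $profile): bool
{
    if (matchesAny($url, $always)) {
        return true; // alwaysCrawl wins, even over neverCrawl
    }

    if (matchesAny($url, $never)) {
        return false; // neverCrawl overrides the profile
    }

    return (bool) $profile($url); // otherwise the profile decides
}

$always = ['https://cdn.example.com/*'];
$never = ['*/admin/*', '*/tmp/*'];

// Stand-in for an internal-only profile (assumption for this sketch).
$internalOnly = fn (string $url): bool => parse_url($url, PHP_URL_HOST) === 'example.com';

var_dump(decideCrawl('https://cdn.example.com/app.js', $always, $never, $internalOnly));       // bool(true)
var_dump(decideCrawl('https://cdn.example.com/tmp/cache.js', $always, $never, $internalOnly)); // bool(true)
var_dump(decideCrawl('https://example.com/admin/users', $always, $never, $internalOnly));      // bool(false)
var_dump(decideCrawl('https://example.com/blog', $always, $never, $internalOnly));             // bool(true)
```

Under plain `fnmatch()`, the pattern `*/admin/*` does not match `https://example.com/admin` because that URL has no slash after `admin`. If the bare path matters, consider adding a second pattern such as `*/admin`.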
