Crawler Laravel Package

spatie/crawler

PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.

Extracting resources

By default, the crawler only extracts links (<a> tags and some <link> tags) from each page. You can also instruct it to extract images, scripts, stylesheets, and Open Graph images. This is useful for broken asset checking, content auditing, or building a complete inventory of a site's resources.

Extracting specific resource types

Use the alsoExtract method to extract additional resource types alongside links:

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\Enums\ResourceType;

Crawler::create('https://example.com')
    ->alsoExtract(ResourceType::Image, ResourceType::Stylesheet)
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$response->resourceType()->value}: {$url}\n";
    })
    ->start();

The available resource types are:

| Type | What it extracts |
| --- | --- |
| ResourceType::Link | <a> tags, <link rel="next/prev">, <link hreflang> (always included) |
| ResourceType::Image | <img src> and <img data-src> (lazy loaded images) |
| ResourceType::Script | <script src> and <link rel="modulepreload"> |
| ResourceType::Stylesheet | <link rel="stylesheet">, <link type="text/css">, <link as="style"> |
| ResourceType::OpenGraphImage | <meta property="og:image"> and <meta property="twitter:image"> |
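
To make the ResourceType::Image entry above more concrete, here is a minimal sketch in plain PHP that collects the same two attributes (src and data-src on <img> tags) with DOMDocument and DOMXPath. It is not the crawler's own code: the function name extractImageUrls and the sample HTML are hypothetical, and the real extraction logic may differ.

```php
<?php

/**
 * Hypothetical helper (not part of spatie/crawler): collect the values of
 * <img src> and <img data-src>, the two attributes listed for images.
 *
 * @return list<string>
 */
function extractImageUrls(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $urls = [];

    // The XPath union returns both attribute kinds in document order.
    foreach ($xpath->query('//img/@src | //img/@data-src') as $attribute) {
        $urls[] = $attribute->nodeValue;
    }

    // Drop duplicates while keeping a clean 0-based list.
    return array_values(array_unique($urls));
}

$html = '<html><body>'
    . '<img src="/logo.png">'
    . '<img data-src="/lazy.jpg" alt="lazy loaded">'
    . '<a href="/about">About</a>'
    . '</body></html>';

$imageUrls = extractImageUrls($html);
print_r($imageUrls);
```

The <a> tag in the sample is ignored by this helper, which mirrors the idea that each resource type looks at its own set of tags and attributes.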

Extracting all resource types

To extract everything at once, use extractAll:

Crawler::create('https://example.com')
    ->extractAll()
    ->onCrawled(function (string $url, CrawlResponse $response) {
        // $response->resourceType() tells you what kind of resource this is
    })
    ->start();

Resource types in observers

When using observers, the resource type is available through the CrawlResponse:

use Spatie\Crawler\CrawlObservers\CrawlObserver;
use Spatie\Crawler\CrawlProgress;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\Enums\ResourceType;

class AssetChecker extends CrawlObserver
{
    public function crawled(
        string $url,
        CrawlResponse $response,
        CrawlProgress $progress,
    ): void {
        if ($response->resourceType() === ResourceType::Image && $response->status() === 404) {
            echo "Broken image: {$url} (found on {$response->foundOnUrl()})\n";
        }
    }
}

Base href support

When extracting resources (images, scripts, stylesheets, and Open Graph images), the crawler respects the <base href> tag in the HTML. If a page contains <base href="https://example.com/assets/">, relative resource URLs will be resolved against that base URL instead of the page URL.

Links (<a> tags) also respect <base href> through Symfony's DomCrawler.
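
The effect of <base href> can be illustrated with a small, simplified resolver in plain PHP. This is a sketch under stated assumptions, not the crawler's resolution code: resolveUrl is a hypothetical helper that only handles absolute URLs, root-relative paths, and plain relative paths, and it drops trailing slashes and ignores protocol-relative or query-only references.

```php
<?php

/**
 * Hypothetical helper: resolve $relative against $base, roughly as a browser would.
 * Simplified on purpose; see the limitations described above.
 */
function resolveUrl(string $base, string $relative): string
{
    // A reference that already has a scheme is absolute: return it unchanged.
    if (is_string(parse_url($relative, PHP_URL_SCHEME))) {
        return $relative;
    }

    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (str_starts_with($relative, '/')) {
        // Root-relative: the base path is ignored entirely.
        $path = $relative;
    } else {
        // Keep the base path up to and including its last "/", then append.
        $basePath = $parts['path'] ?? '/';
        $directory = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $directory . $relative;
    }

    // Collapse "." and ".." segments and skip empty ones.
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }

    return $origin . '/' . implode('/', $segments);
}

$pageUrl = 'https://example.com/blog/post';
$baseHref = 'https://example.com/assets/';

// Without <base href>, the page URL is the reference point.
echo resolveUrl($pageUrl, 'img/logo.png'), "\n";

// With <base href="https://example.com/assets/">, the base URL is used instead.
echo resolveUrl($baseHref, 'img/logo.png'), "\n";
```

The same relative reference ends up under /blog/ in the first call and under /assets/ in the second, which is the difference the <base href> tag makes.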

Malformed URLs

When the crawler encounters a malformed URL in the HTML (for example, href="https:///invalid"), it reports the URL through your onFailed callback or your observer's crawlFailed method instead of silently ignoring it. The RequestException message contains the reason the URL could not be parsed.

use GuzzleHttp\Exception\RequestException;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProgress;

Crawler::create('https://example.com')
    ->onFailed(function (string $url, RequestException $exception, CrawlProgress $progress) {
        if (str_contains($exception->getMessage(), 'Malformed URL')) {
            echo "Found malformed URL: {$url}\n";
        }
    })
    ->start();
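
To see why a value such as https:///invalid counts as malformed, the sketch below uses PHP's built-in parse_url, which returns false for seriously malformed URLs. Whether the crawler relies on this exact check is an assumption; isParsableUrl is a hypothetical helper used only for illustration.

```php
<?php

/**
 * Hypothetical helper: report whether PHP's parse_url can make sense of a URL.
 * parse_url returns false for seriously malformed URLs, such as an
 * http(s) URL with an empty host.
 */
function isParsableUrl(string $url): bool
{
    return parse_url($url) !== false;
}

$candidates = [
    'https://example.com/page',
    'https:///invalid', // three slashes: the host part is empty
];

foreach ($candidates as $candidate) {
    $verdict = isParsableUrl($candidate) ? 'ok' : 'malformed';
    echo "{$candidate}: {$verdict}\n";
}
```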

Resource types in collected URLs

When using foundUrls(), each CrawledUrl includes the resource type:

$urls = Crawler::create('https://example.com')
    ->extractAll()
    ->foundUrls();

foreach ($urls as $crawledUrl) {
    echo "{$crawledUrl->resourceType->value}: {$crawledUrl->url} ({$crawledUrl->status})\n";
}
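
Building on the loop above, a common next step is to group failing URLs by resource type. The following is a minimal sketch that runs without the package installed: DemoResourceType and DemoCrawledUrl are hypothetical stand-ins that expose only the url, status, and resourceType properties read above, the enum's backing values are assumed, and brokenByType is an illustrative helper rather than part of the library.

```php
<?php

// Hypothetical stand-in for the package's resource type enum.
// The backing string values are assumptions made for this sketch.
enum DemoResourceType: string
{
    case Link = 'link';
    case Image = 'image';
    case Stylesheet = 'stylesheet';
}

// Hypothetical stand-in exposing the three properties the loop above reads.
final class DemoCrawledUrl
{
    public function __construct(
        public readonly string $url,
        public readonly int $status,
        public readonly DemoResourceType $resourceType,
    ) {
    }
}

/**
 * Group the URLs of failed responses (status 400 and up) by resource type value.
 *
 * @param iterable<DemoCrawledUrl> $urls
 * @return array<string, list<string>>
 */
function brokenByType(iterable $urls): array
{
    $broken = [];

    foreach ($urls as $crawledUrl) {
        if ($crawledUrl->status >= 400) {
            $broken[$crawledUrl->resourceType->value][] = $crawledUrl->url;
        }
    }

    return $broken;
}

$urls = [
    new DemoCrawledUrl('https://example.com/', 200, DemoResourceType::Link),
    new DemoCrawledUrl('https://example.com/missing.png', 404, DemoResourceType::Image),
    new DemoCrawledUrl('https://example.com/old.css', 410, DemoResourceType::Stylesheet),
    new DemoCrawledUrl('https://example.com/logo.png', 200, DemoResourceType::Image),
];

$report = brokenByType($urls);

foreach ($report as $type => $brokenUrls) {
    echo $type, ': ', implode(', ', $brokenUrls), "\n";
}
```

With the real package, the same grouping function could in principle be fed the result of foundUrls(), provided its items expose these properties as the loop above suggests.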