spatie/crawler
PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.
By default, the crawler only extracts links (`<a>` tags and some `<link>` tags) from each page. You can also instruct it to extract images, scripts, stylesheets, and Open Graph images. This is useful for broken asset checking, content auditing, or building a complete inventory of a site's resources.
Use the `alsoExtract` method to extract additional resource types alongside links:
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\Enums\ResourceType;

Crawler::create('https://example.com')
    ->alsoExtract(ResourceType::Image, ResourceType::Stylesheet)
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$response->resourceType()->value}: {$url}\n";
    })
    ->start();
The available resource types are:

| Type | What it extracts |
|---|---|
| `ResourceType::Link` | `<a>` tags, `<link rel="next/prev">`, `<link hreflang>` (always included) |
| `ResourceType::Image` | `<img src>` and `<img data-src>` (lazy-loaded images) |
| `ResourceType::Script` | `<script src>` and `<link rel="modulepreload">` |
| `ResourceType::Stylesheet` | `<link rel="stylesheet">`, `<link type="text/css">`, `<link as="style">` |
| `ResourceType::OpenGraphImage` | `<meta property="og:image">` and `<meta property="twitter:image">` |
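
As a rough illustration of what the table describes, the selectors can be written as XPath queries over a parsed page. This is a sketch built on PHP's bundled DOM extension, not the crawler's own extraction code. The array keys (`image`, `script`, and so on) and the function name `extractResourceUrls` are made up for the example, and the exact-match attribute tests are a simplification (a value such as `rel="stylesheet preload"` would be missed).

```php
<?php

declare(strict_types=1);

// Hypothetical mapping from a resource type label to an XPath expression.
// The labels are illustrative and are not the package's enum values.
const RESOURCE_XPATHS = [
    'image' => '//img/@src | //img/@data-src',
    'script' => '//script/@src | //link[@rel="modulepreload"]/@href',
    'stylesheet' => '//link[@rel="stylesheet" or @type="text/css" or @as="style"]/@href',
    'open_graph_image' => '//meta[@property="og:image" or @property="twitter:image"]/@content',
];

/**
 * Collect the raw (unresolved) resource URLs found in an HTML string.
 *
 * @return array<string, list<string>> URLs grouped by resource type label
 */
function extractResourceUrls(string $html): array
{
    // Real-world HTML often triggers libxml warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $found = [];

    foreach (RESOURCE_XPATHS as $type => $expression) {
        $found[$type] = [];
        // Each match is an attribute node; its value is the URL as written.
        foreach ($xpath->query($expression) as $attribute) {
            $found[$type][] = $attribute->value;
        }
    }

    return $found;
}
```

The returned URLs are exactly as they appear in the markup, so relative paths still need resolving before they can be requested.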
To extract everything at once, use `extractAll`:
Crawler::create('https://example.com')
    ->extractAll()
    ->onCrawled(function (string $url, CrawlResponse $response) {
        // $response->resourceType() tells you what kind of resource this is
    })
    ->start();
When using observers, the resource type is available through the `CrawlResponse`:
use Spatie\Crawler\CrawlObservers\CrawlObserver;
use Spatie\Crawler\CrawlProgress;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\Enums\ResourceType;

class AssetChecker extends CrawlObserver
{
    public function crawled(
        string $url,
        CrawlResponse $response,
        CrawlProgress $progress,
    ): void {
        if ($response->resourceType() === ResourceType::Image && $response->status() === 404) {
            echo "Broken image: {$url} (found on {$response->foundOnUrl()})\n";
        }
    }
}
When extracting resources (images, scripts, stylesheets, and Open Graph images), the crawler respects the `<base href>` tag in the HTML. If a page contains `<base href="https://example.com/assets/">`, relative resource URLs are resolved against that base URL instead of the page URL.
Links (`<a>` tags) also respect `<base href>` through Symfony's DomCrawler.
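
To make the resolution rule concrete, here is a minimal sketch of how a relative resource URL might be resolved against either a `<base href>` or the page URL. The function `resolveResourceUrl` is hypothetical and is not part of the package. It only covers absolute, protocol-relative, root-relative, and plain relative URLs, and it does not normalise `./` or `../` segments.

```php
<?php

declare(strict_types=1);

/**
 * Resolve $resourceUrl against $baseHref when the page declares one,
 * otherwise against $pageUrl. Illustrative only.
 */
function resolveResourceUrl(string $pageUrl, ?string $baseHref, string $resourceUrl): string
{
    // A URL that already carries a scheme is absolute and needs no resolving.
    if (is_string(parse_url($resourceUrl, PHP_URL_SCHEME))) {
        return $resourceUrl;
    }

    // Prefer <base href> when present, else fall back to the page URL.
    $base = parse_url($baseHref ?? $pageUrl);
    if ($base === false || ! isset($base['scheme'], $base['host'])) {
        throw new InvalidArgumentException('The base URL must be absolute.');
    }

    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative URL: reuse the scheme of the base.
    if (str_starts_with($resourceUrl, '//')) {
        return $base['scheme'] . ':' . $resourceUrl;
    }

    // Root-relative URL: keep the origin, replace the whole path.
    if (str_starts_with($resourceUrl, '/')) {
        return $origin . $resourceUrl;
    }

    // Plain relative URL: replace everything after the last "/" of the base path.
    $path = $base['path'] ?? '/';
    $directory = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $directory . $resourceUrl;
}

// With <base href="https://example.com/assets/"> on the page:
echo resolveResourceUrl('https://example.com/blog/post', 'https://example.com/assets/', 'img/logo.png'), "\n";
// https://example.com/assets/img/logo.png

// Without a <base> tag, the page URL is the reference point:
echo resolveResourceUrl('https://example.com/blog/post', null, 'img/logo.png'), "\n";
// https://example.com/blog/img/logo.png
```

The two calls differ only in whether a base is supplied, which is the behaviour the paragraph above describes.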
When the crawler encounters a malformed URL in the HTML (for example, `href="https:///invalid"`), it reports it through your `onFailed` callback or your observer's `crawlFailed` method instead of silently ignoring it. The `RequestException` message contains the reason the URL could not be parsed.
use GuzzleHttp\Exception\RequestException;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProgress;

Crawler::create('https://example.com')
    ->onFailed(function (string $url, RequestException $exception, CrawlProgress $progress) {
        if (str_contains($exception->getMessage(), 'Malformed URL')) {
            echo "Found malformed URL: {$url}\n";
        }
    })
    ->start();
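
To see why a value like `https:///invalid` counts as malformed, PHP's built-in `parse_url` is a convenient stand-in: it returns `false` for URLs it cannot make sense of, such as one with a scheme but an empty host. Whether the crawler relies on `parse_url` internally is an assumption here, and `isParseableUrl` is a hypothetical helper, not a package function.

```php
<?php

declare(strict_types=1);

/**
 * Report whether PHP's parse_url() can split the URL into components.
 * parse_url() returns false for seriously malformed input.
 */
function isParseableUrl(string $url): bool
{
    return parse_url($url) !== false;
}

var_dump(isParseableUrl('https://example.com/page')); // bool(true)
var_dump(isParseableUrl('https:///invalid'));         // bool(false): the host is empty
```

A check like this only catches structural problems. A URL can parse cleanly and still fail when requested.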
When using `foundUrls()`, each `CrawledUrl` includes the resource type:
$urls = Crawler::create('https://example.com')
    ->extractAll()
    ->foundUrls();

foreach ($urls as $crawledUrl) {
    echo "{$crawledUrl->resourceType->value}: {$crawledUrl->url} ({$crawledUrl->status})\n";
}
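
Building on the loop above, one possible way to turn the collected URLs into a broken-asset report is to group failing URLs by resource type. The sketch below reads only the three properties shown above (`url`, `status`, `resourceType->value`). `StubResourceType` and `StubCrawledUrl` are hypothetical stand-ins so the example runs without the package installed, and treating `status` as an integer HTTP code is an assumption.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins for the package's enum and value object.
enum StubResourceType: string
{
    case Link = 'link';
    case Image = 'image';
}

final class StubCrawledUrl
{
    public function __construct(
        public readonly string $url,
        public readonly int $status, // assumed to be an integer HTTP status code
        public readonly StubResourceType $resourceType,
    ) {
    }
}

/**
 * Group URLs with a 4xx or 5xx status by their resource type value.
 *
 * @param iterable<object> $crawledUrls objects exposing url, status and resourceType
 * @return array<string, list<string>>
 */
function brokenUrlsByType(iterable $crawledUrls): array
{
    $broken = [];

    foreach ($crawledUrls as $crawledUrl) {
        if ($crawledUrl->status >= 400) {
            $broken[$crawledUrl->resourceType->value][] = $crawledUrl->url;
        }
    }

    return $broken;
}

$report = brokenUrlsByType([
    new StubCrawledUrl('https://example.com/', 200, StubResourceType::Link),
    new StubCrawledUrl('https://example.com/missing.png', 404, StubResourceType::Image),
]);

print_r($report);
```

With the real package, the result of `foundUrls()` would take the place of the stub array, provided its items expose the same properties.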