Code Weaver
Helps Laravel developers discover, compare, and choose open-source packages. See popularity, security, maintainers, and scores at a glance to make better decisions.
Crawler Laravel Package

spatie/crawler

Fast, concurrent web crawler for PHP. Crawl sites, collect internal URLs with depth limits, and hook into crawl events. Can execute JavaScript via Chrome/Puppeteer for rendered pages. Supports mocked HTTP responses so crawl logic can be tested without real requests.

View on GitHub
Deep Wiki
Context7

spatie/crawler is a fast, flexible PHP web crawler for discovering and processing links on a site. It uses Guzzle promises to crawl multiple URLs concurrently, making it suitable for large link graphs and automated site audits.

Need to crawl modern frontends? The crawler can execute JavaScript-rendered pages via Chrome + Puppeteer (through Browsershot), and you can swap in a Guzzle mock handler to test crawl logic without real HTTP requests.

  • Concurrent crawling with configurable throughput
  • Crawl callbacks to inspect each URL and response status/body
  • Filter to internal-only, set depth limits, and collect discovered URLs
  • JavaScript execution for SPA/SSR sites using headless Chrome
  • Deterministic, network-free testing via Guzzle mock handlers
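The basic flow can be sketched as follows. Class names and method signatures follow recent (v7/v8) releases of spatie/crawler; `SiteAuditObserver` is a hypothetical name, so check your installed version before copying.

```php
<?php

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Observer that records every successfully crawled URL.
class SiteAuditObserver extends CrawlObserver
{
    public array $urls = [];

    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        $this->urls[] = (string) $url;
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Ignore failures in this sketch.
    }
}

$observer = new SiteAuditObserver();

Crawler::create()
    ->setCrawlObserver($observer)
    ->setConcurrency(10)              // up to 10 requests in flight at once
    ->startCrawling('https://example.com');

// $observer->urls now holds every URL that was crawled.
```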
Frequently asked questions about Crawler
How do I crawl a Laravel site to extract all internal URLs with depth limits?
Combine a crawl profile with a depth limit: pass `new CrawlInternalUrls($baseUrl)` to `setCrawlProfile()` so the crawler stays on your host, call `setMaximumDepth(3)` to stop three levels deep, and collect the discovered URLs in your `CrawlObserver`. This is useful for site mapping or SEO audits.
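A minimal sketch of that setup, assuming recent package versions; `CollectUrlsObserver` stands in for any `CrawlObserver` subclass you define:

```php
<?php

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

Crawler::create()
    ->setCrawlProfile(new CrawlInternalUrls('https://example.com')) // skip external hosts
    ->setMaximumDepth(3)                                           // stop 3 links from the start URL
    ->setCrawlObserver(new CollectUrlsObserver())                  // hypothetical observer that records URLs
    ->startCrawling('https://example.com');
```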
Can spatie/crawler handle JavaScript-rendered pages like React or Angular apps?
Yes. Call `executeJavaScript()` on the crawler and it will render each page in headless Chrome via Browsershot before extracting links (this requires Node, Puppeteer, and Chrome on the host). This makes it suitable for React, Vue, or Angular apps that render content client-side.
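Enabling it is a one-line change on the crawler; `CollectUrlsObserver` below is a hypothetical observer class:

```php
<?php

use Spatie\Crawler\Crawler;

// Requires spatie/browsershot plus Node, Puppeteer, and Chrome on the host.
Crawler::create()
    ->executeJavaScript()                         // render each page in headless Chrome first
    ->setCrawlObserver(new CollectUrlsObserver()) // hypothetical observer
    ->startCrawling('https://spa.example.com');
```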
What Laravel versions does spatie/crawler support?
spatie/crawler is a framework-agnostic PHP package, so it is not tied to specific Laravel versions; it runs in any Laravel app whose PHP version meets the package's requirement. Check the package's `composer.json` on GitHub or Packagist for the currently supported PHP versions.
How do I test crawl logic without hitting real APIs?
The crawler forwards the options you pass to `Crawler::create()` to its Guzzle client, so you can install a Guzzle `MockHandler` that serves canned responses. Queue one fake response per request the crawl will make, and your observers run without any network traffic.
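A sketch of that approach, assuming the client options array is forwarded to Guzzle as in recent versions (`CollectUrlsObserver` is a hypothetical observer class):

```php
<?php

use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;
use Spatie\Crawler\Crawler;

// One canned response per request the crawler will make, in order.
$mock = new MockHandler([
    new Response(200, ['Content-Type' => 'text/html'], '<a href="/about">About</a>'),
    new Response(200, ['Content-Type' => 'text/html'], '<p>About page</p>'),
]);

Crawler::create(['handler' => HandlerStack::create($mock)])
    ->ignoreRobots()                              // avoid a robots.txt fetch consuming a mocked response
    ->setCrawlObserver(new CollectUrlsObserver()) // hypothetical observer
    ->startCrawling('https://example.com');
```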
Is spatie/crawler suitable for large-scale crawls (e.g., 10,000+ pages)?
Yes, with tuning. Control throughput with `setConcurrency()`, and cap or chunk the work with `setTotalCrawlLimit()` and `setCurrentCrawlLimit()`; the current-limit approach lets you crawl in batches and resume across queued jobs. Monitor memory usage, and be especially careful with `executeJavaScript()`, since headless Chrome is resource-intensive.
How do I integrate crawled data into Laravel’s database?
Implement a `CrawlObserver` subclass whose `crawled()` method persists each response through Eloquent, then register it with `setCrawlObserver()`. This works well for SEO tools or content archives.
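An observer along these lines could persist pages; `CrawledPage` is a hypothetical Eloquent model with `url` and `content` columns, and signatures follow recent package versions:

```php
<?php

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class StorePagesObserver extends CrawlObserver
{
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // CrawledPage is a hypothetical Eloquent model.
        CrawledPage::updateOrCreate(
            ['url' => (string) $url],
            ['content' => (string) $response->getBody()],
        );
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Optionally log failed URLs here.
    }
}
```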
What’s the best way to avoid rate-limiting or IP bans while crawling?
The crawler respects `robots.txt` by default. Add a delay between requests with `setDelayBetweenRequests()` (note: milliseconds, so `setDelayBetweenRequests(2000)` waits two seconds), reduce `setConcurrency()`, and identify yourself honestly with `setUserAgent()`. For very large crawls, consider proxy rotation, and apply exponential backoff when requests fail.
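A polite-crawl configuration might look like this; the user-agent string and `CollectUrlsObserver` class are hypothetical:

```php
<?php

use Spatie\Crawler\Crawler;

Crawler::create()
    ->setConcurrency(1)                       // one request at a time
    ->setDelayBetweenRequests(2000)           // 2000 ms between requests
    ->setUserAgent('MyAuditBot/1.0 (+https://example.com/bot-info)')
    ->setCrawlObserver(new CollectUrlsObserver()) // hypothetical observer
    ->startCrawling('https://example.com');
```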
Can I use spatie/crawler in Laravel Artisan commands?
Absolutely. Create a custom Artisan command (e.g., `php artisan crawl:seo`) and instantiate the crawler inside it. This is useful for CLI-driven tasks like scheduled site audits or data extraction pipelines.
Are there alternatives to spatie/crawler for Laravel?
For HTTP-only crawling you can build directly on Guzzle, and `symfony/panther` covers full browser automation. Laravel Scout is not a crawler, though it can index content you have already stored in Eloquent models. spatie/crawler stands out for its concurrency, crawl profiles, and optional JavaScript rendering.
How do I handle dynamic content that breaks between crawls (e.g., session-dependent pages)?
Use JavaScript execution with a custom Browsershot instance passed via `setBrowsershot()`. Browsershot supports HTTP basic auth through `authenticate('user', 'pass')` and can send cookies with each render, so you can establish a session before crawling. For session persistence, store the cookies once and replay them on every request.
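A sketch of wiring an authenticated Browsershot instance into the crawler, assuming `setBrowsershot()` is available in your installed version (`CollectUrlsObserver` and the credentials are placeholders):

```php
<?php

use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

// Browsershot instance preconfigured with HTTP basic auth.
$browsershot = (new Browsershot())->authenticate('user', 'pass');

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()                         // render pages with the configured instance
    ->setCrawlObserver(new CollectUrlsObserver()) // hypothetical observer
    ->startCrawling('https://example.com');
```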