- How do I crawl a Laravel site to extract all internal URLs with depth limits?
- Use a crawl profile together with a depth limit. Pass `new CrawlInternalUrls($baseUrl)` to `setCrawlProfile()` so only internal links are followed, and call `setMaximumDepth(3)` to stop three levels deep. The crawler reports each visited page through a `CrawlObserver` you register with `setCrawlObserver()`; collect URLs there. This is useful for site mapping or SEO audits.
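A sketch of how this looks with the package's API (class names as in recent spatie/crawler releases; observer method signatures vary slightly between major versions):

```php
<?php

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

// Collects every internal URL the crawler visits, up to 3 levels deep.
class UrlCollector extends CrawlObserver
{
    public array $urls = [];

    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        $this->urls[] = (string) $url;
    }

    public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        // Log or skip URLs that could not be fetched.
    }
}

$observer = new UrlCollector();

Crawler::create()
    ->setCrawlProfile(new CrawlInternalUrls('https://example.com'))
    ->setMaximumDepth(3)
    ->setCrawlObserver($observer)
    ->startCrawling('https://example.com');

// $observer->urls now holds every internal URL found.
```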
- Can spatie/crawler handle JavaScript-rendered pages like React or Angular apps?
- Yes. Call `executeJavaScript()` on the crawler, which renders each page in headless Chrome via Puppeteer (through spatie/browsershot) before extracting links. This is ideal for SPAs built with React, Vue, or Angular. Note that Node, Puppeteer, and Chrome must be installed on the host for this to work.
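A minimal sketch, assuming Browsershot's prerequisites are installed; `MyObserver` stands in for any `CrawlObserver` subclass of your own:

```php
<?php

use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

Crawler::create()
    ->executeJavaScript() // render each page in headless Chrome before extracting links
    ->setBrowsershot((new Browsershot())->noSandbox()) // optional: customize the Browsershot instance
    ->setCrawlObserver(new MyObserver()) // your own CrawlObserver subclass
    ->startCrawling('https://example.com');
```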
- What Laravel versions does spatie/crawler support?
- The package itself has no Laravel dependency: it is a framework-agnostic PHP library built on Guzzle, so it works in any Laravel version your PHP version supports. Check the package's `composer.json` for the minimum PHP requirement of the release you install.
- How do I test crawl logic without hitting real APIs?
- The crawler has no built-in `fake()` helper, but `Crawler::create()` accepts an array of Guzzle client options, so you can inject a Guzzle `MockHandler` that returns canned HTML. Your observers then run against those responses without any network requests. (The package's own test suite takes a different approach and crawls a small local test server, which is another option.)
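A sketch of the mock-handler approach; `MyObserver` is a placeholder for your own `CrawlObserver` subclass:

```php
<?php

use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;
use Spatie\Crawler\Crawler;

// Queue canned responses; the crawler never touches the network.
$mock = new MockHandler([
    new Response(200, ['Content-Type' => 'text/html'], '<html><a href="/about">About</a></html>'),
    new Response(200, ['Content-Type' => 'text/html'], '<html>About page</html>'),
]);

Crawler::create(['handler' => HandlerStack::create($mock)])
    ->setCrawlObserver(new MyObserver()) // assert on what the observer saw
    ->startCrawling('https://example.com');
```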
- Is spatie/crawler suitable for large-scale crawls (e.g., 10,000+ pages)?
- Yes, but it requires configuration. Raise `setConcurrency()` to fetch several URLs in parallel, and swap the default in-memory queue for a persistent `CrawlQueue` implementation (e.g. backed by Redis or the database) via `setCrawlQueue()`, so a crawl can be resumed or split into batches with `setCurrentCrawlLimit()` across Laravel queue workers. Monitor memory usage, especially with `executeJavaScript()` enabled, since Puppeteer is resource-intensive.
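A hedged sketch of a chunked run; `$persistentQueue` is a hypothetical custom implementation of the package's `CrawlQueue` interface, and `MyObserver` a placeholder observer:

```php
<?php

use Spatie\Crawler\Crawler;

Crawler::create()
    ->setConcurrency(10)              // requests in flight at once
    ->setCurrentCrawlLimit(500)       // crawl at most 500 URLs in this run, then return
    ->setCrawlQueue($persistentQueue) // Redis/DB-backed queue so the next run resumes where this one stopped
    ->setCrawlObserver(new MyObserver())
    ->startCrawling('https://example.com');
```

Re-dispatching a Laravel queue job that repeats this call until the queue is empty lets the full crawl span many short worker runs.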
- How do I integrate crawled data into Laravel’s database?
- Register a `CrawlObserver` with `setCrawlObserver()` and persist each response from its `crawled()` method via Eloquent. The observer receives the URL and a PSR-7 response, so the page body is available as `(string) $response->getBody()`. This works well for SEO tools or content archives.
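A sketch of such an observer; `CrawledPage` is a hypothetical Eloquent model with `url` and `content` columns:

```php
<?php

use App\Models\CrawledPage; // hypothetical Eloquent model
use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class PersistPagesObserver extends CrawlObserver
{
    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        // Store each successfully fetched page.
        CrawledPage::create([
            'url'     => (string) $url,
            'content' => (string) $response->getBody(),
        ]);
    }

    public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        // Record or ignore failures as needed.
    }
}

Crawler::create()
    ->setCrawlObserver(new PersistPagesObserver())
    ->startCrawling('https://example.com');
```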
- What’s the best way to avoid rate-limiting or IP bans while crawling?
- Respect `robots.txt` (the crawler honors it by default) and add a pause between requests with `setDelayBetweenRequests(2000)` (milliseconds). Set an honest user agent with `setUserAgent('...')`, keep `setConcurrency()` low, and consider proxy rotation for large crawls. Retrying failed URLs with exponential backoff is also recommended.
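Put together, a polite configuration might look like this (`MyObserver` again standing in for your own observer):

```php
<?php

use Spatie\Crawler\Crawler;

Crawler::create()
    ->respectRobots()                // on by default; shown here for clarity
    ->setDelayBetweenRequests(2000)  // 2 seconds between requests (value is in ms)
    ->setUserAgent('MyCrawler/1.0 (+https://example.com/bot)') // identify yourself
    ->setConcurrency(1)              // one request at a time is gentlest on the target
    ->setCrawlObserver(new MyObserver())
    ->startCrawling('https://example.com');
```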
- Can I use spatie/crawler in Laravel Artisan commands?
- Absolutely. Create a custom Artisan command (e.g., `php artisan crawl:seo`) and run the crawler inside its `handle()` method. This is useful for CLI-driven tasks like scheduled site audits or data-extraction pipelines.
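A minimal command sketch; `MyObserver` is a placeholder for your own `CrawlObserver` subclass:

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Spatie\Crawler\Crawler;

class CrawlSeo extends Command
{
    protected $signature = 'crawl:seo {url}';
    protected $description = 'Crawl a site for an SEO audit';

    public function handle(): int
    {
        Crawler::create()
            ->setCrawlObserver(new MyObserver())
            ->startCrawling($this->argument('url'));

        $this->info('Crawl finished.');

        return self::SUCCESS;
    }
}
```

Once registered, it can be scheduled like any other command, e.g. nightly via Laravel's scheduler.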
- Are there alternatives to spatie/crawler for Laravel?
- For HTTP-only crawling, consider Guzzle with custom link-extraction logic; for full browser automation, look at Symfony Panther or Laravel Dusk. However, spatie/crawler stands out for its concurrency (async Guzzle requests), pluggable crawl profiles and queues, and optional JavaScript rendering.
- How do I handle dynamic content that breaks between crawls (e.g., session-dependent pages)?
- Use Puppeteer's headless Chrome to authenticate or set cookies before crawling. Configure the `Browsershot` instance the crawler uses via `setBrowsershot()`: Browsershot supports HTTP basic auth via `authenticate('user', 'pass')` and cookie injection via `useCookies()`. For session persistence, store cookies from a login step and replay them on every request.
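A sketch combining both options; the cookie name and value are illustrative, and `MyObserver` stands in for your own `CrawlObserver` subclass:

```php
<?php

use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

$browsershot = (new Browsershot())
    ->authenticate('user', 'pass')                        // HTTP basic auth for protected pages
    ->useCookies(['session' => 'abc123'], 'example.com'); // replay a stored session cookie

Crawler::create()
    ->executeJavaScript()          // required so Browsershot renders each page
    ->setBrowsershot($browsershot) // every crawled page now carries the auth/cookies
    ->setCrawlObserver(new MyObserver())
    ->startCrawling('https://example.com');
```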