- How do I integrate Spatie Crawler into a Laravel application for concurrent scraping?
- Use `Crawler::create('url')->dispatch()` to push crawl jobs onto a Laravel queue (for example, Redis queues managed by Horizon). Set the number of concurrent requests with `->concurrency(10)` and scale out by running more queue workers. To process pages as they are crawled, use `onCrawled` callbacks to store results in Eloquent models or trigger events.
- Can Spatie Crawler handle JavaScript-heavy sites like React or Angular SPAs?
- Yes, enable Puppeteer/Chrome rendering via `->withBrowser()` or `->withPuppeteer()`. This requires Node.js and a Chrome binary, which are often packaged with Docker. Test against a sample of target pages first, since complex SPAs may need longer timeouts or extra headless Chrome flags.
- What Laravel versions does Spatie Crawler support, and are there breaking changes?
- Supports Laravel 8.x–10.x. Check the [changelog](https://github.com/spatie/crawler/blob/main/CHANGELOG.md) for version-specific updates. Breaking changes are rare but may affect Puppeteer or Guzzle dependencies. Always update dependencies incrementally in a staging environment.
- How do I test crawl logic without hitting external APIs during development?
- Use the `->fake()` method to mock responses with static HTML. Example: `->fake(['url' => '<html>...</html>'])`. This bypasses network requests entirely, so unit tests stay fast and deterministic. Laravel’s `Http::fake()` only intercepts requests made through Laravel’s HTTP client, so use it alongside `->fake()` to cover any other HTTP calls your code makes.
- What’s the best way to store scraped data in Laravel using Spatie Crawler?
- Use `onCrawled` callbacks to save data to Eloquent models or to Laravel’s filesystem. For large crawls, insert rows in batches inside `DB::transaction()`, or hand the writes to queued jobs so that no single job times out. Validate scraped data with Laravel’s validation rules before storing it.
- How do I avoid rate limiting or IP bans when crawling large sites?
- Respect `robots.txt` by checking `CrawlResponse::isRobotsTxtDisallowed()`. Add delays with `->crawlDelay(2)` (seconds) and rotate IPs using Laravel’s HTTP client middleware or packages like `spatie/proxy`. Monitor failed requests with `onFailed` callbacks.
- Can I run Spatie Crawler in a distributed Laravel setup (e.g., multiple queue workers)?
- Yes, but manage crawl state carefully. Closures (e.g., `shouldStopCallback`) live inside a single PHP process and are not shared between workers, so keep state in the database (e.g., a `crawls` table) or in Laravel’s cache. For large crawls, split URLs across workers with queue batching and set `->concurrency()` per worker.
- What are the resource requirements for JavaScript rendering with Puppeteer?
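- Each Puppeteer instance consumes ~200MB RAM, so five concurrent instances need roughly 1GB for Chrome alone. For high-volume crawls, limit concurrency (e.g., `->concurrency(5)`) or use Docker to isolate Chrome instances. Monitor PHP memory with `memory_get_usage()` or tools like Blackfire; Chrome runs as a separate process, so track its memory at the OS or container level. Consider serverless options (e.g., AWS Lambda) for sporadic crawls.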
- How do I debug failed crawls or Puppeteer errors in production?
- Use `onFailed` callbacks to log errors to Laravel Telescope or Monolog. For Puppeteer issues, check Chrome logs via `->withPuppeteerOptions(['args' => ['--log-level=debug']])`. Enable verbose Guzzle logging with `->withGuzzleOptions(['debug' => true])` during development.
- Are there alternatives to Spatie Crawler for Laravel, and when should I choose them?
- For simple HTML scraping, consider `symfony/dom-crawler` or `php-crawler/php-crawler`. For headless browsing, `spatie/browsershot` (Puppeteer-only) is lighter but lacks crawling features. Use Spatie Crawler if you need concurrent requests, depth control, and JS rendering in a Laravel-native way.
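
For the distributed-workers question above, the database-backed state can be sketched in plain PHP. This is a minimal illustration and not part of Spatie Crawler's API: the `claimUrl()` helper, the `crawls` table layout, and the in-memory SQLite database are all assumptions made for the example, and it presumes the `pdo_sqlite` extension is installed. In a Laravel app the same idea would typically be a migration with a unique index on the URL column.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: claim a URL so that only one worker crawls it.
 * Relies on a uniqueness constraint on `url`; the table and column
 * names are assumptions for illustration.
 */
function claimUrl(PDO $db, string $url): bool
{
    $stmt = $db->prepare('INSERT INTO crawls (url, status) VALUES (:url, :status)');

    try {
        $stmt->execute([':url' => $url, ':status' => 'pending']);
        return true; // Row inserted: this worker owns the URL.
    } catch (PDOException $e) {
        // SQLSTATE class 23 means an integrity constraint was violated,
        // i.e. another worker already claimed this URL.
        if (substr((string) $e->getCode(), 0, 2) === '23') {
            return false;
        }
        throw $e; // Anything else is a real error.
    }
}

// Stand-in for the shared database that all queue workers would connect to.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE crawls (url TEXT PRIMARY KEY, status TEXT NOT NULL)');

$first  = claimUrl($db, 'https://example.com/'); // first worker claims the URL
$second = claimUrl($db, 'https://example.com/'); // duplicate claim is rejected
```

Because the database enforces uniqueness, the second attempt on the same URL fails regardless of which worker makes it, so no closure or in-process state has to be shared.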