Crawler Laravel Package

spatie/crawler

PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.

Technical Evaluation

Architecture Fit

  • Strengths:

    • Decoupled Design: The crawler operates independently of business logic, making it ideal for modular systems (e.g., SEO tools, content aggregation, or link analysis).
    • Event-Driven: Callbacks and observers enable integration with Laravel’s event system (e.g., Event::dispatch()) or queued jobs (e.g., classes implementing ShouldQueue).
    • JavaScript Support: Leverages spatie/browsershot (Puppeteer/Chrome) for SPAs or dynamic content, critical for modern web scraping.
    • Progress Tracking: Built-in CrawlProgress and FinishReason align with Laravel’s logging (e.g., Log::info()) and monitoring needs.
    • Testing-First: Fake responses simplify unit/integration testing, reducing reliance on external APIs during development.
  • Weaknesses:

    • No Native Laravel Integration: Requires manual setup (e.g., service providers, config files) for Laravel-specific features (e.g., caching, queues).
    • Resource Intensive: Concurrent requests (Guzzle promises) and JS rendering may strain shared hosting or low-memory environments.
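    • Rate Limiting: Throttling is basic. Earlier releases document a fixed delay between requests (setDelayBetweenRequests()); adaptive behavior such as backing off on HTTP 429 must still be implemented in callbacks or observers.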

Integration Feasibility

  • Laravel Stack Compatibility:
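    • HTTP Client: Uses Guzzle directly, the same library that sits under Laravel’s HTTP client, so it coexists with the Http facade without an extra HTTP dependency.
    • Queues: Can integrate with Laravel Queues (e.g., Crawler::create()->onCrawled(fn (string $url) => ProcessUrlJob::dispatch($url))).
    • Cache: Results can be cached with Laravel’s cache (e.g., wrapping foundUrls() in Cache::remember()); the package itself does not manage a Laravel cache store.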
    • Database: Results can be stored in Eloquent models or database tables (e.g., urls table with url, status, last_crawled_at).
  • Third-Party Dependencies:
    • Puppeteer/Chrome: Requires Docker or system-level Chrome installation for JS rendering (adds deployment complexity).
    • Guzzle: Already a Laravel dependency (no additional setup).

Technical Risk

  • High:
    • Scalability: Concurrent requests may hit API rate limits or server resource caps (e.g., max_execution_time in PHP).
    • JS Rendering Overhead: Puppeteer adds ~500MB+ memory per instance; may need Kubernetes/container orchestration for large crawls.
    • Maintenance: Custom observers/callbacks may diverge from package updates (e.g., breaking changes in CrawlResponse).
  • Mitigation:
    • Run crawls in queue workers (queue:work) so they are not bound by web request time limits, and throttle inside callbacks (e.g., with sleep() or usleep()).
    • Containerize Puppeteer (e.g., Docker) for isolation.
    • Abstract crawler logic into a service class to isolate changes.
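
The throttling mitigation above can be sketched without any framework. The class below is a hypothetical helper (the name MinIntervalThrottle and its reserve() method are assumptions, not part of spatie/crawler) that reports how long a caller should wait so that requests stay at least a fixed interval apart. It is a minimal sketch, not a tuned rate limiter.

```php
<?php

declare(strict_types=1);

// Hypothetical helper, not part of spatie/crawler: enforces a minimum
// interval between requests by telling the caller how long to wait.
final class MinIntervalThrottle
{
    private ?float $lastRequestAt = null;

    public function __construct(private float $minIntervalSeconds)
    {
    }

    // Returns the number of seconds to wait before the next request and
    // records the time at which that request is expected to be sent.
    public function reserve(float $now): float
    {
        if ($this->lastRequestAt === null) {
            $this->lastRequestAt = $now;

            return 0.0;
        }

        $earliest = $this->lastRequestAt + $this->minIntervalSeconds;
        $wait = max(0.0, $earliest - $now);
        $this->lastRequestAt = $now + $wait;

        return $wait;
    }
}

// Usage: wait for the reserved slot before sending a request.
$throttle = new MinIntervalThrottle(0.5);
$wait = $throttle->reserve(microtime(true));
usleep((int) round($wait * 1_000_000));
```

Passing the clock value in as an argument keeps the helper deterministic, which makes it easy to test alongside the package's fake mode.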

Key Questions

  1. Use Case Clarity:
    • Is this for internal (e.g., site health checks) or external (e.g., competitor scraping) crawling?
    • Are there legal/ethical concerns (e.g., robots.txt, terms of service)?
  2. Scale Requirements:
    • What’s the target crawl depth and concurrency level?
    • Is persistence needed (e.g., storing results in a database)?
  3. Laravel-Specific Needs:
    • Should results trigger Laravel events (e.g., UrlCrawled) or notifications?
    • Will crawls run in background jobs or scheduled tasks?
  4. Failure Handling:
    • How should failed requests be retried (e.g., exponential backoff)?
    • Are there SLA requirements for crawl completion?
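
For the retry question, exponential backoff can be sketched as a small pure function. The function name, the one-second base, and the 60-second cap below are illustrative assumptions; the right values depend on the target site's limits.

```php
<?php

declare(strict_types=1);

// Hypothetical helper: delay in seconds before retry number $attempt
// (1-based), doubling each time and capped at $maxSeconds.
function backoffDelaySeconds(int $attempt, int $baseSeconds = 1, int $maxSeconds = 60): int
{
    if ($attempt < 1) {
        throw new InvalidArgumentException('Attempt numbers start at 1.');
    }

    // Clamp the exponent so the multiplication stays well within int range.
    $exponent = min($attempt - 1, 20);

    return min($maxSeconds, $baseSeconds * (2 ** $exponent));
}

// Delays for the first five retries with the default settings.
$delays = array_map(fn (int $n): int => backoffDelaySeconds($n), range(1, 5));
```

In a queued job, a list like $delays could be returned from the job's backoff() method, which Laravel interprets in seconds.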

Integration Approach

Stack Fit

  • Laravel Integration Points:
    • Service Provider: Register the crawler as a singleton with Laravel’s container:
      $this->app->singleton(Crawler::class, fn () => Crawler::create());
      
    • Config File: Add crawler settings (e.g., config/crawler.php) for:
      'default_depth' => 3,
      'max_concurrency' => 10,
      'use_javascript' => env('CRAWLER_USE_JS', false),
      
    • Artisan Command: Create a crawl:run command for CLI execution:
      php artisan crawl:run https://example.com --depth=2
      
    • Queue Jobs: Dispatch crawls to queues for async processing:
      CrawlerJob::dispatch('https://example.com')->onQueue('crawls');
      
  • Database Storage:
    • Use Eloquent models to store crawled URLs, statuses, and metadata:
      class CrawledUrl extends Model {
          protected $fillable = ['url', 'status_code', 'last_crawled_at', 'depth'];
      }
      
    • Example observer:
      public function crawled(string $url, CrawlResponse $response) {
          CrawledUrl::updateOrCreate(
              ['url' => $url],
              ['status_code' => $response->status(), 'last_crawled_at' => now()]
          );
      }
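
The config keys listed under Config File above could be resolved from environment values roughly as follows. This is a framework-free sketch: resolveCrawlerConfig() and the CRAWLER_DEPTH variable are hypothetical, while CRAWLER_USE_JS and CRAWLER_CONCURRENCY are the variable names used elsewhere in this evaluation. In a real Laravel app, env() inside config/crawler.php would do the same job.

```php
<?php

declare(strict_types=1);

// Hypothetical resolver: merges the defaults listed above with overrides
// taken from an environment-style array of strings.
function resolveCrawlerConfig(array $env): array
{
    return [
        'default_depth' => (int) ($env['CRAWLER_DEPTH'] ?? 3),
        // Concurrency below 1 makes no sense, so clamp it.
        'max_concurrency' => max(1, (int) ($env['CRAWLER_CONCURRENCY'] ?? 10)),
        // Accepts "true", "1", "on", "yes" (case-insensitive) as true.
        'use_javascript' => filter_var(
            $env['CRAWLER_USE_JS'] ?? false,
            FILTER_VALIDATE_BOOLEAN
        ),
    ];
}

$config = resolveCrawlerConfig([
    'CRAWLER_CONCURRENCY' => '5',
    'CRAWLER_USE_JS' => 'true',
]);
```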
      

Migration Path

  1. Phase 1: Proof of Concept
    • Test basic crawling with fake responses (no external HTTP calls).
    • Validate foundUrls() and onCrawled callbacks.
  2. Phase 2: Laravel Integration
    • Add service provider/config.
    • Implement a simple observer to log results.
  3. Phase 3: Scaling
    • Add queue support for async crawls.
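    • Implement retries for failed requests (e.g., queued jobs with $tries and backoff). spatie/laravel-activitylog can record an audit trail of attempts, but it does not perform retries.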
  4. Phase 4: Productionization
    • Containerize Puppeteer (if using JS rendering).
    • Add monitoring (e.g., Laravel Horizon for queue stats).

Compatibility

  • Laravel Versions: Tested with Laravel 8+ (PHP 8.0+). May require adjustments for older versions.
  • PHP Extensions: Requires dom, fileinfo, and mbstring (standard in Laravel).
  • Dependencies:
    • guzzlehttp/guzzle (Laravel dependency).
    • spatie/browsershot (only if using JS rendering; ~200MB overhead).

Sequencing

  1. Setup:
    • Install package: composer require spatie/crawler.
    • Add config: create config/crawler.php by hand. The package ships no Laravel service provider, so there is nothing for php artisan vendor:publish to publish.
  2. Development:
    • Use fake() for testing.
    • Implement observers for logging/storage.
  3. Deployment:
    • Configure environment variables (e.g., CRAWLER_CONCURRENCY=5).
    • Set up cron jobs or Laravel schedules for periodic crawls:
      $schedule->command('crawl:run https://example.com --depth=2')->daily();
      

Operational Impact

Maintenance

  • Pros:
    • Minimal Boilerplate: Observers and callbacks centralize logic.
    • Testability: Fake responses enable CI/CD-friendly testing.
  • Cons:
    • Observer Sprawl: Custom observers may become unwieldy without clear separation (e.g., LoggingObserver, DatabaseObserver).
    • Dependency Updates: spatie/crawler and spatie/browsershot may introduce breaking changes.
  • Mitigation:
    • Keep crawler wiring in a dedicated service provider (e.g., one extending PackageServiceProvider from spatie/laravel-package-tools) to isolate crawler logic.
    • Document observer contracts (e.g., interface CrawlObserverContract).
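
One way to contain observer sprawl is a composite that fans each event out to small single-purpose observers. Everything in this sketch is hypothetical: the CrawlObserverContract interface takes its name from the bullet above, its crawled() signature is simplified to a URL and a status code, and none of these classes come from spatie/crawler.

```php
<?php

declare(strict_types=1);

// Hypothetical contract; the real package defines its own observer types.
interface CrawlObserverContract
{
    public function crawled(string $url, int $statusCode): void;
}

final class LoggingObserver implements CrawlObserverContract
{
    /** @var string[] */
    public array $lines = [];

    public function crawled(string $url, int $statusCode): void
    {
        $this->lines[] = sprintf('%d %s', $statusCode, $url);
    }
}

final class CollectingObserver implements CrawlObserverContract
{
    /** @var array<string, int> */
    public array $statuses = [];

    public function crawled(string $url, int $statusCode): void
    {
        // Keyed by URL, so a repeated URL overwrites its earlier status.
        $this->statuses[$url] = $statusCode;
    }
}

// Fans one event out to several single-purpose observers.
final class CompositeObserver implements CrawlObserverContract
{
    /** @var CrawlObserverContract[] */
    private array $observers;

    public function __construct(CrawlObserverContract ...$observers)
    {
        $this->observers = $observers;
    }

    public function crawled(string $url, int $statusCode): void
    {
        foreach ($this->observers as $observer) {
            $observer->crawled($url, $statusCode);
        }
    }
}

$log = new LoggingObserver();
$store = new CollectingObserver();
$observer = new CompositeObserver($log, $store);
$observer->crawled('https://example.com/', 200);
$observer->crawled('https://example.com/missing', 404);
```

An adapter would still be needed to translate the package's real callback arguments into this simplified contract.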

Support

  • Debugging:
    • Leverage CrawlProgress for real-time monitoring (e.g., log to Laravel Log).
    • Use FinishReason to handle crawl termination gracefully.
  • Common Issues:
    • Timeouts: Adjust guzzle timeout settings or implement circuit breakers.
    • JS Rendering Failures: Monitor Puppeteer logs (e.g., stderr).
    • Duplicate URLs: Use ->uniqueUrls() if the installed version provides it, and back it with a database unique constraint on the url column.
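
A unique constraint only catches duplicates that are byte-for-byte identical, so it helps to normalize URLs before storing them. The function below is a hypothetical sketch of one such normalization. It lowercases the scheme and host, drops fragments and default ports, and ignores any user info. Real crawls may need additional rules (trailing slashes, query parameter order).

```php
<?php

declare(strict_types=1);

// Hypothetical normalizer: lowercases scheme and host, drops the fragment
// and default ports, and treats an empty path as "/".
function normalizeUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: {$url}");
    }

    $scheme = strtolower($parts['scheme']);
    $host = strtolower($parts['host']);
    $defaultPorts = ['http' => 80, 'https' => 443];

    // Keep the port only when it differs from the scheme's default.
    $port = '';
    if (isset($parts['port']) && ($defaultPorts[$scheme] ?? null) !== $parts['port']) {
        $port = ':' . $parts['port'];
    }

    $path = $parts['path'] ?? '/';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    return "{$scheme}://{$host}{$port}{$path}{$query}";
}

$urls = [
    'https://Example.com',
    'https://example.com/#top',
    'https://example.com:443/',
    'https://example.com/page?a=1',
];
$unique = array_values(array_unique(array_map('normalizeUrl', $urls)));
```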

Scaling

  • Horizontal Scaling:
    • Distribute crawls across Laravel queue workers.
    • Use ->limit() to batch requests (e.g., 100 URLs per job).
  • Vertical Scaling:
    • Increase max_concurrency (default: 10) for faster crawls (but monitor server load).
    • Offload JS rendering to a separate service (e.g., AWS Lambda with Puppeteer).
  • Database Load:
    • Batch inserts for CrawledUrl (e.g., DB::table('urls')->insert($batch)).
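
Batching as described above can be sketched with array_chunk(). The batchRows() helper and the batch size of 100 are illustrative assumptions; each resulting batch has the row shape that a bulk insert such as DB::table('urls')->insert($batch) expects.

```php
<?php

declare(strict_types=1);

// Hypothetical helper: turns a flat list of URLs into row batches of at
// most $size, ready to hand to a bulk insert one batch at a time.
function batchRows(array $urls, int $size = 100): array
{
    // Deduplicate first so a batch never contains the same URL twice.
    $rows = array_map(
        fn (string $url): array => ['url' => $url],
        array_values(array_unique($urls))
    );

    return array_chunk($rows, $size);
}

$urls = array_map(
    fn (int $i): string => "https://example.com/page/{$i}",
    range(1, 250)
);
$batches = batchRows($urls);
```

With 250 distinct URLs this yields three batches, which keeps each insert statement small enough to avoid oversized queries.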

Failure Modes

Failure Type        | Impact                        | Mitigation
HTTP Rate Limiting  | Crawl halts or gets blocked.  | Implement exponential backoff in callbacks.
Puppeteer Crashes   | JS rendering fails silently.  | Use health checks (e.g.,