Darvin Crawler Bundle Laravel Package

darvinstudio/darvin-crawler-bundle

Getting Started

First Steps

  1. Installation Add the bundle via Composer:

    composer require darvinstudio/darvin-crawler-bundle
    

    Register the bundle in config/bundles.php if it is not there already. Symfony Flex normally adds the entry during installation, so manual registration is only needed without Flex.

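  2. Configuration Publish the default config (if needed) and update config/packages/dev/darvin_crawler.yaml. In Symfony, files under config/packages/dev/ load only in the dev environment, so place the file in config/packages/ if the crawler also runs with --env=production or --env=staging: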

    darvin_crawler:
        default_uri: https://your-website.com  # Set your site's base URL
        blacklists:
            parse: ["/admin/", "/login/"]      # Skip parsing these paths
            visit: ["/private/", "/api/"]     # Skip crawling these paths
    
  3. First Run Execute the crawler on your default URI:

    php artisan darvin:crawler:crawl
    

    Verify output for broken links (HTTP 4xx/5xx responses).
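
The check in step 3 comes down to classifying status codes. As a minimal sketch, and not part of the bundle, the hypothetical helper is_broken_status below treats any 4xx or 5xx code as broken. It makes no assumption about the crawler's own output format:

```shell
#!/usr/bin/env bash

# is_broken_status is a hypothetical helper, not a bundle command.
# Exit status 0: code is 4xx/5xx; 1: any other valid code; 2: not a 3-digit code.
is_broken_status() {
    local code="$1"
    # Reject anything that is not exactly three digits.
    [[ "$code" =~ ^[0-9]{3}$ ]] || return 2
    # 10# forces base 10 so a leading zero is not read as octal.
    if (( 10#$code >= 400 && 10#$code <= 599 )); then
        return 0
    fi
    return 1
}
```

To spot-check a single URL by hand, the code can come from curl: is_broken_status "$(curl -s -o /dev/null -w '%{http_code}' https://your-website.com/blog)" && echo "broken".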


Implementation Patterns

Daily Workflows

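  1. Scheduled Crawling Add a cron job (Linux/macOS) or a Task Scheduler task (Windows) to run the crawler weekly. The entry below runs every Sunday at 03:00; /path/to/project is a placeholder for the application root, needed because cron does not start in the project directory:

    0 3 * * 0 cd /path/to/project && php artisan darvin:crawler:crawl --env=production >> /var/log/crawler.log 2>&1
    

    Tip: Pass --env=production so the crawl runs against production settings rather than dev-only configuration.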

  2. Targeted Crawls Focus on critical sections (e.g., public-facing routes):

    php artisan darvin:crawler:crawl https://your-website.com/blog --limit=100
    
    • --limit: Restrict depth/links (default: 50).
    • -v: Verbose mode (shows all visited links).
  3. Integration with CI/CD Trigger crawls post-deploy (e.g., GitHub Actions) to catch regressions:

    - name: Run link crawler
      run: php artisan darvin:crawler:crawl --env=staging
    
  4. Custom Output Handling Pipe results to a file or API:

    php artisan darvin:crawler:crawl | grep "ERROR" > broken_links.txt
    

    Or extend the command to log to a database (see Extension Points).
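
For the CI/CD and output-handling workflows above, a pipeline step usually needs to fail when broken links turn up. A minimal sketch, assuming (as the grep example does) that broken links appear on output lines containing ERROR; the function name report_broken_links and the report path are illustrative:

```shell
#!/usr/bin/env bash

# report_broken_links is a hypothetical helper. It reads crawler output on
# stdin and assumes broken links are on lines containing "ERROR".
# $1: report file. Returns 0 when clean, 1 when broken links exist, 2 on grep failure.
report_broken_links() {
    local report="$1"
    local status=0
    # grep exits 0 on a match, 1 on no match, and >1 on a real error.
    grep -- "ERROR" > "$report" || status=$?
    case "$status" in
        0)  echo "Broken links found: $(wc -l < "$report" | tr -d ' ')" >&2
            return 1 ;;
        1)  return 0 ;;
        *)  return 2 ;;
    esac
}
```

In a deploy step this could be used as php artisan darvin:crawler:crawl --env=staging | report_broken_links broken_links.txt. Because the function is the last command in the pipeline, a return value of 1 fails the step.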


Advanced Patterns

  1. Dynamic Blacklists Load blacklist entries from environment variables instead of hard-coding them:

    # config/packages/dev/darvin_crawler.yaml
    darvin_crawler:
        blacklists:
            parse:
                - '%env(RESERVED_PATH)%'  # Resolves to one string, so use one variable per path
    

    Set in .env:

    RESERVED_PATH=/admin/
    
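  2. Parallel Crawling Use Laravel Queues to move crawls onto background workers; dispatching one job per site section lets several workers crawl at the same time:

    // app/Console/Commands/CrawlCommand.php
    // CrawlJob is a queued job class you define yourself (e.g., app/Jobs/CrawlJob.php).
    public function handle(): void {
        // Fall back to the configured default URI when no argument is given.
        $uri = $this->argument('uri') ?? config('darvin_crawler.default_uri');
        dispatch(new CrawlJob($uri))->onQueue('crawler');
    }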
    
  3. Custom HTTP Client Override the default client (e.g., for auth headers):

    # config/packages/dev/darvin_crawler.yaml
    client:
        options:
            headers:
                Authorization: "Bearer token123"
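
For the dynamic blacklist pattern (item 1), note that a plain %env(...)% placeholder resolves to a single string, so a comma-separated value is not split into separate entries. One option is to expand such a variable into YAML list items at deploy time. A minimal sketch; paths_to_yaml_items is a hypothetical helper and RESERVED_PATHS is only an example variable name:

```shell
#!/usr/bin/env bash

# paths_to_yaml_items is a hypothetical helper, not part of the bundle.
# Prints one single-quoted YAML list item per comma-separated path in $1.
# Paths containing a single quote would need extra escaping.
paths_to_yaml_items() {
    local -a parts
    local path
    # Split on commas only; IFS is changed just for this read.
    IFS=',' read -r -a parts <<< "$1"
    for path in "${parts[@]}"; do
        # Skip empty entries such as those produced by ",,".
        if [ -n "$path" ]; then
            printf '%s\n' "- '$path'"
        fi
    done
}
```

With RESERVED_PATHS=/admin/,/private/, calling paths_to_yaml_items "$RESERVED_PATHS" prints one list item per path, ready to place under blacklists.parse.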
    

Gotchas and Tips

Common Pitfalls

  1. Rate Limiting

    • Issue: Aggressive crawling may trigger 429 responses.
    • Fix: Add delays or use --delay=100 (milliseconds) in the command.
  2. Relative URLs

    • Issue: The crawler may fail on relative links (e.g., /about).
    • Fix: Set the site's base URI explicitly with the default_uri option from the Configuration step:
      darvin_crawler:
          default_uri: https://your-website.com
      
  3. JavaScript-Rendered Content

    • Issue: Content that SPAs or JS-heavy sites render in the browser won’t be crawled, because the crawler only sees the HTML returned by the server.
    • Workaround: Use a headless browser (e.g., Puppeteer) via a custom crawler class.
  4. Blacklist Overlap

    • Issue: The parse and visit blacklists can overlap. A path listed under visit is never requested, so also listing it under parse has no effect.
    • Tip: Use visit for broad exclusions (e.g., /api/) and parse for fine-grained control (e.g., /admin/settings).
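
To make the relative-URL pitfall concrete, the sketch below shows how a link is combined with a base URI. It is a simplified illustration, not the bundle's resolution logic: resolve_link is a hypothetical name, and document-relative links are simply appended to the base rather than resolved per RFC 3986:

```shell
#!/usr/bin/env bash

# resolve_link is a hypothetical helper for illustration only.
# $1: base URI, $2: link as found in the page. Prints the absolute URL.
resolve_link() {
    local base="$1" link="$2"
    local scheme rest host
    case "$link" in
        http://*|https://*)
            # Already absolute: use as-is.
            printf '%s\n' "$link" ;;
        /*)
            # Root-relative: keep only the scheme and host of the base.
            scheme="${base%%://*}"
            rest="${base#*://}"
            host="${rest%%/*}"
            printf '%s\n' "$scheme://$host$link" ;;
        *)
            # Document-relative (simplified): append to the base as a directory.
            printf '%s\n' "${base%/}/$link" ;;
    esac
}
```

For example, resolve_link https://your-website.com /about yields an absolute URL on the same host. Without a base URI there is nothing to prepend, which is why a link like /about cannot be requested on its own.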

Debugging Tips

  1. Verbose Mode Always start with -v to inspect crawled paths:

    php artisan darvin:crawler:crawl -v
    
  2. Log Levels Enable debug logging in config/logging.php:

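    'channels' => [
        'single' => [
            'driver' => 'single',
            'path' => storage_path('logs/laravel.log'),
            'level' => 'debug',  // Record everything from debug level upward
        ],
    ],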
    
  3. HTTP Debugging Use --dump-server to inspect raw responses:

    php artisan darvin:crawler:crawl --dump-server
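
Verbose output scrolls past quickly, so it can help to keep a copy while debugging. A minimal sketch; run_logged is a hypothetical wrapper that works with any command, appends its output to a log file, and preserves its exit status:

```shell
#!/usr/bin/env bash

# run_logged is a hypothetical wrapper, not part of the bundle.
# $1: log file; remaining arguments: the command to run.
run_logged() {
    local log="$1"
    shift
    # Show stdout and stderr on screen and append both to the log.
    "$@" 2>&1 | tee -a "$log"
    # PIPESTATUS[0] is the command's exit status, not tee's.
    return "${PIPESTATUS[0]}"
}
```

For example: run_logged crawl-debug.log php artisan darvin:crawler:crawl -v.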
    

Extension Points

  1. Custom Crawler Class Override the default crawler to add logic (e.g., authentication):

    // src/Crawler/CustomCrawler.php
    namespace App\Crawler;
    
    use DarvinCrawlerBundle\Crawler\Crawler;
    
    class CustomCrawler extends Crawler {
        protected function getClientOptions() {
            $options = parent::getClientOptions();
            $options['headers']['X-Custom-Header'] = 'value';
            return $options;
        }
    }
    

    Bind it in config/packages/dev/darvin_crawler.yaml:

    crawler_class: App\Crawler\CustomCrawler
    
  2. Event Listeners Listen for darvin_crawler.link_found or darvin_crawler.link_error events to log to a database:

    // app/Providers/EventServiceProvider.php
    protected $listen = [
        'darvin_crawler.link_error' => [
            \App\Listeners\LogBrokenLink::class,
        ],
    ];
    
  3. Output Formatters Extend the command to output JSON or CSV:

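    // app/Console/Commands/CrawlCommand.php
    protected function formatOutput(array $links): string {
        // JSON_THROW_ON_ERROR raises a JsonException instead of silently returning false.
        return json_encode($links, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR);
    }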
    

Pro Tips

  • Test Locally First: Use --limit=10 to validate before full crawls.
  • Monitor Performance: Large sites may hit memory limits; increase memory_limit in php.ini.
  • Combine with Other Tools: Use alongside Laravel Debugbar or Sentry for deeper insights.
  • Exclude Assets: Add blacklists for /images/, /css/, /js/ to speed up crawls.