darvinstudio/darvin-crawler-bundle
Installation
Add the bundle via Composer:
composer require darvinstudio/darvin-crawler-bundle
Register the bundle in config/bundles.php if it was not registered automatically (Symfony Flex usually adds the entry during installation).
Configuration
Publish the default config (if needed) and update config/packages/dev/darvin_crawler.yaml:
darvin_crawler:
    default_uri: https://your-website.com # Set your site's base URL
    blacklists:
        parse: ["/admin/", "/login/"] # Skip parsing these paths
        visit: ["/private/", "/api/"] # Skip crawling these paths
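To preview which paths a blacklist would skip before running a crawl, the matching can be approximated in a few lines of shell. This is only a sketch: it assumes plain substring matching, which may differ from how the bundle actually evaluates blacklist entries, and is_blacklisted is a hypothetical helper name, not part of the bundle.

```shell
# Hypothetical helper, not part of the bundle.
# Usage: is_blacklisted PATH ENTRY [ENTRY...]
# Succeeds (exit 0) when PATH contains any of the given blacklist entries.
# Assumption: entries are matched as plain substrings.
is_blacklisted() {
    path="$1"
    shift
    for entry in "$@"; do
        case "$path" in
            # Quoting the entry makes the shell treat it literally, not as a glob.
            *"$entry"*) return 0 ;;
        esac
    done
    return 1
}
```

With the entries from the config above, is_blacklisted /admin/users "/admin/" "/login/" succeeds, while is_blacklisted /blog/post "/admin/" "/login/" fails.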
First Run
Execute the crawler on your default URI:
php artisan darvin:crawler:crawl
Verify output for broken links (HTTP 4xx/5xx responses).
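When checking status codes by hand or in a script, the 4xx/5xx rule can be written as a small shell helper. The name is_broken_status is hypothetical and not something the bundle provides.

```shell
# Hypothetical helper, not part of the bundle.
# Succeeds (exit 0) when the given HTTP status code is a client error (4xx)
# or a server error (5xx); fails for anything else.
is_broken_status() {
    case "$1" in
        4[0-9][0-9]|5[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_broken_status 404 succeeds, while is_broken_status 301 fails because redirects are not treated as broken here.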
Scheduled Crawling
Add a cron job (Linux/macOS) or a Task Scheduler task (Windows) to run the crawler weekly. The entry below runs every Sunday at 03:00:
0 3 * * 0 php artisan darvin:crawler:crawl --env=production >> /var/log/crawler.log 2>&1
Tip: Use --env=production to avoid dev-specific issues.
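Because the cron entry appends every run to one log file, it can help to timestamp each line so weekly runs are easy to tell apart. A minimal sketch, assuming it lives in a wrapper script that cron calls; with_timestamps is a hypothetical helper, not part of the bundle.

```shell
# Hypothetical helper, not part of the bundle.
# Reads lines on stdin and prints each one prefixed with a UTC timestamp.
with_timestamps() {
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$line"
    done
}
```

Inside the wrapper script the crawl would then be run as: php artisan darvin:crawler:crawl --env=production 2>&1 | with_timestamps >> /var/log/crawler.log. Keeping the date format in a script rather than in the crontab line avoids having to escape percent signs, which cron treats specially.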
Targeted Crawls
Focus on critical sections (e.g., public-facing routes):
php artisan darvin:crawler:crawl https://your-website.com/blog --limit=100
--limit: Restrict depth/links (default: 50).
-v: Verbose mode (shows all visited links).
Integration with CI/CD
Trigger crawls post-deploy (e.g., GitHub Actions) to catch regressions:
- name: Run link crawler
  run: php artisan darvin:crawler:crawl --env=staging
Custom Output Handling
Pipe results to a file or API:
php artisan darvin:crawler:crawl | grep "ERROR" > broken_links.txt
Or extend the command to log to a database (see Extension Points).
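The grep pipeline above can also be wrapped so that a scheduled or CI run fails when broken links are found. The sketch below assumes, as the grep example does, that broken links appear on output lines containing "ERROR"; collect_broken_links is a hypothetical function name, not part of the bundle.

```shell
# Hypothetical wrapper, not part of the bundle.
# Usage: <crawler output> | collect_broken_links REPORT_FILE
# Writes every stdin line containing "ERROR" to REPORT_FILE and returns 1
# when at least one such line was found, 0 otherwise.
collect_broken_links() {
    report_file="$1"
    # grep exits 0 only when it matched something, i.e. broken links exist.
    if grep "ERROR" > "$report_file"; then
        count="$(wc -l < "$report_file" | tr -d ' ')"
        echo "Broken links found: $count (see $report_file)" >&2
        return 1
    fi
    return 0
}
```

It would be used as: php artisan darvin:crawler:crawl | collect_broken_links broken_links.txt. A non-zero exit status then marks the cron job or pipeline step as failed.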
Dynamic Blacklists
Load blacklists from a database or config file:
# config/packages/dev/darvin_crawler.yaml
darvin_crawler:
    blacklists:
        parse:
            - '%env(RESERVED_PATHS)%' # Use env vars for flexibility
Set in .env:
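RESERVED_PATHS=/admin/,/private/
Note that a plain %env(...)% placeholder resolves to a single string, so the value above is passed as one entry that contains a comma. To supply several paths, use one variable per list entry.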
Parallel Crawling
Use Laravel Queues to distribute crawling across workers:
// app/Console/Commands/CrawlCommand.php
public function handle() {
    $uri = $this->argument('uri') ?? config('darvin_crawler.default_uri');
    dispatch(new CrawlJob($uri))->onQueue('crawler');
}
Custom HTTP Client
Override the default client (e.g., for auth headers):
# config/packages/dev/darvin_crawler.yaml
darvin_crawler:
    client:
        options:
            headers:
                Authorization: "Bearer token123"
Rate Limiting
--delay=100 (milliseconds) in the command.
Relative URLs
/about).
darvin_crawler:
    base_uri: https://your-website.com
JavaScript-Rendered Content
Blacklist Overlap
parse and visit blacklists may conflict. Use visit for broad exclusions (e.g., /api/) and parse for fine-grained control (e.g., /admin/settings).
Verbose Mode
Always start with -v to inspect crawled paths:
php artisan darvin:crawler:crawl -v
Log Levels
Enable debug logging in config/logging.php:
'channels' => [
    'single' => [
        'level' => 'debug',
    ],
],
HTTP Debugging
Use --dump-server to inspect raw responses:
php artisan darvin:crawler:crawl --dump-server
Custom Crawler Class
Override the default crawler to add logic (e.g., authentication):
// src/Crawler/CustomCrawler.php
namespace App\Crawler;

use DarvinCrawlerBundle\Crawler\Crawler;

class CustomCrawler extends Crawler {
    protected function getClientOptions() {
        $options = parent::getClientOptions();
        $options['headers']['X-Custom-Header'] = 'value';
        return $options;
    }
}
Bind it in config/packages/dev/darvin_crawler.yaml:
crawler_class: App\Crawler\CustomCrawler
Event Listeners
Listen for darvin_crawler.link_found or darvin_crawler.link_error events to log to a database:
// app/Providers/EventServiceProvider.php
protected $listen = [
    'darvin_crawler.link_error' => [
        \App\Listeners\LogBrokenLink::class,
    ],
];
Output Formatters
Extend the command to output JSON or CSV:
// app/Console/Commands/CrawlCommand.php
protected function formatOutput(array $links) {
    return json_encode($links);
}
--limit=10 to validate before full crawls.
memory_limit in php.ini.
/images/, /css/, /js/ to speed up crawls.