Installation
composer require contextualcode/crawler
The package supports MySQL, PostgreSQL, and SQLite; make sure your project has a database and configure the connection in .env.
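For example, a typical MySQL connection in .env (all values below are placeholders):
DB_CONNECTION=mysql
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=crawler
DB_USERNAME=crawler
DB_PASSWORD=secret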
Publish Config & Migrations
php artisan vendor:publish --provider="ContextualCode\Crawler\CrawlerServiceProvider"
php artisan migrate
This sets up the crawls, pages, and links tables.
First Crawl
Define a seed URL and start a crawl:
use ContextualCode\Crawler\Crawler;

$crawler = new Crawler();
$crawler->addSeed('https://example.com')
    ->setDepth(2)       // Limit crawl depth
    ->setConcurrency(5) // Parallel requests
    ->crawl();
Access Results
Fetch crawled pages via Eloquent:
use ContextualCode\Crawler\Models\Page;
$pages = Page::where('url', 'like', '%example.com%')->get();
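For instance, assuming the title attribute used later in these examples, you can iterate the results like this:
foreach (Page::where('url', 'like', '%example.com%')->cursor() as $page) {
    // cursor() streams results to keep memory usage low on large crawls
    echo $page->title . ' (' . $page->url . ')' . PHP_EOL;
}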
Define Rules
Use filters to control what gets crawled:
$crawler->addFilter(function ($url) {
    return strpos($url, 'admin') === false; // Skip admin pages
});
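Filters can be combined; for instance, a sketch of a filter that keeps the crawl on a single host (the host is just an example value):
$crawler->addFilter(function ($url) {
    // Only follow URLs on the seed's host
    return parse_url($url, PHP_URL_HOST) === 'example.com';
});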
Extract Data
Process pages with a callback:
$crawler->onPage(function ($page) {
    $title = $page->title;
    $content = $page->content;
    // Store in DB or process further
});
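If you want to persist selected fields somewhere else, the same callback can write to one of your own models; Article below is a hypothetical application model:
use App\Models\Article; // hypothetical application model

$crawler->onPage(function ($page) {
    // Store a subset of the crawled data in an application table
    Article::updateOrCreate(
        ['source_url' => $page->url],
        ['title' => $page->title, 'body' => $page->content]
    );
});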
Resumable Crawls
Pause and resume crawls using:
$crawl = $crawler->start(); // Returns a Crawl model
$crawl->pause();
$crawl->resume();
Queue Crawls
Dispatch crawls to Laravel queues for background processing:
CrawlerJob::dispatch('https://example.com')->delay(now()->addMinutes(5));
Custom Storage
Extend the Page model to add custom fields:
php artisan make:model PageExtension --extends="ContextualCode\Crawler\Models\Page"
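If that generator option isn't available in your setup, the equivalent class can be written by hand. A minimal sketch, assuming you keep reading from the package's pages table and only add behaviour on top:
namespace App\Models;

use ContextualCode\Crawler\Models\Page;

class PageExtension extends Page
{
    // Keep using the package's pages table instead of a derived table name
    protected $table = 'pages';

    // Illustrative accessor: word count derived from the stored content
    public function getWordCountAttribute(): int
    {
        return str_word_count(strip_tags((string) $this->content));
    }
}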
Rate Limiting
Configure delays between requests:
$crawler->setDelay(1000); // Delay between requests in milliseconds (1000 = 1 second)
Database Locks
High concurrency may cause database deadlocks. Keep transactions inside your callbacks short, or reduce the setConcurrency() value.
Duplicate URLs
The package deduplicates URLs, but ensure your filters don't reintroduce duplicates.
Dynamic Content
JavaScript-rendered pages won't be crawled by default. Use a headless browser (e.g., Puppeteer) via a custom PageFetcher (see Custom Fetchers below).
Log Crawl Status
Enable logging in config/crawler.php:
'log' => [
    'enabled' => true,
    'channel' => 'single',
],
Inspect Failed Requests
Check the failed_requests table, or register an error callback:
$crawler->onError(function ($url, $exception) {
    \Log::error("Crawl error on {$url}: " . $exception->getMessage());
});
Custom Fetchers
Implement ContextualCode\Crawler\Contracts\PageFetcher for non-HTTP sources (e.g., APIs).
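For JavaScript-heavy sites (see Dynamic Content above), one approach is to delegate fetching to a headless browser via spatie/browsershot, which drives Puppeteer. The contract's exact method signature isn't documented here, so the fetch() method below is an assumption to adapt:
namespace App\Crawler;

use ContextualCode\Crawler\Contracts\PageFetcher;
use Spatie\Browsershot\Browsershot;

class BrowsershotFetcher implements PageFetcher
{
    // Method name and return type are assumptions; match them to the actual contract.
    public function fetch(string $url): string
    {
        // Render the page in a headless browser so JavaScript-generated content is included
        return Browsershot::url($url)->bodyHtml();
    }
}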
Post-Processing
Use onPage or onLink events to transform data before storage.
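For example, a small onPage transform (assuming the $page attributes can still be modified at this point) that tidies the title before it is saved:
$crawler->onPage(function ($page) {
    // Trim whitespace and collapse internal spacing in the title before storage
    $page->title = preg_replace('/\s+/', ' ', trim((string) $page->title));
});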
Sitemap Generation
Export crawled URLs to a sitemap:
$urls = Page::pluck('url');
// Use a package like spatie/sitemap to generate XML
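With spatie/sitemap installed, a minimal sketch could look like this:
use ContextualCode\Crawler\Models\Page;
use Spatie\Sitemap\Sitemap;
use Spatie\Sitemap\Tags\Url;

$sitemap = Sitemap::create();

// Add every crawled URL to the sitemap
foreach (Page::pluck('url') as $url) {
    $sitemap->add(Url::create($url));
}

$sitemap->writeToFile(public_path('sitemap.xml'));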