Installation
composer require contextualcode/crawler
The package supports MySQL, PostgreSQL, and SQLite; make sure your project has a database and configure the connection in .env.
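For example, a typical MySQL connection in .env (all values below are placeholders):
DB_CONNECTION=mysql
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=crawler
DB_USERNAME=crawler
DB_PASSWORD=secret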
Publish Config & Migrations
php artisan vendor:publish --provider="ContextualCode\Crawler\CrawlerServiceProvider"
php artisan migrate
This sets up the crawls, pages, and links tables.
First Crawl
Define a seed URL and start a crawl:
use ContextualCode\Crawler\Crawler;

$crawler = new Crawler();
$crawler->addSeed('https://example.com')
    ->setDepth(2)       // Limit crawl depth
    ->setConcurrency(5) // Parallel requests
    ->crawl();
Access Results
Fetch crawled pages via Eloquent:
use ContextualCode\Crawler\Models\Page;
$pages = Page::where('url', 'like', '%example.com%')->get();
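For instance, assuming the title attribute used later in these examples, you can iterate the results like this:
foreach (Page::where('url', 'like', '%example.com%')->cursor() as $page) {
    // cursor() streams results to keep memory usage low on large crawls
    echo $page->title . ' (' . $page->url . ')' . PHP_EOL;
}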
Define Rules
Use filters to control what gets crawled:
$crawler->addFilter(function ($url) {
    return strpos($url, 'admin') === false; // Skip admin pages
});
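Filters can be combined; for instance, a sketch of a filter that keeps the crawl on a single host (the host is just an example value):
$crawler->addFilter(function ($url) {
    // Only follow URLs on the seed's host
    return parse_url($url, PHP_URL_HOST) === 'example.com';
});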
Extract Data
Process pages with a callback:
$crawler->onPage(function ($page) {
    $title = $page->title;
    $content = $page->content;
    // Store in DB or process further
});
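If you want to persist selected fields somewhere else, the same callback can write to one of your own models; Article below is a hypothetical application model:
use App\Models\Article; // hypothetical application model

$crawler->onPage(function ($page) {
    // Store a subset of the crawled data in an application table
    Article::updateOrCreate(
        ['source_url' => $page->url],
        ['title' => $page->title, 'body' => $page->content]
    );
});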
Resumable Crawls
Pause and resume crawls using:
$crawl = $crawler->start(); // Returns a Crawl model
$crawl->pause();
$crawl->resume();
Queue Crawls
Dispatch crawls to Laravel queues for background processing:
CrawlerJob::dispatch('https://example.com')->delay(now()->addMinutes(5));
Custom Storage
Extend the Page model to add custom fields:
php artisan make:model PageExtension --extends="ContextualCode\Crawler\Models\Page"
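If that generator option isn't available in your setup, the equivalent class can be written by hand. A minimal sketch, assuming you keep reading from the package's pages table and only add behaviour on top:
namespace App\Models;

use ContextualCode\Crawler\Models\Page;

class PageExtension extends Page
{
    // Keep using the package's pages table instead of a derived table name
    protected $table = 'pages';

    // Illustrative accessor: word count derived from the stored content
    public function getWordCountAttribute(): int
    {
        return str_word_count(strip_tags((string) $this->content));
    }
}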
Rate Limiting
Configure delays between requests:
$crawler->setDelay(1000); // Delay between requests in milliseconds (1000 = 1 second)
Database Locks
High concurrency may cause database deadlocks. Keep transactions inside your callbacks short, or reduce the setConcurrency() value.
Duplicate URLs
The package deduplicates URLs, but ensure your filters don't reintroduce duplicates.
Dynamic Content
JavaScript-rendered pages won't be crawled by default. Use a headless browser (e.g., Puppeteer) via a custom PageFetcher (see Custom Fetchers below).
Log Crawl Status
Enable logging in config/crawler.php:
'log' => [
    'enabled' => true,
    'channel' => 'single',
],
Inspect Failed Requests
Check the failed_requests table, or register an error callback:
$crawler->onError(function ($url, $exception) {
    \Log::error("Crawl error on {$url}: " . $exception->getMessage());
});
Custom Fetchers
Implement ContextualCode\Crawler\Contracts\PageFetcher for non-HTTP sources (e.g., APIs).
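For JavaScript-heavy sites (see Dynamic Content above), one approach is to delegate fetching to a headless browser via spatie/browsershot, which drives Puppeteer. The contract's exact method signature isn't documented here, so the fetch() method below is an assumption to adapt:
namespace App\Crawler;

use ContextualCode\Crawler\Contracts\PageFetcher;
use Spatie\Browsershot\Browsershot;

class BrowsershotFetcher implements PageFetcher
{
    // Method name and return type are assumptions; match them to the actual contract.
    public function fetch(string $url): string
    {
        // Render the page in a headless browser so JavaScript-generated content is included
        return Browsershot::url($url)->bodyHtml();
    }
}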
Post-Processing
Use onPage or onLink events to transform data before storage.
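For example, a small onPage transform (assuming the $page attributes can still be modified at this point) that tidies the title before it is saved:
$crawler->onPage(function ($page) {
    // Trim whitespace and collapse internal spacing in the title before storage
    $page->title = preg_replace('/\s+/', ' ', trim((string) $page->title));
});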
Sitemap Generation
Export crawled URLs to a sitemap:
$urls = Page::pluck('url');
// Use a package like spatie/sitemap to generate XML
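With spatie/sitemap installed, a minimal sketch could look like this:
use ContextualCode\Crawler\Models\Page;
use Spatie\Sitemap\Sitemap;
use Spatie\Sitemap\Tags\Url;

$sitemap = Sitemap::create();

// Add every crawled URL to the sitemap
foreach (Page::pluck('url') as $url) {
    $sitemap->add(Url::create($url));
}

$sitemap->writeToFile(public_path('sitemap.xml'));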