Getting Started

Minimal Setup

Installation:
```
composer require spatie/crawler
```

First Crawl:

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "Crawled: {$url} (Status: {$response->status()})";
    })
    ->start();

Key First Use Cases:

Collect all internal URLs:

$urls = Crawler::create('https://example.com')
    ->internalOnly()
    ->depth(2)
    ->foundUrls();

Test locally with fake responses:

Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<html><a href="/about">About</a></html>',
        'https://example.com/about' => '<html>About page</html>',
    ])
    ->foundUrls();

Implementation Patterns

Core Workflows

Structured Crawling with Observers:

class PageLogger extends CrawlObserver {
    public function crawled(string $url, CrawlResponse $response) {
        Log::info("Crawled {$url} with status {$response->status()}");
    }
}

Crawler::create('https://example.com')
    ->addObserver(new PageLogger())
    ->start();

Progress Tracking:

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response, CrawlProgress $progress) {
        echo "Progress: {$progress->urlsProcessed}/{$progress->urlsFound}\n";
    })
    ->start();

Dynamic Crawl Control:

$shouldStop = false;
Crawler::create('https://example.com')
    ->shouldStopCallback(function (Crawler $crawler) use (&$shouldStop) {
        return $shouldStop;
    })
    ->onCrawled(function (string $url) use (&$shouldStop) {
        if ($url === 'https://example.com/stop') {
            $shouldStop = true;
        }
    })
    ->start();

Integration Tips

Laravel Service Provider:

public function register() {
    $this->app->singleton(Crawler::class, function () {
        return new Crawler();
    });
}

Queue Jobs for Large Crawls:

class CrawlJob implements ShouldQueue {
    public function handle() {
        Crawler::create('https://example.com')
            ->limit(1000)
            ->start();
    }
}

Store Results in Database:

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        Page::updateOrCreate(['url' => $url], [
            'status_code' => $response->status(),
            'content' => $response->body(),
        ]);
    })
    ->start();

Gotchas and Tips

Common Pitfalls

JavaScript Rendering Overhead:
- Enabling JavaScript (->withBrowser()) significantly increases crawl time. Use only when necessary.
- Tip: Test with ->fake() first to validate logic before enabling JS.
Rate Limiting:
- Default concurrency (10) may trigger rate limits. Adjust with:
```
->concurrency(2) // Lower for sensitive sites
```
Robots.txt Ignored by Default:
- To respect robots.txt, use:
```
->respectRobotsTxt()
```
Duplicate URLs:
- Use ->uniqueUrls() to avoid processing the same URL multiple times.
Memory Leaks:
- Large crawls may exhaust memory. Use ->limit() and process results incrementally.

Debugging Tips

Log Failed Requests:

->onFailed(function (string $url, RequestException $exception) {
    Log::error("Failed to crawl {$url}: {$exception->getMessage()}");
})

Inspect Transfer Stats:

->onFailed(function (string $url, RequestException $exception, TransferStatistics $stats) {
    if ($stats->transferTimeInMs() > 10000) {
        Log::warning("Slow crawl: {$url} took {$stats->transferTimeInMs()}ms");
    }
})

Extension Points

Custom URL Filtering:

->filterUrls(function (string $url, string $foundOnUrl) {
    return str_contains($url, 'important-section');
})

Modify Request Headers:

->withOptions([
    'headers' => [
        'User-Agent' => 'MyCrawler/1.0',
        'Accept-Language' => 'en-US',
    ],
])

Handle Redirects:

->followRedirects(false) // Disable by default
->onRedirect(function (string $from, string $to) {
    Log::info("Redirect: {$from} -> {$to}");
})

Custom Response Processing:

->onCrawled(function (string $url, CrawlResponse $response) {
    $dom = new \DomDocument();
    @$dom->loadHTML($response->body());
    // Process DOM here
})

Performance Optimization

Cache Responses:

->withCache(new FileCache(storage_path('crawler_cache')))

Prioritize URLs:

->prioritizeUrls(['/high-priority-page'])

Disable Unnecessary Features:

->withoutBrowser() // Skip JS rendering
->withoutCurl()    // Use Guzzle only

Crawler Laravel Package