Install the package:
composer require terminal42/escargot
Define base URIs and queue:
use Terminal42\Escargot\Escargot;
use Terminal42\Escargot\BaseUriCollection;
use Terminal42\Escargot\Queue\InMemoryQueue;
use Nyholm\Psr7\Uri;
$baseUris = new BaseUriCollection();
$baseUris->add(new Uri('https://example.com'));
$queue = new InMemoryQueue();
$escargot = Escargot::create($baseUris, $queue);
Add subscribers (e.g., for HTML crawling):
$escargot->addSubscriber(new \Terminal42\Escargot\Subscriber\RobotsSubscriber());
$escargot->addSubscriber(new \Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber());
$escargot->addSubscriber(new class implements \Terminal42\Escargot\Subscriber\SubscriberInterface {
    public function shouldRequest($crawlUri, $currentDecision) {
        return self::DECISION_POSITIVE; // Always crawl
    }
    public function needsContent($crawlUri, $response, $chunk, $currentDecision) {
        return self::DECISION_POSITIVE; // Always load content
    }
    public function onLastChunk($crawlUri, $response, $chunk) {
        // Process response (e.g., save to DB)
    }
});
Start crawling:
$escargot->crawl();
A complete basic crawler, combining the steps above:
// Requires the use statements from the setup step, plus:
use Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber;
use Terminal42\Escargot\Subscriber\RobotsSubscriber;
use Terminal42\Escargot\Subscriber\SubscriberInterface;
// Initialize with a single URL
$escargot = Escargot::create(
    (new BaseUriCollection())->add(new Uri('https://example.com')),
    new InMemoryQueue()
);
// Add built-in subscribers for robots.txt and HTML parsing
$escargot->addSubscriber(new RobotsSubscriber());
$escargot->addSubscriber(new HtmlCrawlerSubscriber());
// Add a subscriber to process responses
$escargot->addSubscriber(new class implements SubscriberInterface {
    public function shouldRequest($crawlUri, $currentDecision) {
        return self::DECISION_POSITIVE; // Crawl everything
    }
    public function needsContent($crawlUri, $response, $chunk, $currentDecision) {
        return self::DECISION_POSITIVE; // Load full content
    }
    public function onLastChunk($crawlUri, $response, $chunk) {
        echo "Crawled: " . $crawlUri->getUri() . "\n";
        // Extract data (e.g., with Symfony's DomCrawler)
    }
});
// Run the crawler
$escargot->crawl();
InMemoryQueue (ephemeral, fast).
$queue = new InMemoryQueue();
DoctrineQueue (persistent, database-backed).
$queue = new DoctrineQueue($pdoConnection);
LazyQueue (works in memory first, writes to the database only on commit).
$queue = new LazyQueue(new InMemoryQueue(), new DoctrineQueue($pdoConnection));
$escargot->crawl(); // Work in-memory first
$queue->commit($jobId); // Persist to DB later
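The idea behind the lazy approach, buffering in memory and writing to persistent storage only on an explicit commit, can be illustrated without Escargot at all. The sketch below is a toy model and not the library's LazyQueue: `ArrayStore` and `ToyLazyQueue` are hypothetical names, and a plain array stands in for the database table.

```php
<?php

/** Hypothetical stand-in for persistent storage; a real store would write to a database. */
final class ArrayStore
{
    /** @var array<string, string[]> URIs keyed by job ID */
    public $rows = [];
}

/** Toy two-tier queue: buffers URIs in memory and copies them to the store on commit(). */
final class ToyLazyQueue
{
    /** @var array<string, string[]> URIs not yet persisted, keyed by job ID */
    private $buffer = [];
    private $store;

    public function __construct(ArrayStore $store)
    {
        $this->store = $store;
    }

    public function add(string $jobId, string $uri): void
    {
        $this->buffer[$jobId][] = $uri;
    }

    /** Copies everything buffered for $jobId into the store; returns the number of URIs written. */
    public function commit(string $jobId): int
    {
        $pending = $this->buffer[$jobId] ?? [];
        foreach ($pending as $uri) {
            $this->store->rows[$jobId][] = $uri;
        }
        unset($this->buffer[$jobId]);

        return count($pending);
    }
}

$store = new ArrayStore();
$queue = new ToyLazyQueue($store);
$queue->add('job-1', 'https://example.com/');
$queue->add('job-1', 'https://example.com/about');
$storedBeforeCommit = count($store->rows); // Nothing has been persisted yet
$written = $queue->commit('job-1');        // Both URIs are copied to the store
```

The trade-off is the usual one for write-behind buffers: crawling stays fast, but anything not yet committed is lost if the process dies.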
Use shouldRequest() to control which URLs are crawled:
$escargot->addSubscriber(new class implements SubscriberInterface {
    public function shouldRequest($crawlUri, $currentDecision) {
        return $crawlUri->getUri()->getHost() === 'example.com'
            ? self::DECISION_POSITIVE
            : self::DECISION_NEGATIVE;
    }
    // ... other methods
});
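The host comparison itself is easy to get subtly wrong (case differences, relative URLs without a host), so it can help to pull it into a small function that is testable on its own. The following is a minimal sketch using only PHP's built-in `parse_url()`; `isAllowedHost()` is a hypothetical helper name, not part of Escargot.

```php
<?php

/**
 * Hypothetical helper (not part of Escargot): returns true when the host of
 * $url matches one of $allowedHosts, compared case-insensitively.
 */
function isAllowedHost(string $url, array $allowedHosts): bool
{
    $host = parse_url($url, PHP_URL_HOST);

    // parse_url() yields null when there is no host and false for malformed URLs.
    if (!is_string($host) || $host === '') {
        return false;
    }

    $allowed = array_map('strtolower', $allowedHosts);

    return in_array(strtolower($host), $allowed, true);
}
```

Inside a subscriber, a check like this could replace the inline comparison, for example by passing `(string) $crawlUri->getUri()` as the URL.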
Use needsContent() to skip response bodies you do not need, for example anything other than a 200 response:
public function needsContent($crawlUri, $response, $chunk, $currentDecision) {
    return $response->getStatusCode() === 200
        ? self::DECISION_POSITIVE
        : self::DECISION_NEGATIVE;
}
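The status check above does not look at the size or type of the response. If the aim is also to avoid downloading large or non-HTML bodies, the decision could take the response headers into account. The sketch below shows that logic in isolation; `wantsBody()` is a hypothetical helper, the 1 MiB default is an arbitrary assumption, and `$headers` is assumed to map lower-cased header names to a single string value.

```php
<?php

/**
 * Hypothetical helper: decides from the status code and headers alone
 * whether a response body is worth downloading.
 */
function wantsBody(int $statusCode, array $headers, int $maxBytes = 1048576): bool
{
    if ($statusCode !== 200) {
        return false;
    }

    // Only HTML is of interest in this sketch; a missing Content-Type is rejected too.
    $contentType = strtolower($headers['content-type'] ?? '');
    if (strpos($contentType, 'text/html') !== 0) {
        return false;
    }

    // A missing Content-Length is accepted; only a declared oversize body is rejected.
    if (isset($headers['content-length']) && (int) $headers['content-length'] > $maxBytes) {
        return false;
    }

    return true;
}
```

A needsContent() implementation could then map the boolean result to DECISION_POSITIVE or DECISION_NEGATIVE.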
Add metadata to URIs via tags (e.g., skip nofollow links):
$escargot->addSubscriber(new class implements SubscriberInterface {
    public function shouldRequest($crawlUri, $currentDecision) {
        return !$crawlUri->hasTag('nofollow')
            ? self::DECISION_POSITIVE
            : self::DECISION_NEGATIVE;
    }
    // ...
});
Register Escargot as a singleton in AppServiceProvider:
public function register()
{
    $this->app->singleton(Escargot::class, function ($app) {
        $queue = new DoctrineQueue($app['db']->connection()->getPdo());
        return Escargot::create(
            (new BaseUriCollection())->add(new Uri(config('escargot.start_url'))),
            $queue
        );
    });
}
Create a crawler command:
php artisan make:command CrawlWebsites
// In app/Console/Commands/CrawlWebsites.php
public function handle()
{
    $escargot = app(Escargot::class);
    $escargot->addSubscriber(new MySubscriber());
    $escargot->crawl();
}
Resume a paused crawl:
$escargot = Escargot::createFromJobId($jobId, $queue);
$escargot->crawl(); // Continues from where it left off
Handle transport and HTTP errors (a class implementing ExceptionSubscriberInterface must define both methods):
$escargot->addSubscriber(new class implements SubscriberInterface, ExceptionSubscriberInterface {
    // Called when the request itself fails (e.g., DNS error, timeout)
    public function onTransportException($crawlUri, $exception, $response) {
        Log::error("Failed to fetch {$crawlUri}: " . $exception->getMessage());
    }
    // Called for HTTP error responses (e.g., 4xx, 5xx)
    public function onHttpException($crawlUri, $exception, $response, $chunk) {
        if ($response->getStatusCode() === 404) {
            Log::debug("Page not found: {$crawlUri}");
        }
    }
    // ... shouldRequest(), needsContent(), onLastChunk()
});
Store dynamic data (e.g., API responses) without bloating the queue:
// Subscriber 1: Adds a tag
$escargot->addSubscriber(new class implements SubscriberInterface {
    public function onLastChunk($crawlUri, $response, $chunk) {
        $crawlUri->addTag('api-data', 'lazy');
    }
    // ... other methods
});
// Subscriber 2: Resolves the tag
$escargot->addSubscriber(new class implements SubscriberInterface, TagValueResolvingSubscriberInterface {
    public function resolveTagValue($tag, $crawlUri) {
        if ($tag === 'api-data') {
            // fetchExternalData() is a placeholder for your own lookup
            return $this->fetchExternalData($crawlUri->getUri());
        }
        return null; // Not a tag this subscriber resolves
    }
    // ... other methods
});
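Resolving a value on demand usually goes together with caching it, so that an expensive lookup runs at most once per tag and URI. The sketch below shows that pattern in plain PHP; `LazyValueCache` and its resolver callback are hypothetical and independent of Escargot's own interfaces.

```php
<?php

/**
 * Hypothetical memoizing wrapper: calls $resolver the first time a
 * (tag, URI) pair is requested and returns the cached value afterwards.
 */
final class LazyValueCache
{
    private $resolver;
    private $cache = [];

    public function __construct(callable $resolver)
    {
        $this->resolver = $resolver;
    }

    public function get(string $tag, string $uri)
    {
        // A newline cannot occur in a valid URI, so it is a safe key separator.
        $key = $tag . "\n" . $uri;

        // array_key_exists() also treats a cached null as a hit.
        if (!array_key_exists($key, $this->cache)) {
            $this->cache[$key] = ($this->resolver)($tag, $uri);
        }

        return $this->cache[$key];
    }
}
```

A resolveTagValue() implementation could delegate to such a cache, passing the tag name and the URI string, so that repeated lookups for the same page do not trigger repeated external calls.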
InMemoryQueue: Data is lost on process exit. Use only for testing.
DoctrineQueue: Requires a PDO connection. Configure in Laravel via:
$queue = new DoctrineQueue($app['db']->connection()->getPdo());
LazyQueue: Call $queue->commit($jobId) explicitly to persist data.
HttpClient with concurrency:
$client = new CurlHttpClient([], 10); // Second argument: maximum connections per host
$escargot = Escargot::create($baseUris, $queue)->withHttpClient($client); // Assumes Escargot's withHttpClient() setter
Rate-limit requests with a subscriber:
$escargot->addSubscriber(new class implements SubscriberInterface {
    private $lastRequestTime = 0;
    public function shouldRequest($crawlUri, $currentDecision) {
        $now = time();
        if ($now - $this->lastRequestTime < 1) { // 1 request per second
            return self::DECISION_NEGATIVE; // Over the limit: vote against this URI
        }
        $this->lastRequestTime = $now;
        return self::DECISION_POSITIVE;
    }
    // ...
});
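The subscriber above votes against any URI that arrives within a second of the previous one, so it rejects those URIs rather than delaying them. If the goal is to slow the crawl down without dropping pages, one option is to compute a wait time instead. The sketch below shows only that calculation; `RequestThrottle` is a hypothetical class, and the current time is passed in as a parameter so the behaviour is deterministic.

```php
<?php

/**
 * Hypothetical throttle: spaces consecutive requests at least
 * $minInterval seconds apart by telling the caller how long to wait.
 */
final class RequestThrottle
{
    private $minInterval;
    private $nextAllowedAt = 0.0;

    public function __construct(float $minInterval)
    {
        $this->minInterval = $minInterval;
    }

    /** Returns the seconds to wait before a request issued at $now, and reserves that slot. */
    public function reserve(float $now): float
    {
        // The request starts either immediately or at the next free slot, whichever is later.
        $startAt = max($now, $this->nextAllowedAt);
        $this->nextAllowedAt = $startAt + $this->minInterval;

        return $startAt - $now;
    }
}
```

A caller would pass `microtime(true)` to `reserve()` and sleep for the returned duration, for example with `usleep((int) round($wait * 1000000))`, before sending the request.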
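Log each URI that reaches shouldRequest() while debugging:
$escargot->addSubscriber(new class implements SubscriberInterface {
    public function shouldRequest($crawlUri, $currentDecision) {
        Log::debug("Should request {$crawlUri}? Decision: {$currentDecision}");
        return self::DECISION_ABSTAIN; // Only observe; leave the decision to other subscribers
    }
    // ... other methods
});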