Escargot Laravel Package

terminal42/escargot


Getting Started

Minimal Setup

  1. Install the package:

    composer require terminal42/escargot
    
  2. Define base URIs and queue:

    use Terminal42\Escargot\Escargot;
    use Terminal42\Escargot\BaseUriCollection;
    use Terminal42\Escargot\Queue\InMemoryQueue;
    use Nyholm\Psr7\Uri;
    
    $baseUris = new BaseUriCollection();
    $baseUris->add(new Uri('https://example.com'));
    $queue = new InMemoryQueue();
    
    $escargot = Escargot::create($baseUris, $queue);
    
  3. Add subscribers (e.g., for HTML crawling):

    $escargot->addSubscriber(new \Terminal42\Escargot\Subscriber\RobotsSubscriber());
    $escargot->addSubscriber(new \Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber());
    $escargot->addSubscriber(new class implements \Terminal42\Escargot\Subscriber\SubscriberInterface {
        public function shouldRequest($crawlUri, $currentDecision) {
            return self::DECISION_POSITIVE; // Always crawl
        }
        public function needsContent($crawlUri, $response, $chunk, $currentDecision) {
            return self::DECISION_POSITIVE; // Always load content
        }
        public function onLastChunk($crawlUri, $response, $chunk) {
            // Process response (e.g., save to DB)
        }
    });
    
  4. Start crawling:

    $escargot->crawl();
    

First Use Case: Simple HTML Crawler

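use Nyholm\Psr7\Uri;
use Terminal42\Escargot\BaseUriCollection;
use Terminal42\Escargot\Escargot;
use Terminal42\Escargot\Queue\InMemoryQueue;
use Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber;
use Terminal42\Escargot\Subscriber\RobotsSubscriber;
use Terminal42\Escargot\Subscriber\SubscriberInterface;

// Initialize with a single URL
$escargot = Escargot::create(
    (new BaseUriCollection())->add(new Uri('https://example.com')),
    new InMemoryQueue()
);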

// Add built-in subscribers for robots.txt and HTML parsing
$escargot->addSubscriber(new RobotsSubscriber());
$escargot->addSubscriber(new HtmlCrawlerSubscriber());

// Add a subscriber to process responses
$escargot->addSubscriber(new class implements SubscriberInterface {
    public function shouldRequest($crawlUri, $currentDecision) {
        return self::DECISION_POSITIVE; // Crawl everything
    }
    public function needsContent($crawlUri, $response, $chunk, $currentDecision) {
        return self::DECISION_POSITIVE; // Load full content
    }
    public function onLastChunk($crawlUri, $response, $chunk) {
        echo "Crawled: " . $crawlUri->getUri() . "\n";
        // Extract data (e.g., with Symfony's DomCrawler)
    }
});

// Run the crawler
$escargot->crawl();

Implementation Patterns

1. Queue Management

  • For CLI/Testing: Use InMemoryQueue (ephemeral, fast).
    $queue = new InMemoryQueue();
    
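  • For Production: Use DoctrineQueue (persistent, database-backed). It takes a Doctrine DBAL connection (not a raw PDO handle) and a closure that generates job IDs.
    $queue = new DoctrineQueue($connection, fn () => bin2hex(random_bytes(16)));
    
  • Hybrid Approach: Use LazyQueue to work in memory and copy the result to the database only when needed.
    $queue = new LazyQueue(new InMemoryQueue(), new DoctrineQueue($connection, $jobIdGenerator));
    $escargot = Escargot::create($baseUris, $queue);
    $escargot->crawl(); // Work in-memory first
    $queue->commit($escargot->getJobId()); // Persist to DB later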
    

2. Subscriber Workflows

Filtering Requests

Use shouldRequest() to control which URLs are crawled:

$escargot->addSubscriber(new class implements SubscriberInterface {
    public function shouldRequest($crawlUri, $currentDecision) {
        return $crawlUri->getUri()->getHost() === 'example.com'
            ? self::DECISION_POSITIVE
            : self::DECISION_NEGATIVE;
    }
    // ... other methods
});
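
An exact host comparison rejects subdomains such as blog.example.com. As a minimal sketch, the check can be moved into a plain PHP function that is testable without the crawler. `isAllowedHost()` and its parameters are hypothetical names, not part of Escargot, and `str_ends_with()` assumes PHP 8.

```php
<?php

/**
 * Hypothetical helper (not part of Escargot): returns true when the URL's
 * host equals an allowed host or is a subdomain of one.
 *
 * @param string[] $allowedHosts
 */
function isAllowedHost(string $url, array $allowedHosts): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // Relative or malformed URL: no host to compare
    }
    $host = strtolower($host);

    foreach ($allowedHosts as $allowed) {
        $allowed = strtolower($allowed);
        // The leading dot keeps "notexample.com" from matching "example.com"
        if ($host === $allowed || str_ends_with($host, '.' . $allowed)) {
            return true;
        }
    }

    return false;
}
```

Inside shouldRequest() this could be called as isAllowedHost((string) $crawlUri->getUri(), ['example.com']).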

Conditional Content Loading

Use needsContent() to avoid loading large responses:

public function needsContent($crawlUri, $response, $chunk, $currentDecision) {
    return $response->getStatusCode() === 200
        ? self::DECISION_POSITIVE
        : self::DECISION_NEGATIVE;
}
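
A status check alone still loads every successful response, whatever its type or size. The sketch below also looks at the Content-Type and Content-Length headers. `wantsBody()` and the 2 MB default are assumptions for illustration, not part of Escargot. The header array follows the shape returned by Symfony's `ResponseInterface::getHeaders()`: lowercase names mapped to lists of values.

```php
<?php

/**
 * Hypothetical helper (not part of Escargot): decides from the status code
 * and response headers whether a body is worth downloading.
 *
 * @param array<string, string[]> $headers Lowercase header names mapped to value lists
 */
function wantsBody(int $statusCode, array $headers, int $maxBytes = 2000000): bool
{
    if ($statusCode !== 200) {
        return false;
    }

    // Only HTML is interesting in this sketch
    $contentType = strtolower($headers['content-type'][0] ?? '');
    if (!str_starts_with($contentType, 'text/html')) {
        return false;
    }

    // Skip bodies that announce a size above the limit; a missing
    // Content-Length header is treated as acceptable
    $length = $headers['content-length'][0] ?? null;
    if ($length !== null && ctype_digit((string) $length) && (int) $length > $maxBytes) {
        return false;
    }

    return true;
}
```

In needsContent() the helper could be fed with $response->getStatusCode() and $response->getHeaders(false); passing false keeps getHeaders() from throwing on error responses.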

Tag-Based Routing

Add metadata to URIs via tags (e.g., skip nofollow links):

$escargot->addSubscriber(new class implements SubscriberInterface {
    public function shouldRequest($crawlUri, $currentDecision) {
        return !$crawlUri->hasTag('nofollow')
            ? self::DECISION_POSITIVE
            : self::DECISION_NEGATIVE;
    }
    // ...
});

3. Integration with Laravel

Service Provider

Register Escargot as a singleton in AppServiceProvider:

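public function register()
{
    $this->app->singleton(Escargot::class, function ($app) {
        // DoctrineQueue needs a Doctrine DBAL connection, not a PDO handle.
        // This assumes a Doctrine\DBAL\Connection binding is registered in the container.
        $queue = new DoctrineQueue(
            $app->make(\Doctrine\DBAL\Connection::class),
            fn () => (string) \Illuminate\Support\Str::uuid() // Job ID generator
        );
        return Escargot::create(
            (new BaseUriCollection())->add(new Uri(config('escargot.start_url'))),
            $queue
        );
    });
}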

Artisan Command

Create a crawler command:

php artisan make:command CrawlWebsites

// In app/Console/Commands/CrawlWebsites.php
public function handle()
{
    $escargot = app(Escargot::class);
    $escargot->addSubscriber(new MySubscriber());
    $escargot->crawl();
}

Job Resumption

Resume a paused crawl:

$escargot = Escargot::createFromJobId($jobId, $queue);
$escargot->crawl(); // Continues from where it left off
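
Resuming requires the job ID to survive between process runs. One possible approach, sketched in plain PHP, is to keep it in a small file. The function names and the file-based storage are assumptions for illustration; a cache entry or database row would work just as well.

```php
<?php

/**
 * Hypothetical helpers (not part of Escargot) that keep the ID of the
 * current crawl job in a small file between process runs.
 */
function storeJobId(string $file, string $jobId): void
{
    file_put_contents($file, $jobId, LOCK_EX);
}

function loadJobId(string $file): ?string
{
    if (!is_file($file)) {
        return null; // No crawl in progress
    }
    $jobId = trim((string) file_get_contents($file));

    return $jobId === '' ? null : $jobId;
}

function clearJobId(string $file): void
{
    if (is_file($file)) {
        unlink($file);
    }
}
```

A command could call loadJobId() first, use Escargot::createFromJobId() when an ID is returned and Escargot::create() otherwise, and store $escargot->getJobId() right after creating a new job.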

4. Handling Exceptions

Transport Errors (Timeouts, DNS)

$escargot->addSubscriber(new class implements ExceptionSubscriberInterface {
    public function onTransportException($crawlUri, $exception, $response) {
        Log::error("Failed to fetch {$crawlUri}: " . $exception->getMessage());
    }
    // ... onHttpException() and the other interface methods
});

HTTP Errors (4xx/5xx)

$escargot->addSubscriber(new class implements ExceptionSubscriberInterface {
    public function onHttpException($crawlUri, $exception, $response, $chunk) {
        if ($response->getStatusCode() === 404) {
            Log::debug("Page not found: {$crawlUri}");
        }
    }
    // ... onTransportException() and the other interface methods
});
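
Not every HTTP error deserves the same attention in the logs. As a rough sketch, a status code can be mapped to a PSR-3 log level name. `logLevelForStatus()` and the chosen thresholds are assumptions for illustration, not part of Escargot.

```php
<?php

/**
 * Hypothetical helper (not part of Escargot): maps an HTTP status code to
 * a PSR-3 log level name so crawl errors are logged with suitable severity.
 */
function logLevelForStatus(int $statusCode): string
{
    if ($statusCode === 404 || $statusCode === 410) {
        return 'debug'; // Missing pages are expected during a crawl
    }
    if ($statusCode === 429 || $statusCode >= 500) {
        return 'error'; // The server is overloaded or failing
    }
    if ($statusCode >= 400) {
        return 'warning'; // Other client errors (401, 403, ...)
    }

    return 'info';
}
```

In onHttpException() this could be used as Log::log(logLevelForStatus($response->getStatusCode()), "HTTP error for {$crawlUri}").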

5. Lazy-Loaded Tags

Store dynamic data (e.g., API responses) without bloating the queue:

// Subscriber 1: Adds a tag
$escargot->addSubscriber(new class implements SubscriberInterface {
    public function onLastChunk($crawlUri, $response, $chunk) {
        $crawlUri->addTag('api-data', 'lazy');
    }
});

// Subscriber 2: Resolves the tag
$escargot->addSubscriber(new class implements TagValueResolvingSubscriberInterface {
    public function resolveTagValue($tag, $crawlUri) {
        if ($tag === 'api-data') {
            return $this->fetchExternalData($crawlUri->getUri());
        }
    }
});
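
Resolving a value lazily pays off most when the lookup runs only once. The sketch below shows a small memoizing wrapper in plain PHP (PHP 8 syntax). `LazyValueCache` is a hypothetical name, not part of Escargot.

```php
<?php

/**
 * Hypothetical helper (not part of Escargot): resolves a value on first
 * access and reuses it afterwards, so an expensive lookup runs at most
 * once per key.
 */
final class LazyValueCache
{
    /** @var array<string, mixed> */
    private array $values = [];

    /** @var callable(string): mixed */
    private $resolver;

    /**
     * @param callable(string): mixed $resolver Called once for each unknown key
     */
    public function __construct(callable $resolver)
    {
        $this->resolver = $resolver;
    }

    public function get(string $key): mixed
    {
        // array_key_exists() also caches keys that resolved to null
        if (!array_key_exists($key, $this->values)) {
            $this->values[$key] = ($this->resolver)($key);
        }

        return $this->values[$key];
    }
}
```

A resolving subscriber could hold one such cache and wrap its external lookup in the resolver callback, so repeated requests for the same tag do not trigger repeated fetches.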

Gotchas and Tips

1. Queue Persistence Pitfalls

  • InMemoryQueue: Data is lost on process exit. Use only for testing.
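  • DoctrineQueue: Requires a Doctrine DBAL connection (Doctrine\DBAL\Connection) and a job ID generator closure; a raw PDO handle from $app['db']->connection()->getPdo() is not accepted:
    $queue = new DoctrineQueue($connection, $jobIdGenerator);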
    
  • LazyQueue: Call $queue->commit($jobId) explicitly to persist data.

2. Performance Tips

  • Concurrency: Escargot runs on Symfony’s HttpClient; set the number of parallel requests with withConcurrency() (max_concurrency is not a Symfony HttpClient option):
    $escargot = Escargot::create($baseUris, $queue)->withConcurrency(10);
    
  • Rate Limiting: Implement a subscriber to throttle requests. Pause until the interval has elapsed; returning DECISION_NEGATIVE here would skip the URL rather than delay it:
    $escargot->addSubscriber(new class implements SubscriberInterface {
        private $lastRequestTime = 0.0;
        public function shouldRequest($crawlUri, $currentDecision) {
            $wait = 1.0 - (microtime(true) - $this->lastRequestTime); // 1 request per second
            if ($wait > 0) {
                usleep((int) ($wait * 1000000)); // Delay instead of rejecting the URL
            }
            $this->lastRequestTime = microtime(true);
            return self::DECISION_POSITIVE;
        }
        // ...
    });
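
Throttling means delaying requests, not dropping them. The timing arithmetic can be isolated in a small class that takes the current time as an argument, which makes it testable without sleeping. `RequestPacer` is a hypothetical name, not part of Escargot, and the sketch uses PHP 8 constructor promotion.

```php
<?php

/**
 * Hypothetical helper (not part of Escargot): computes how long to wait so
 * that consecutive requests are at least $minIntervalMs milliseconds apart.
 */
final class RequestPacer
{
    private ?float $lastRequestAt = null;

    public function __construct(private int $minIntervalMs)
    {
    }

    /**
     * Returns the number of milliseconds to pause before the next request
     * and records the time at which that request will be sent.
     *
     * @param float $nowMs Current time in milliseconds
     */
    public function delayFor(float $nowMs): int
    {
        if ($this->lastRequestAt === null) {
            $this->lastRequestAt = $nowMs;

            return 0; // First request goes out immediately
        }

        $earliest = $this->lastRequestAt + $this->minIntervalMs;
        $delay = (int) max(0, ceil($earliest - $nowMs));
        $this->lastRequestAt = $nowMs + $delay;

        return $delay;
    }
}
```

A caller would pause with usleep($pacer->delayFor(microtime(true) * 1000) * 1000) before issuing each request.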
    

3. Debugging

  • Log Subscriber Decisions: Add a debug subscriber:
    $escargot->addSubscriber(new class implements SubscriberInterface {
        public function shouldRequest($crawlUri, $currentDecision) {
            Log::debug("Should request {$crawlUri}? Decision: {$currentDecision}");
            return self::DECISION_ABST
    