Php Crawler Laravel Package

acassan/php-crawler

Symfony 2 bundle integrating the PHPCrawler library to help you crawl and fetch web pages within your Symfony application. Provides a simple way to run crawling tasks and process discovered URLs and content.

Getting Started

Minimal Setup

  1. Installation: Add the bundle via Composer. The package is outdated and may require adjustments:

    composer require acassan/php-crawler
    

    Register the bundle in config/bundles.php (Symfony 4+; Symfony 2 and 3 projects register bundles in app/AppKernel.php instead):

    return [
        // ...
        Acassan\PHPCrawlerBundle\AcassanPHPCrawlerBundle::class => ['all' => true],
    ];
    
  2. First Use Case: Configure a basic crawler in config/packages/acassan_php_crawler.yaml:

    acassan_php_crawler:
        crawlers:
            default:
                start_urls: ['https://example.com']
                rules:
                    - '.*example\.com.*'
    

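    Trigger the crawler via a console command (if the bundle provides one) or fetch it from the container, for example in a controller:

    use Acassan\PHPCrawlerBundle\Crawler\Crawler;
    
    // Inside a container-aware class such as a Symfony controller.
    // The service ID is assumed to follow the crawler name configured above.
    $crawler = $this->get('acassan_php_crawler.crawler.default');
    $crawler->crawl();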
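
To see what a rule such as '.*example\.com.*' would accept, the matching can be tried in plain PHP. This is a minimal sketch that assumes each rule is a regular expression tested against the full URL; the helper urlMatchesRules is hypothetical and not part of the bundle.

```php
<?php
// Hypothetical helper: returns true when $url matches at least one rule.
// Each rule is a bare regex body, as in the YAML above. '#' is used as the
// delimiter so slashes in URLs need no escaping (rules must not contain '#').
function urlMatchesRules(string $url, array $rules): bool
{
    foreach ($rules as $rule) {
        if (preg_match('#' . $rule . '#i', $url) === 1) {
            return true;
        }
    }
    return false;
}

$loose = ['.*example\.com.*'];
$strict = ['^https?://([^/]+\.)?example\.com(/|$)'];

var_dump(urlMatchesRules('https://example.com/about', $loose));           // bool(true)
var_dump(urlMatchesRules('https://example.org/about', $loose));           // bool(false)
var_dump(urlMatchesRules('https://evil.test/?ref=example.com', $loose));  // bool(true)
var_dump(urlMatchesRules('https://evil.test/?ref=example.com', $strict)); // bool(false)
```

The third call shows that the unanchored pattern also accepts foreign hosts that merely mention example.com, so an anchored pattern like the one in $strict may be the safer choice.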
    

Implementation Patterns

Workflow Integration

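  1. Scheduled Crawling: Run crawlers periodically with a cron job or a third-party cron bundle on Symfony, or with Laravel’s task scheduler:

    // In Laravel's app/Console/Kernel.php
    // Assumes a phpcrawler:run console command is registered; verify the name in your install.
    protected function schedule(Schedule $schedule)
    {
        $schedule->command('phpcrawler:run default')->daily();
    }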
    
  2. Data Processing: Extend the crawler to process scraped data:

    use Acassan\PHPCrawlerBundle\Event\CrawlEvent;
    
    $eventDispatcher->addListener(CrawlEvent::EVENT_CRAWL, function (CrawlEvent $event) {
        foreach ($event->getPages() as $page) {
            // Process $page->getContent() or $page->getUrl()
        }
    });
    
  3. Dynamic URL Handling: Use callbacks for dynamic URL generation:

    rules:
        - '.*example\.com.*'
        - callback: 'app.callback_for_dynamic_urls'
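
What a callback rule might look like can be sketched as an invokable class. Everything here is an assumption for illustration: the class name DynamicUrlCallback, the __invoke signature, and the idea that the callback receives a URL string and returns whether to follow it. Check the bundle's source for the contract it actually expects.

```php
<?php
// Hypothetical URL filter: the class name, the __invoke signature, and the
// assumption that it is called with a URL string are not taken from the bundle.
final class DynamicUrlCallback
{
    private int $maxDepth;

    public function __construct(int $maxDepth = 2)
    {
        $this->maxDepth = $maxDepth;
    }

    // Returns true when the URL should be followed.
    public function __invoke(string $url): bool
    {
        $parts = parse_url($url);
        if ($parts === false || isset($parts['query'])) {
            return false; // skip malformed URLs and anything with a query string
        }
        // Count path segments: "/blog/post-1" has depth 2, "/" has depth 0.
        $path = trim($parts['path'] ?? '', '/');
        $depth = $path === '' ? 0 : substr_count($path, '/') + 1;
        return $depth <= $this->maxDepth;
    }
}

$accept = new DynamicUrlCallback(2);
var_dump($accept('https://example.com/blog/post-1'));  // bool(true)
var_dump($accept('https://example.com/a/b/c'));        // bool(false)
var_dump($accept('https://example.com/search?q=php')); // bool(false)
```

Keeping the decision in a small class like this makes it easy to unit test independently of the crawler.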
    

Laravel-Specific Tips

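  • Service Binding: Bind the crawler under a short alias in AppServiceProvider. This only works if the crawler is already resolvable from Laravel’s container under the Symfony-style ID; a Symfony bundle does not register its services there on its own: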

    $this->app->bind('phpcrawler', function () {
        return $this->app->make('acassan_php_crawler.crawler.default');
    });
    
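  • Queue Jobs: Dispatch crawling as a queued job for long-running tasks. CrawlJob here stands for a job class you define in your application:

    use Illuminate\Support\Facades\Bus;
    
    // CrawlJob: your own queueable job that runs the named crawler.
    Bus::dispatch(new CrawlJob('default'));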
    

Gotchas and Tips

Pitfalls

  1. Outdated Package

    • The bundle targets Symfony 2 and is unmaintained, so expect compatibility issues on Symfony 4+.
    • Workaround: Fork the repo and update dependencies (e.g., symfony/dependency-injection, symfony/http-kernel).
  2. No Built-in Storage

    • Crawled data isn’t persisted by default. Store it yourself, in a database or on the filesystem. The example assumes an Eloquent model named Page:

    $eventDispatcher->addListener(CrawlEvent::EVENT_CRAWL, function (CrawlEvent $event) {
        foreach ($event->getPages() as $page) {
            Page::create([
                'url' => $page->getUrl(),
                'content' => $page->getContent(),
            ]);
        }
    });
    
  3. Rate Limiting

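    • No built-in throttling. Implement middleware, or, if your requests go through GuzzleHttp\Client, use its delay request option to pause before each request:

    use GuzzleHttp\Client;

    $client = new Client([
        'timeout' => 10,   // seconds
        'delay' => 1000,   // milliseconds to wait before sending each request
    ]);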
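
When the HTTP layer offers no delay option, the pause between requests can be enforced by hand. The following is a minimal sketch of a fixed-interval throttle; RequestThrottle is a hypothetical helper, not part of the bundle or of Guzzle, and the 0.2-second interval is only there to keep the demo short.

```php
<?php
// Hypothetical helper, not part of the bundle: enforces a minimum interval
// between consecutive calls to wait().
final class RequestThrottle
{
    private float $minIntervalSeconds;
    private ?float $lastRequestAt = null;

    public function __construct(float $minIntervalSeconds)
    {
        $this->minIntervalSeconds = $minIntervalSeconds;
    }

    // Seconds still to wait at time $now before the next request may start.
    public function remaining(float $now): float
    {
        if ($this->lastRequestAt === null) {
            return 0.0; // first request never waits
        }
        return max(0.0, $this->minIntervalSeconds - ($now - $this->lastRequestAt));
    }

    // Sleeps if needed, then records the start time of the request.
    public function wait(): void
    {
        $remaining = $this->remaining(microtime(true));
        if ($remaining > 0) {
            usleep((int) round($remaining * 1000000));
        }
        $this->lastRequestAt = microtime(true);
    }
}

$throttle = new RequestThrottle(0.2);
$start = microtime(true);
foreach (['https://example.com/a', 'https://example.com/b', 'https://example.com/c'] as $url) {
    $throttle->wait();
    // The actual HTTP request for $url would go here.
}
$elapsed = microtime(true) - $start;
printf("elapsed: %.2fs\n", $elapsed); // three requests, so roughly two intervals
```

A fixed interval is the simplest policy; honoring a site's robots.txt crawl delay or backing off on 429 responses would need additional logic.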
    

Debugging

  • Log Crawled Pages: Enable Symfony’s profiler or Laravel’s logging. On Symfony, a debug-level Monolog handler looks like this:

    # config/packages/monolog.yaml
    monolog:
        handlers:
            main:
                type: stream
                path: "%kernel.logs_dir%/%kernel.environment%.log"
                level: debug
    
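  • Check HTTP Status Codes: Filter out the failed requests and keep the result so it can be logged or retried:

    $eventDispatcher->addListener(CrawlEvent::EVENT_CRAWL, function (CrawlEvent $event) {
        // Assumes getPages() returns a collection that supports filter().
        $failed = $event->getPages()->filter(function ($page) {
            return $page->getStatusCode() !== 200;
        });
        // Log or retry $failed here.
    });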
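
The status-code check can be tried without the bundle by using plain arrays in place of page objects. In this sketch the 'url' and 'status' keys are illustrative, not the bundle's data model, and the filter treats everything outside the 2xx range as failed rather than comparing against 200 only.

```php
<?php
// Plain arrays stand in for page objects; the 'url' and 'status' keys are
// illustrative and not taken from the bundle.
$pages = [
    ['url' => 'https://example.com/',        'status' => 200],
    ['url' => 'https://example.com/missing', 'status' => 404],
    ['url' => 'https://example.com/old',     'status' => 301],
    ['url' => 'https://example.com/error',   'status' => 500],
];

// Keep everything outside the 2xx range; array_values() reindexes the result.
$failed = array_values(array_filter($pages, function (array $page): bool {
    return $page['status'] < 200 || $page['status'] >= 300;
}));

foreach ($failed as $page) {
    printf("%d %s\n", $page['status'], $page['url']);
}
// Prints:
// 404 https://example.com/missing
// 301 https://example.com/old
// 500 https://example.com/error
```

Whether redirects such as 301 count as failures depends on whether the HTTP client follows them before reporting a status.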
    

Extension Points

  1. Custom Crawler Classes: Extend Acassan\PHPCrawlerBundle\Crawler\AbstractCrawler for bespoke logic:

    class CustomCrawler extends AbstractCrawler {
        protected function processPage(Page $page) {
            // Custom logic
        }
    }
    
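  2. Middleware Integration: Modify outgoing requests, for example by setting default headers on the HTTP client. Symfony’s HttpClient and Laravel’s HTTP client offer equivalent options; the example uses GuzzleHttp\Client:

    use GuzzleHttp\Client;

    $client = new Client([
        'headers' => [
            'User-Agent' => 'MyCrawler/1.0',
        ],
    ]);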
    
  3. Event Subscribers: Listen for CrawlEvent to inject logic:

    use Symfony\Component\EventDispatcher\EventSubscriberInterface;

    class CrawlSubscriber implements EventSubscriberInterface {
        public static function getSubscribedEvents() {
            return [
                CrawlEvent::EVENT_CRAWL => 'onCrawl',
            ];
        }
    
        public function onCrawl(CrawlEvent $event) {
            // Modify $event->getPages()
        }
    }
    