acassan/php-crawler
Symfony 2 bundle integrating the PHPCrawler library to help you crawl and fetch web pages within your Symfony application. Provides a simple way to run crawling tasks and process discovered URLs and content.
Installation
Add the bundle via Composer (note that the package is outdated and may require adjustments):
composer require acassan/php-crawler
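Register the bundle. On Symfony 4 and later this is done in config/bundles.php; on Symfony 2 and 3, add it to registerBundles() in app/AppKernel.php instead:
return [
    // ...
    Acassan\PHPCrawlerBundle\AcassanPHPCrawlerBundle::class => ['all' => true],
];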
First Use Case
Configure a basic crawler in config/packages/acassan_php_crawler.yaml (or app/config/config.yml on Symfony 2):
acassan_php_crawler:
    crawlers:
        default:
            start_urls: ['https://example.com']
            rules:
                - '.*example\.com.*'
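To make the effect of a rule such as '.*example\.com.*' concrete, here is a minimal sketch of regex-based URL filtering in plain PHP. The function urlMatchesAnyRule is hypothetical and not part of the bundle; it only assumes that each rule is a regular expression body tested against the full URL.

```php
<?php
// Hypothetical helper, not part of the bundle: decides whether a URL
// matches at least one regex rule given without delimiters.

/**
 * @param string   $url   Candidate URL discovered during a crawl.
 * @param string[] $rules Regex bodies without delimiters.
 */
function urlMatchesAnyRule(string $url, array $rules): bool
{
    foreach ($rules as $rule) {
        // '~' is the delimiter so slashes in URLs need no escaping;
        // literal '~' characters inside the rule are escaped.
        $pattern = '~' . str_replace('~', '\~', $rule) . '~i';
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }

    return false;
}

$rules = ['.*example\.com.*'];
var_dump(urlMatchesAnyRule('https://example.com/blog', $rules)); // bool(true)
var_dump(urlMatchesAnyRule('https://other.test/', $rules));      // bool(false)
```

Because the rule is not anchored, it also matches any URL that merely contains example.com, for instance in a query string. A stricter pattern that anchors on the scheme and host may be preferable.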
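Trigger the crawler via a console command (if the bundle provides one) or fetch the service from the container, for example in a controller:
use Acassan\PHPCrawlerBundle\Crawler\Crawler;

// Inside a container-aware controller action.
$crawler = $this->get('acassan_php_crawler.crawler.default');
$crawler->crawl();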
Scheduled Crawling
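Run crawlers periodically with a system cron job, a third-party Symfony cron bundle, or, in a Laravel application, Laravel’s task scheduler:
// In Laravel's app/Console/Kernel.php
// Assumes a phpcrawler:run console command is available.
protected function schedule(Schedule $schedule)
{
    $schedule->command('phpcrawler:run default')->daily();
}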
Data Processing
Extend the crawler to process scraped data:
use Acassan\PHPCrawlerBundle\Event\CrawlEvent;

$eventDispatcher->addListener(CrawlEvent::EVENT_CRAWL, function (CrawlEvent $event) {
    foreach ($event->getPages() as $page) {
        // Process $page->getContent() or $page->getUrl()
    }
});
Dynamic URL Handling
Use callbacks for dynamic URL generation:
rules:
    - '.*example\.com.*'
    - callback: 'app.callback_for_dynamic_urls'
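What such a callback service might compute can be sketched in plain PHP. The function paginatedUrls below is hypothetical, and the signature the bundle actually expects from a callback is an assumption; the sketch only illustrates generating a series of URLs at runtime instead of listing them statically.

```php
<?php
// Hypothetical callback logic; the bundle's real callback signature
// is not documented here, so this is an illustration only.

/**
 * Builds one URL per page number in the inclusive range.
 *
 * @return string[]
 */
function paginatedUrls(string $baseUrl, int $firstPage, int $lastPage): array
{
    $urls = [];
    for ($page = $firstPage; $page <= $lastPage; $page++) {
        // http_build_query takes care of URL-encoding the parameter.
        $urls[] = $baseUrl . '?' . http_build_query(['page' => $page]);
    }

    return $urls;
}

print_r(paginatedUrls('https://example.com/articles', 1, 3));
```

An empty array is returned when $lastPage is smaller than $firstPage, so callers do not need a separate guard.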
Service Binding
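In a Laravel application, bind the crawler service in AppServiceProvider (this assumes the crawler is already registered in the Laravel container under the ID shown):
$this->app->bind('phpcrawler', function () {
    return $this->app->make('acassan_php_crawler.crawler.default');
});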
Queue Jobs
Dispatch crawling as a queued job for long-running tasks (CrawlJob is a job class you define yourself):
use Illuminate\Support\Facades\Bus;

Bus::dispatch(new CrawlJob('default'));
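Outdated Package
The package is outdated; check that its requirements are compatible with your installed Symfony components (e.g. symfony/dependency-injection, symfony/http-kernel).
No Built-in Storage
Persist crawled pages yourself, for example in an event listener:
$eventDispatcher->addListener(CrawlEvent::EVENT_CRAWL, function (CrawlEvent $event) {
    foreach ($event->getPages() as $page) {
        Page::create([
            'url' => $page->getUrl(),
            'content' => $page->getContent(),
        ]);
    }
});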
Rate Limiting
Throttle requests, for example with GuzzleHttp\Client and its delay option:
use GuzzleHttp\Client;

$client = new Client([
    'timeout' => 10,  // seconds
    'delay' => 1000,  // wait 1000 ms before sending each request
]);
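The same idea, spacing requests by a minimum interval, can be sketched without any HTTP library. The helper microsecondsToWait is hypothetical and independent of both Guzzle and the bundle; it computes how long to pause given the time of the previous request.

```php
<?php
// Hypothetical helper, independent of Guzzle and of the bundle.

/**
 * Returns how many microseconds to wait so that two consecutive requests
 * are at least $minIntervalMs milliseconds apart.
 *
 * @param float $lastRequestAt Time of the previous request, in seconds.
 * @param float $now           Current time, in seconds.
 */
function microsecondsToWait(float $lastRequestAt, float $now, int $minIntervalMs): int
{
    $elapsedMs = ($now - $lastRequestAt) * 1000.0;
    $remainingMs = $minIntervalMs - $elapsedMs;

    return $remainingMs > 0 ? (int) round($remainingMs * 1000) : 0;
}

// 250 ms have passed since the last request and the minimum gap is
// 1000 ms, so 750 ms (750000 microseconds) remain.
var_dump(microsecondsToWait(10.0, 10.25, 1000)); // int(750000)
```

In a crawl loop, the result would be passed to usleep() before each request, with microtime(true) supplying both timestamps.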
Log Crawled Pages
Enable Symfony’s profiler or Laravel’s logging. In Symfony, a Monolog handler at debug level can be configured like this:
# config/packages/monolog.yaml
monolog:
    handlers:
        main:
            type: stream
            path: "%kernel.logs_dir%/%kernel.environment%.log"
            level: debug
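Check HTTP Status Codes
Collect failed requests by filtering on the status code:
$eventDispatcher->addListener(CrawlEvent::EVENT_CRAWL, function (CrawlEvent $event) {
    // Keep only the pages whose status code is not 200.
    $failed = $event->getPages()->filter(function ($page) {
        return $page->getStatusCode() !== 200;
    });
    // Log or retry $failed here.
});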
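The filtering step can be tried in isolation with plain arrays standing in for the bundle's page objects. Both the array shape and the function splitByStatus are assumptions made for illustration; the sketch follows the listener above in treating any status other than 200 as a failure.

```php
<?php
// Plain arrays stand in for the bundle's page objects (an assumption).

/**
 * Splits pages into those with status 200 and all others.
 *
 * @param array<int, array{url: string, status: int}> $pages
 * @return array{ok: array, failed: array}
 */
function splitByStatus(array $pages): array
{
    $isOk = static fn (array $page): bool => $page['status'] === 200;

    return [
        // array_values() reindexes the results from zero.
        'ok' => array_values(array_filter($pages, $isOk)),
        'failed' => array_values(array_filter(
            $pages,
            static fn (array $page): bool => !$isOk($page)
        )),
    ];
}

$result = splitByStatus([
    ['url' => 'https://example.com/', 'status' => 200],
    ['url' => 'https://example.com/missing', 'status' => 404],
]);
print_r(array_column($result['failed'], 'url'));
```

Treating only 200 as success is strict: other 2xx responses and followed redirects would land in the failed list, so the predicate may need widening depending on the site.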
Custom Crawler Classes
Extend Acassan\PHPCrawlerBundle\Crawler\AbstractCrawler for bespoke logic:
class CustomCrawler extends AbstractCrawler
{
    protected function processPage(Page $page)
    {
        // Custom logic
    }
}
Middleware Integration
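Use HTTP client options or middleware to modify outgoing requests. The example below sets a default User-Agent header on a Guzzle client; Symfony’s HttpClient and Laravel’s HTTP client offer comparable header options:
use GuzzleHttp\Client;

$client = new Client([
    'headers' => [
        'User-Agent' => 'MyCrawler/1.0',
    ],
]);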
Event Subscribers
Listen for CrawlEvent to inject logic:
use Symfony\Component\EventDispatcher\EventSubscriberInterface;

class CrawlSubscriber implements EventSubscriberInterface
{
    public static function getSubscribedEvents()
    {
        return [
            CrawlEvent::EVENT_CRAWL => 'onCrawl',
        ];
    }

    public function onCrawl(CrawlEvent $event)
    {
        // Modify $event->getPages()
    }
}