andrew-svirin/resource-crawler-bundle
composer require andrew-svirin/resource-crawler-bundle:dev-main
doctrine.yaml (to exclude bundle tables from ORM):
doctrine:
    dbal:
        schema_filter: ~^(?!resource_crawler_)~
resource_crawler.yaml (choose storage backend):
resource_crawler:
    process:
        is_lockable: true
        store: 'resource_crawler.process_db_store' # or 'resource_crawler.process_file_store'
        file_store:
            dir: "%kernel.project_dir%/storage/saver"
php bin/console doctrine:migrations:migrate
$crawler = $this->get('resource_crawler.crawler');
$task = $crawler->crawlWebResource(
    'https://example.com',
    ['+example.com/'], // Include paths
    [],                // Substitution rules
    null               // Ref handler (optional)
);
Use crawlWebResource() to start crawling a URL with path filters.
Monitor progress with analyzeCrawlingWebResource().
Restrict the crawl with path masks (e.g., +example.com/blog/*).
Task Creation:
$task = $crawler->crawlWebResource(
    $url,
    $pathMasks,         // e.g., ['+example.com/', '-*.pdf']
    $substitutionRules, // Regex to clean URLs
    $refHandler         // Custom logic per link
);
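To make the + and - path masks passed to crawlWebResource() above more concrete, here is a minimal sketch of how such masks could be evaluated. It is not the bundle's matcher: the function name isUrlAllowed() is hypothetical, and the semantics (shell-style globs, masks without a wildcard treated as prefixes, any matching exclude mask overriding includes) are assumptions made for illustration only.

```php
<?php
// Hypothetical matcher, NOT the bundle's implementation.
// Assumptions: masks are shell-style globs prefixed with "+" or "-",
// a mask without "*" is treated as a prefix, and any matching
// exclude mask wins over include masks.
function isUrlAllowed(string $hostAndPath, array $masks): bool
{
    $included = false;

    foreach ($masks as $mask) {
        $sign = $mask[0];
        $glob = substr($mask, 1);

        // Treat "+example.com/" as "+example.com/*".
        if (strpos($glob, '*') === false) {
            $glob .= '*';
        }

        if (!fnmatch($glob, $hostAndPath)) {
            continue;
        }

        if ($sign === '-') {
            return false; // an exclude mask matched
        }
        if ($sign === '+') {
            $included = true;
        }
    }

    return $included;
}

$masks = ['+example.com/', '-*.pdf'];

var_dump(isUrlAllowed('example.com/blog/post', $masks)); // bool(true)
var_dump(isUrlAllowed('example.com/file.pdf', $masks));  // bool(false)
var_dump(isUrlAllowed('other.org/page', $masks));        // bool(false)
```

Running a handful of representative URLs through a check like this before a full crawl can surface overlapping masks early; the bundle's real matching rules may differ, so confirm against its own behaviour.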
Path masks: + to include, - to exclude (e.g., +example.com/blog/* -example.com/blog/archives).
Iterative Crawling:
Crawling proceeds iteratively (see the process settings).
Use analyzeCrawlingWebResource() to inspect progress:
$stats = $crawler->analyzeCrawlingWebResource($url);
// $stats includes counts of processed/ignored/errored nodes.
Handling Results:
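Crawled nodes are stored in the database (resource_crawler_nodes table):
// Assumes an App\Entity\ResourceCrawlerNode entity mapped to resource_crawler_nodes.
$nodes = $entityManager->createQueryBuilder()
    ->select('n.uri_path')
    ->from('App\Entity\ResourceCrawlerNode', 'n')
    ->where('n.status = :status')
    ->setParameter('status', 'processed')
    ->getQuery()
    ->getResult();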
Custom Logic:
Implement RefHandlerClosureInterface for per-link logic:
$handler = new class() implements RefHandlerClosureInterface {
    public function call(Ref $ref, CrawlingTask $task): void {
        if ($ref->getUriPath() === '/admin') {
            $task->ignoreNode($ref); // Skip admin pages
        }
    }
};
Use the delay option to avoid overloading targets.
Consider process_file_store (faster than DB for metadata-heavy workloads).
Locking Issues:
With is_lockable: true, ensure your storage backend (DB/file) supports concurrent access.
Check the resource_crawler_processes table for stuck locks.
Path Mask Overlap:
Overlapping masks (e.g., +example.com/* and -example.com/private) may cause unintended exclusions.
Test masks with analyzeCrawlingWebResource() before full crawls.
Substitution Rules:
Overly broad rules (e.g., s/.*//) may break URLs.
Test rules with preg_replace() first.
Memory Limits:
Consider process_file_store.
HTTPS/HTTP Mixed Content:
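$client = $this->get('http_client');
// 'allow_http' is the option named in the original text; it is not a documented
// Symfony HttpClient option, so verify it against the HTTP client actually in use.
$client = $client->withOptions(['allow_http' => true]);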
resource_crawler events.
Use resetWebResource($url) to clear all data for a domain (useful for retries).
Use rollbackTask($task) to reprocess nodes.
Implement ResourceCrawlerBundle\Store\ProcessStoreInterface for alternative backends (e.g., Redis).
Implement ResourceCrawlerBundle\Crawler\NodeFilterInterface to add logic (e.g., skip non-200 responses).
Listen for resource_crawler.node_processed events for post-crawl actions.
Ensure storage/saver is writable (chmod 755).
schema_filter from doctrine.yaml.
max_concurrency.
Add an index on resource_crawler_nodes for faster queries:
CREATE INDEX idx_nodes_process_uri ON resource_crawler_nodes(process_id, uri_path);
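The earlier advice to try substitution rules with preg_replace() before a crawl can be sketched as a small standalone check. The helper name applySubstitutionRules() is hypothetical, and the rule format (regex pattern => replacement string) is an assumption; the bundle's own rule syntax may differ. The sample rules are deliberately simple and only meant to show the idea.

```php
<?php
// Hypothetical helper, not part of the bundle. Applies each
// pattern => replacement rule in order, then rejects results
// that are no longer valid URLs.
function applySubstitutionRules(string $url, array $rules): string
{
    foreach ($rules as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $url);
        if ($result === null) {
            throw new InvalidArgumentException("Invalid pattern: $pattern");
        }
        $url = $result;
    }

    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        throw new UnexpectedValueException("Rules produced an invalid URL: '$url'");
    }

    return $url;
}

// Assumed rule format: regex pattern => replacement string.
$rules = [
    '~[?&]utm_[a-z]+=[^&#]*~' => '', // drop a tracking parameter
    '~#.*$~'                  => '', // drop the fragment
];

echo applySubstitutionRules('https://example.com/blog/post?utm_source=x#top', $rules), "\n";
// prints "https://example.com/blog/post"

// An overly broad rule empties the URL and is caught by the validity check.
try {
    applySubstitutionRules('https://example.com/', ['~.*~' => '']);
} catch (UnexpectedValueException $e) {
    echo 'Rejected: ', $e->getMessage(), "\n";
}
```

Checking rules against a few known URLs this way catches patterns that strip too much before they affect stored nodes.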