Getting Started

Minimal Setup

Installation:

composer require andrew-svirin/resource-crawler-bundle:dev-main

Configure doctrine.yaml (so Doctrine's schema tools ignore the bundle's tables and generated migrations do not drop them):

doctrine:
    dbal:
        schema_filter: ~^(?!resource_crawler_)~

Configure resource_crawler.yaml (choose storage backend):

resource_crawler:
  process:
    is_lockable: true
    store: 'resource_crawler.process_db_store'  # or 'resource_crawler.process_file_store'
    file_store:
      dir: "%kernel.project_dir%/storage/saver"

Run migrations:

php bin/console doctrine:migrations:migrate

First Crawl:

$crawler = $this->get('resource_crawler.crawler');
$task = $crawler->crawlWebResource(
    'https://example.com',
    ['+example.com/'],  // Include paths
    [],                 // Substitution rules
    null                // Ref handler (optional)
);

First Use Case: Basic Web Crawling

Use crawlWebResource() to start crawling a URL with path filters.
Check results via analyzeCrawlingWebResource().
Example: Crawl a blog to extract all posts (e.g., +example.com/blog/*).

Implementation Patterns

Core Workflow

Task Creation:

$task = $crawler->crawlWebResource(
    $url,
    $pathMasks,       // e.g., ['+example.com/', '-*.pdf']
    $substitutionRules, // Regex to clean URLs
    $refHandler       // Custom logic per link
);

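Path Masks: Prefix a mask with + to include matching URLs and with - to exclude them (e.g., ['+example.com/blog/*', '-example.com/blog/archives']).
Substitution Rules: Rewrite discovered URLs with regular expressions (e.g., strip query parameters so duplicate pages collapse to one URL).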
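
Substitution rules are easiest to get right when tried on sample URLs before a crawl. The sketch below is a minimal, standalone check and not part of the bundle: it assumes each rule is a [pattern, replacement] pair applied in order with preg_replace(), and the applyRules() helper and the sample patterns are hypothetical.

```php
<?php
// Hypothetical rule format: [pattern, replacement] pairs, applied in order.
$rules = [
    ['/#.*$/', ''],   // drop the fragment (everything from "#")
    ['/\?.*$/', ''],  // drop the query string (everything from "?")
];

/**
 * Applies each rule in turn and returns the rewritten URL.
 * Throws if a pattern is invalid (preg_replace() returns null on error).
 */
function applyRules(string $url, array $rules): string
{
    foreach ($rules as [$pattern, $replacement]) {
        $result = preg_replace($pattern, $replacement, $url);
        if ($result === null) {
            throw new RuntimeException("Invalid pattern: $pattern");
        }
        $url = $result;
    }
    return $url;
}

echo applyRules('https://example.com/blog/post?utm_source=x#top', $rules), PHP_EOL;
// prints "https://example.com/blog/post"
```

Running the same helper with a catch-all rule such as ['/.*/', ''] returns an empty string, which is a quick way to spot a rule that is too aggressive.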

Iterative Crawling:
- The bundle processes URLs in batches (configurable via process settings).
- Use analyzeCrawlingWebResource() to inspect progress:
```
$stats = $crawler->analyzeCrawlingWebResource($url);
// $stats includes counts of processed/ignored/errored nodes.
```

Handling Results:

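Read crawled nodes straight from the database (e.g., the resource_crawler_nodes table). The DBAL connection is enough; no mapped entity is required.

Example: Fetch the paths of all processed nodes (the uri_path and status column names are assumed here, so check them against your schema):

// Plain SQL through Doctrine DBAL; returns an array of associative rows.
$rows = $entityManager->getConnection()->fetchAllAssociative(
    'SELECT uri_path FROM resource_crawler_nodes WHERE status = :status',
    ['status' => 'processed']
);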

Custom Logic:

Implement RefHandlerClosureInterface for per-link logic (import RefHandlerClosureInterface, Ref, and CrawlingTask from the bundle's namespace first):

$handler = new class() implements RefHandlerClosureInterface {
    public function call(Ref $ref, CrawlingTask $task): void {
        if ($ref->getUriPath() === '/admin') {
            $task->ignoreNode($ref); // Skip admin pages
        }
    }
};

Integration Tips

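Queue Crawling: Combine with Symfony Messenger to process tasks asynchronously.
Rate Limiting: Symfony HTTP Client has no per-request delay option (its delay setting applies only to retries of failed requests), so throttle in your own loop or with the Symfony RateLimiter component to avoid overloading targets.
Storage: For large crawls, consider process_file_store, which may be faster than the DB store for metadata-heavy workloads; benchmark both before committing.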
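
The Rate Limiting tip can be approximated with a small fixed-interval throttle around your own crawl loop. This is a sketch under stated assumptions, not the bundle's API: FixedIntervalThrottle is a hypothetical helper, and the point where the crawler is called is only marked with a comment.

```php
<?php
/**
 * Minimal fixed-interval throttle (hypothetical helper, not part of the bundle).
 * It tracks when the last request was scheduled and reports how long to wait.
 */
final class FixedIntervalThrottle
{
    private ?float $lastCall = null;

    public function __construct(private float $minIntervalSeconds)
    {
    }

    /** Returns the number of seconds to wait before a request issued at $now. */
    public function delayFor(float $now): float
    {
        if ($this->lastCall === null) {
            $this->lastCall = $now;
            return 0.0;
        }
        $earliest = $this->lastCall + $this->minIntervalSeconds;
        $wait = max(0.0, $earliest - $now);
        // Record the moment the request will actually go out.
        $this->lastCall = $now + $wait;
        return $wait;
    }
}

$throttle = new FixedIntervalThrottle(0.5); // at most one request every 0.5 s
foreach (['https://example.com/a', 'https://example.com/b'] as $url) {
    $wait = $throttle->delayFor(microtime(true));
    usleep((int) round($wait * 1_000_000));
    // Call $crawler->crawlWebResource($url, ...) here.
}
```

Passing the current time in as an argument keeps the class free of hidden clock calls, which makes the waiting logic easy to unit test.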

Gotchas and Tips

Pitfalls

Locking Issues:
- If is_lockable: true, ensure your storage backend (DB/file) supports concurrent access.
- Debug: Check resource_crawler_processes table for stuck locks.
Path Mask Overlap:
- Conflicting masks (e.g., +example.com/* and -example.com/private) may cause unintended exclusions.
- Fix: Test with analyzeCrawlingWebResource() before full crawls.
Substitution Rules:
- An overly aggressive pattern (e.g., /.*/ replaced with an empty string) erases the whole URL.
- Tip: Run each rule through preg_replace() on sample URLs first.
Memory Limits:
- Crawling large sites may hit PHP memory limits.
- Fix: Process in smaller batches or use process_file_store.
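HTTPS/HTTP Redirects:
- Mixed-content warnings are a browser feature and do not affect an HTTP client, but a crawl can still fail when a site redirects between https:// and http:// more often than the client allows.
- Fix: Adjust the Symfony HTTP Client options (assuming the crawler uses this client). The client has no allow_http option; max_redirects is the documented setting for redirect handling:
```
$client = $this->get('http_client');
// withOptions() returns a new client instance; the original is unchanged.
$client = $client->withOptions(['max_redirects' => 10]);
```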

Debugging

Log Crawling: Enable Symfony’s profiler to inspect resource_crawler events.
Reset State: Use resetWebResource($url) to clear all data for a domain (useful for retries).
Rollback Tasks: If a crawl fails midway, use rollbackTask($task) to reprocess nodes.

Extension Points

Custom Stores: Implement ResourceCrawlerBundle\Store\ProcessStoreInterface for alternative backends (e.g., Redis).
Node Filters: Implement ResourceCrawlerBundle\Crawler\NodeFilterInterface to add logic (e.g., skip non-200 responses).
Event Listeners: Subscribe to resource_crawler.node_processed events for post-crawl actions.

Configuration Quirks

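File Store Directory: Ensure storage/saver exists and is writable by the user that runs PHP (chmod 755 is enough only if that user owns the directory).
Doctrine Schema Filter: If you want Doctrine's schema tools to manage the bundle's tables, remove the schema_filter from doctrine.yaml.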
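
The File Store Directory check can be automated before the first crawl. The following is a minimal sketch: ensureWritableDir() is a hypothetical helper, and the demo path under the system temp directory stands in for your real storage/saver directory.

```php
<?php
/**
 * Ensures a directory exists and is writable by the current PHP process.
 * Hypothetical helper; not part of the bundle.
 */
function ensureWritableDir(string $dir): void
{
    // The second is_dir() guards against a concurrent process creating it first.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    if (!is_writable($dir)) {
        throw new RuntimeException("Directory is not writable: $dir");
    }
}

// Demo path only; in a project this would be the configured file_store dir.
$dir = sys_get_temp_dir() . '/resource_crawler_demo/storage/saver';
ensureWritableDir($dir);
echo "OK: $dir", PHP_EOL;
```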

Performance Tips

Batch Processing: Limit concurrent connections per host via Symfony HTTP Client’s max_host_connections setting (there is no max_concurrency option).
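
As a configuration sketch, the per-host connection limit is set on the framework's HTTP client. This assumes the crawler sends its requests through the default http_client service, which is not confirmed here; the value 4 is only an example.

```yaml
# config/packages/framework.yaml
framework:
    http_client:
        # Maximum simultaneous connections opened to a single host.
        max_host_connections: 4
```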

Indexing: Add a composite index to resource_crawler_nodes for faster queries:

CREATE INDEX idx_nodes_process_uri ON resource_crawler_nodes(process_id, uri_path);
