Resource Crawler Bundle Laravel Package

andrew-svirin/resource-crawler-bundle

Getting Started

Minimal Setup

  1. Installation:
    composer require andrew-svirin/resource-crawler-bundle:dev-main
    
  2. Configure doctrine.yaml (so Doctrine's schema tools ignore the bundle's tables):
    doctrine:
        dbal:
            schema_filter: ~^(?!resource_crawler_)~
    
  3. Configure resource_crawler.yaml (choose storage backend):
    resource_crawler:
      process:
        is_lockable: true
        store: 'resource_crawler.process_db_store'  # or 'resource_crawler.process_file_store'
        file_store:
          dir: "%kernel.project_dir%/storage/saver"
    
  4. Run migrations:
    php bin/console doctrine:migrations:migrate
    
  5. First Crawl:
    $crawler = $this->get('resource_crawler.crawler');
    $task = $crawler->crawlWebResource(
        'https://example.com',
        ['+example.com/'],  // Include paths
        [],                 // Substitution rules
        null                // Ref handler (optional)
    );
    

First Use Case: Basic Web Crawling

  • Use crawlWebResource() to start crawling a URL with path filters.
  • Check results via analyzeCrawlingWebResource().
  • Example: Crawl a blog to extract all posts (e.g., +example.com/blog/*).

Implementation Patterns

Core Workflow

  1. Task Creation:

    $task = $crawler->crawlWebResource(
        $url,
        $pathMasks,       // e.g., ['+example.com/', '-*.pdf']
        $substitutionRules, // Regex to clean URLs
        $refHandler       // Custom logic per link
    );
    
    • Path Masks: Use + to include, - to exclude (e.g., +example.com/blog/* -example.com/blog/archives).
    • Substitution Rules: Modify URLs dynamically (e.g., remove query params).
  2. Iterative Crawling:

    • The bundle processes URLs in batches (configurable via process settings).
    • Use analyzeCrawlingWebResource() to inspect progress:
      $stats = $crawler->analyzeCrawlingWebResource($url);
      // $stats includes counts of processed/ignored/errored nodes.
      
  3. Handling Results:

    • Access crawled nodes via Doctrine queries (e.g., resource_crawler_nodes table).
    • Example: Fetch all processed nodes (assumes an App\Entity\ResourceCrawlerNode entity that you have mapped to that table):
      $nodes = $entityManager->createQueryBuilder()
          ->select('n.uri_path')
          ->from('App\Entity\ResourceCrawlerNode', 'n')
          ->where('n.status = :status')
          ->setParameter('status', 'processed')
          ->getQuery()
          ->getResult();
      
  4. Custom Logic:

    • Implement RefHandlerClosureInterface for per-link logic:
      $handler = new class() implements RefHandlerClosureInterface {
          public function call(Ref $ref, CrawlingTask $task): void {
              if ($ref->getUriPath() === '/admin') {
                  $task->ignoreNode($ref); // Skip admin pages
              }
          }
      };
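
The + and - path masks described in step 1 can be checked against sample URLs before a real crawl. The following is a minimal sketch, not the bundle's own matcher: passesMasks() is a hypothetical helper, and it assumes that * matches any run of characters, that masks match as prefixes, and that a matching - mask always overrides + masks.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the bundle): decides whether a URL,
 * written without its scheme, passes a list of '+'/'-' masks.
 * Assumed semantics: '*' is a wildcard, masks match as prefixes,
 * and any matching '-' mask wins over '+' masks.
 */
function passesMasks(string $url, array $masks): bool
{
    $included = false;
    foreach ($masks as $mask) {
        $sign = $mask[0];               // '+' or '-'
        $pattern = substr($mask, 1);    // mask body without the sign
        // Escape regex metacharacters, then turn the escaped '*' into '.*'.
        $regex = '~^' . str_replace('\*', '.*', preg_quote($pattern, '~')) . '~';
        if (preg_match($regex, $url) !== 1) {
            continue;                   // this mask does not apply to the URL
        }
        if ($sign === '-') {
            return false;               // exclusions win immediately
        }
        $included = true;
    }
    return $included;
}

$masks = ['+example.com/blog/*', '-example.com/blog/archives'];
var_dump(passesMasks('example.com/blog/post-1', $masks));        // bool(true)
var_dump(passesMasks('example.com/blog/archives/2020', $masks)); // bool(false)
var_dump(passesMasks('example.com/about', $masks));              // bool(false)
```

Running a handful of representative URLs through a check like this makes overlapping masks visible early. The bundle's real precedence rules may differ, so confirm them against an actual crawl.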
      

Integration Tips

  • Queue Crawling: Combine with Symfony Messenger to process tasks asynchronously.
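  • Rate Limiting: Throttle requests to avoid overloading targets, e.g., with Symfony's ThrottlingHttpClient (Symfony 7.1+) or a fixed pause between requests.
  • Storage: For large crawls, consider process_file_store, which may be faster than the DB store for metadata-heavy workloads; benchmark both for your case.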
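
The rate-limiting tip can be sketched as a fixed minimum interval between requests. This is a generic illustration that assumes you control the loop issuing requests; throttleDelayMicros() is a hypothetical helper and not part of the bundle or of Symfony.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns how many microseconds to wait so that
 * consecutive requests are at least $minIntervalSec seconds apart.
 */
function throttleDelayMicros(float $lastRequestAt, float $now, float $minIntervalSec): int
{
    $remaining = $minIntervalSec - ($now - $lastRequestAt);
    return $remaining > 0 ? (int) round($remaining * 1_000_000) : 0;
}

// Usage: keep at least 0.1 s between two (placeholder) requests.
$last = microtime(true);
// ... the first request would run here ...
usleep(throttleDelayMicros($last, microtime(true), 0.1));
// ... the second request would run here ...
```

Keeping the delay calculation separate from the sleep call makes the timing logic easy to test without waiting.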

Gotchas and Tips

Pitfalls

  1. Locking Issues:

    • If is_lockable: true, ensure your storage backend (DB/file) supports concurrent access.
    • Debug: Check resource_crawler_processes table for stuck locks.
  2. Path Mask Overlap:

    • Conflicting masks (e.g., +example.com/* and -example.com/private) may cause unintended exclusions.
    • Fix: Test with analyzeCrawlingWebResource() before full crawls.
  3. Substitution Rules:

    • Overly aggressive regex (e.g., s/.*//) may break URLs.
    • Tip: Test rules with preg_replace() first.
  4. Memory Limits:

    • Crawling large sites may hit PHP memory limits.
    • Fix: Process in smaller batches or use process_file_store.
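  5. HTTPS/HTTP Mixed Content:

    • The crawler follows redirects but may fail when a site mixes HTTPS and HTTP URLs or serves an invalid certificate.
    • Fix: Relax the Symfony HTTP Client options on a configured copy of the client. withOptions(), max_redirects, and verify_peer are standard Symfony options; disable certificate checks only for hosts you trust:
      $client = $this->get('http_client');
      // withOptions() returns a new client; the original is left unchanged.
      $client = $client->withOptions(['max_redirects' => 10, 'verify_peer' => false]);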
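
For the substitution-rule pitfall, rules can be dry-run with preg_replace() before they reach the crawler. The sketch below assumes a [pattern => replacement] array, which is an illustrative shape and not the bundle's documented rule format; applyRules() is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: applies regex rules in order and fails loudly when
 * a pattern is invalid or a rule erases the whole URL.
 */
function applyRules(string $url, array $rules): string
{
    foreach ($rules as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $url);
        if ($result === null) {
            // preg_replace() returns null when the pattern fails to compile or run.
            throw new RuntimeException("Invalid or failing pattern: $pattern");
        }
        if ($result === '') {
            throw new RuntimeException("Rule $pattern erased the URL: $url");
        }
        $url = $result;
    }
    return $url;
}

$rules = [
    '~\?.*$~' => '',   // drop the query string
    '~#.*$~'  => '',   // drop the fragment
];
// prints https://example.com/blog/post
echo applyRules('https://example.com/blog/post?utm_source=x#top', $rules), PHP_EOL;
```

An overly broad rule such as '~.*~' => '' raises an exception here instead of silently producing empty URLs.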
      

Debugging

  • Log Crawling: Enable Symfony’s profiler to inspect resource_crawler events.
  • Reset State: Use resetWebResource($url) to clear all data for a domain (useful for retries).
  • Rollback Tasks: If a crawl fails midway, use rollbackTask($task) to reprocess nodes.

Extension Points

  1. Custom Stores: Implement ResourceCrawlerBundle\Store\ProcessStoreInterface for alternative backends (e.g., Redis).
  2. Node Filters: Implement ResourceCrawlerBundle\Crawler\NodeFilterInterface to add logic (e.g., skip non-200 responses).
  3. Event Listeners: Subscribe to resource_crawler.node_processed events for post-crawl actions.

Configuration Quirks

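  • File Store Directory: Ensure storage/saver exists and is writable by the user that runs PHP (chmod 755 is enough only when that user owns the directory).
  • Doctrine Schema Filter: If you want Doctrine's schema tools to manage the bundle's tables, remove the schema_filter from doctrine.yaml.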
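
A quick pre-flight check for the file store directory might look like the sketch below. ensureWritableDir() is a hypothetical helper, and the demo uses the system temp directory as a stand-in; in a project the path would be the file_store.dir value from resource_crawler.yaml.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: creates the directory if it is missing and reports
 * whether the current PHP user can write to it.
 */
function ensureWritableDir(string $dir): bool
{
    // The second is_dir() covers another process creating the directory first.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        return false; // the directory could not be created
    }
    return is_writable($dir);
}

// Stand-in path for the demo; replace it with your configured file_store.dir.
$dir = sys_get_temp_dir() . '/storage/saver';
var_dump(ensureWritableDir($dir));
```

Running this as the same user as the web server or worker process gives a more reliable answer than inspecting permission bits by hand.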

Performance Tips

  • Batch Processing: Limit concurrent requests per host via Symfony HTTP Client's max_host_connections setting.
  • Indexing: Add a composite index to resource_crawler_nodes for faster queries:
    CREATE INDEX idx_nodes_process_uri ON resource_crawler_nodes(process_id, uri_path);
    