Crawler Teaser Indexer Laravel Package

atoolo/crawler-teaser-indexer

Getting Started

Minimal Setup for First Use

  1. Installation

    composer require atoolo/crawler-teaser-indexer
    

    Register the bundle in config/bundles.php:

    Atoolo\Crawler\AtooloCrawlerTeaserIndexerBundle::class => ['all' => true],
    
  2. Configure Scheduling

    Create config/packages/atoolo_crawler_master.yaml with cron schedules and retry status codes:

    parameters:
      atoolo.crawler.schedule:
        - "0 3 * * *"  # Daily at 3 AM
      atoolo.crawler.retry_status_codes:
        - 408
        - 500
    
  3. Define Crawler Configuration

    Create base_dir/indexer/atooloTeaserCrawler.php with a minimal PHP array:

    <?php

    return [
      "name" => "My External Teaser Indexer",
      "data" => [
        "sp_crawling_sites" => [
          [
            "sp_id" => "my_site",
            "sp_url" => "https://example.com",
            "sp_title_css" => ["h1", ".title"],
            "sp_start_urls" => [["sp_url" => "https://example.com/blog", "sp_extraction_depth" => 1]],
          ]
        ]
      ]
    ];
    
  4. Run the Crawler

    php bin/console crawler:index -vvv
    

First Use Case: Indexing a Blog

  • Configure sp_start_urls to point to a blog page.
  • Set sp_title_css to target blog post titles (e.g., ".post-title").
  • Run the crawler to extract and index teasers into Solr.
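
Put together, a blog-focused atooloTeaserCrawler.php might look like the sketch below. It uses only keys already shown in the minimal setup. The site ID, URLs, and the .post-title selector are placeholder values chosen for illustration, not defaults of the bundle.

```php
<?php

// Hypothetical blog configuration; every value here is a placeholder.
return [
    "name" => "Blog Teaser Indexer",
    "data" => [
        "sp_crawling_sites" => [
            [
                "sp_id" => "my_blog",
                "sp_url" => "https://example.com",
                // Blog-specific selector plus a generic fallback.
                "sp_title_css" => [".post-title", "h1"],
                "sp_start_urls" => [
                    ["sp_url" => "https://example.com/blog", "sp_extraction_depth" => 1],
                ],
            ],
        ],
    ],
];
```

After saving the file, run php bin/console crawler:index -vvv and check the verbose output to see whether the selectors matched.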

Implementation Patterns

Workflow Integration

  1. Scheduled Execution

    Use the atoolo.crawler.schedule parameter to run crawlers via Symfony's scheduler (e.g., 0 3 * * * for daily runs at 3 AM). Example:

    parameters:
      atoolo.crawler.schedule:
        - "0 0 * * 1"  # Every Monday at midnight
    
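  2. Dynamic Configuration

    Store site-specific configurations in atooloTeaserCrawler.php and load them dynamically:

    // include returns the array that the configuration file returns
    $config = include __DIR__.'/indexer/atooloTeaserCrawler.php';
    // Crawler stands in for whichever service consumes the configuration
    $crawler = new Crawler($config['data']);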
    
  3. Solr Indexing

    After crawling, the crawler:index command indexes the results into Solr. Ensure your Solr core’s schema accepts the indexed fields, then run:

    php bin/console crawler:index --env=prod
    
  4. Error Handling

    Use sp_max_retry and sp_delay_ms to handle transient failures:

    "sp_max_retry" => 3,       // Retry 3 times
    "sp_delay_ms" => 1000,    // 1-second delay between retries
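
To make the meaning of these two settings concrete, here is a minimal sketch of a retry loop with a fixed delay. It is not the bundle's implementation: fetchWithRetry, its parameters, and the fake endpoint are hypothetical names, and the assumption that sp_max_retry counts retries after the first attempt is mine.

```php
<?php

/**
 * Hypothetical helper: calls $fetch until it returns a status code that is
 * not in $retryStatusCodes, or until $maxRetry retries have been used.
 *
 * @param callable $fetch performs one request and returns the HTTP status code
 * @return array{status: int, attempts: int}
 */
function fetchWithRetry(callable $fetch, int $maxRetry, int $delayMs, array $retryStatusCodes): array
{
    $attempts = 0;
    while (true) {
        $status = $fetch();
        $attempts++;
        $retryable = in_array($status, $retryStatusCodes, true);
        // Stop on a non-retryable status or once all retries are used up.
        if (!$retryable || $attempts > $maxRetry) {
            break;
        }
        usleep($delayMs * 1000); // usleep() expects microseconds
    }

    return ['status' => $status, 'attempts' => $attempts];
}

// Usage: a fake endpoint that fails twice with 500, then succeeds.
$responses = [500, 500, 200];
$result = fetchWithRetry(
    function () use (&$responses): int {
        return array_shift($responses);
    },
    3,          // comparable to "sp_max_retry" => 3
    10,         // short delay for the demo; the config above uses 1000
    [408, 500]  // comparable to atoolo.crawler.retry_status_codes
);
```

With these inputs the helper stops on the third attempt. With an endpoint that always returns 500 it would give up after one initial attempt plus three retries.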
    

Common Patterns

  • CSS Selector Flexibility: Always provide fallback selectors (e.g., ["h1", ".title", "#main-title"]).
  • URL Filtering: Use sp_allow_prefixes and sp_deny_endings to refine crawling scope:
    "sp_allow_prefixes" => ["https://example.com/blog/"],
    "sp_deny_endings" => [".pdf", ".jpg"],
    
  • Content Scoring: Enable sp_content_scoring_active and set a minimum score to filter out low-quality content:
    "sp_content_scoring_active" => true,
    "sp_content_scoring_min_score" => 5,
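
The URL filtering pattern above can be sketched as a small predicate. This is an illustration under assumptions, not the bundle's code: isUrlAllowed is a hypothetical function, and matching the denied endings case-insensitively against the URL path only is my choice. It requires PHP 8.0 or later for str_starts_with and str_ends_with.

```php
<?php

/**
 * Hypothetical filter: a URL passes when it starts with one of the allowed
 * prefixes and its path does not end with one of the denied endings.
 */
function isUrlAllowed(string $url, array $allowPrefixes, array $denyEndings): bool
{
    $allowed = false;
    foreach ($allowPrefixes as $prefix) {
        if (str_starts_with($url, $prefix)) {
            $allowed = true;
            break;
        }
    }
    if (!$allowed) {
        return false;
    }

    // Compare against the path only, so "file.pdf?download=1" is still denied.
    $path = strtolower((string) parse_url($url, PHP_URL_PATH));
    foreach ($denyEndings as $ending) {
        if (str_ends_with($path, strtolower($ending))) {
            return false;
        }
    }

    return true;
}

$allow = ["https://example.com/blog/"];
$deny = [".pdf", ".jpg"];

$post = isUrlAllowed("https://example.com/blog/post-1", $allow, $deny);   // true
$pdf  = isUrlAllowed("https://example.com/blog/file.pdf", $allow, $deny); // false
$shop = isUrlAllowed("https://example.com/shop/item", $allow, $deny);     // false
```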
    

Gotchas and Tips

Pitfalls

  1. Missing Required Fields

    • Issue: The crawler skips URLs if sp_title_css or sp_introText_css (if required) are empty.
    • Fix: Always provide at least one valid selector for mandatory fields.
  2. Solr Schema Mismatch

    • Issue: If Solr expects a datetime field but receives a date, indexing fails.
    • Fix: Set "sp_datetime_only_date" => true to auto-convert dates to YYYY-MM-DD 00:00:00.
  3. Rate Limiting

    • Issue: A high sp_parallel_requests value (e.g., above 5) may trigger HTTP 429 (Too Many Requests) responses from the target site.
    • Fix: Start with "sp_parallel_requests" => 3 and increase gradually.
  4. Caching Issues

    • Issue: Changes to atoolo_crawler_master.yaml may not take effect until the Symfony cache is cleared.
    • Fix: Clear the cache after editing the file:
      php bin/console cache:clear
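
The date conversion mentioned under Solr Schema Mismatch (a date-only value becoming YYYY-MM-DD 00:00:00) can be illustrated in plain PHP. This is only a sketch of the transformation itself: toMidnightDateTime is a hypothetical helper, and how the bundle performs the conversion internally is not shown in this document.

```php
<?php

/**
 * Hypothetical helper: pads a date-only value to a midnight datetime string.
 * Returns null when the input is not a valid YYYY-MM-DD date.
 */
function toMidnightDateTime(string $date): ?string
{
    // The leading "!" resets all fields that are not parsed (the time) to zero.
    $parsed = DateTimeImmutable::createFromFormat('!Y-m-d', $date);

    // Reject parse failures and overflowing dates such as 2024-02-30.
    if ($parsed === false || $parsed->format('Y-m-d') !== $date) {
        return null;
    }

    return $parsed->format('Y-m-d H:i:s');
}

$normalized = toMidnightDateTime('2024-05-17'); // "2024-05-17 00:00:00"
```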
      

Debugging Tips

  • Verbose Logging: Use -vvv to debug crawler behavior:
    php bin/console crawler:index -vvv
    
  • Dry Runs: Test selectors with a browser’s DevTools before configuring the crawler.
  • Solr Validation: Check Solr logs for schema errors if indexing fails.

Extension Points

  1. Custom Extractors

    Extend the crawler by implementing a custom ExtractorInterface for non-standard HTML structures.

  2. Post-Processing

    Use Symfony’s event system to modify extracted data before Solr indexing:

    # In services.yaml
    services:
      Atoolo\Crawler\EventListener\TeaserPostProcessor:
        tags:
          - { name: kernel.event_listener, event: crawler.teaser.extracted, method: onTeaserExtracted }
    
  3. Dynamic Config Loading

    Load configurations from a database or API instead of static files for dynamic sites:

    $config = $database->fetchConfigForSite($siteId);
    

Configuration Quirks

  • OpenGraph Fallback: If sp_title_opengraph is empty, the crawler falls back to sp_title_css.
  • Query Parameter Handling: sp_strip_query_params_active removes params like ?page=2 to avoid duplicates.
  • User-Agent: Always set a descriptive sp_user_agent (e.g., "MyBot/1.0 (+contact@example.com)") to identify your crawler.
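
To show why stripping query parameters avoids duplicates, the sketch below normalizes URLs before using them as keys. It is an assumption-laden illustration: stripQueryParams is a hypothetical function that drops the entire query string and fragment, whereas the bundle may remove only selected parameters.

```php
<?php

/**
 * Hypothetical normalizer: drops the query string and fragment from a URL
 * so that paginated or tracked variants map to one canonical URL.
 * User info (user:pass@) is also dropped in this simplified version.
 */
function stripQueryParams(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave unparseable or relative URLs untouched
    }

    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    return $parts['scheme'] . '://' . $parts['host'] . $port . ($parts['path'] ?? '');
}

// Both paginated variants collapse to the same key.
$seen = [];
foreach (["https://example.com/blog?page=2", "https://example.com/blog?page=3"] as $url) {
    $seen[stripQueryParams($url)] = true;
}
```

Here $seen ends up with a single entry, https://example.com/blog, which is the deduplication effect the setting is described as providing.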

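Laravel-Specific Adaptations

Since this is a Symfony bundle, integrate it into Laravel via:

  1. Bridge Package

    Use spatie/laravel-symfony-bundle to load Symfony bundles in Laravel.
  2. Console Command Proxy

    Create a Laravel Artisan command to call the Symfony crawler:

    // app/Console/Commands/RunCrawler.php
    namespace App\Console\Commands;

    use Illuminate\Console\Command;
    use Symfony\Component\Console\Application;
    use Symfony\Component\Console\Input\ArrayInput;

    class RunCrawler extends Command {
        protected $signature = 'crawler:run';

        public function handle(): int {
            $symfonyApp = new Application();
            $symfonyApp->setAutoExit(false); // return to Artisan instead of exiting
            // Assumes IndexCommand needs no constructor arguments
            $symfonyApp->add(new \Atoolo\Crawler\Command\IndexCommand());
            // Pass the command name explicitly; run() would otherwise parse Artisan's argv
            return $symfonyApp->run(new ArrayInput(['command' => 'crawler:index']));
        }
    }
    
  3. Service Provider

    Register the bundle in AppServiceProvider:

    public function boot() {
        if ($this->app->environment('local')) {
            // register() expects a Laravel service provider, so the bundle
            // must be wrapped in one (e.g., by the bridge package)
            $this->app->register(\Atoolo\Crawler\AtooloCrawlerTeaserIndexerBundle::class);
        }
    }
    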