atoolo/crawler-teaser-indexer
Installation
composer require atoolo/crawler-teaser-indexer
Register the bundle in config/bundles.php:
Atoolo\Crawler\AtooloCrawlerTeaserIndexerBundle::class => ['all' => true],
Configure Scheduling
Create config/packages/atoolo_crawler_master.yaml with cron schedules and retry status codes:
parameters:
    atoolo.crawler.schedule:
        - "0 3 * * *" # Daily at 3 AM
    atoolo.crawler.retry_status_codes:
        - 408
        - 500
Define Crawler Configuration
Create base_dir/indexer/atooloTeaserCrawler.php with a minimal PHP array:
<?php

return [
    "name" => "My External Teaser Indexer",
    "data" => [
        "sp_crawling_sites" => [
            [
                "sp_id" => "my_site",
                "sp_url" => "https://example.com",
                "sp_title_css" => ["h1", ".title"],
                "sp_start_urls" => [
                    ["sp_url" => "https://example.com/blog", "sp_extraction_depth" => 1],
                ],
            ],
        ],
    ],
];
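Before running the crawler, it can help to check that the array has the shape shown above. The following is a minimal sketch: `findConfigProblems` is a hypothetical helper, not part of the bundle, and the rule that all four site keys are required is an assumption drawn only from this example.

```php
<?php

/**
 * Hypothetical helper: lists problems found in a crawler config array.
 * Key names come from the example above; the "required" rules are assumptions.
 */
function findConfigProblems(array $config): array
{
    $sites = $config['data']['sp_crawling_sites'] ?? null;
    if (!is_array($sites) || $sites === []) {
        return ['data.sp_crawling_sites must be a non-empty list'];
    }

    $problems = [];
    foreach ($sites as $index => $site) {
        if (!is_array($site)) {
            $problems[] = "site #$index: entry must be an array";
            continue;
        }
        // Report every key from the example that is absent or empty.
        foreach (['sp_id', 'sp_url', 'sp_title_css', 'sp_start_urls'] as $key) {
            if (empty($site[$key])) {
                $problems[] = "site #$index: missing $key";
            }
        }
        // Each start entry should carry a syntactically valid URL.
        $starts = $site['sp_start_urls'] ?? [];
        foreach (is_array($starts) ? $starts : [] as $start) {
            if (filter_var($start['sp_url'] ?? '', FILTER_VALIDATE_URL) === false) {
                $problems[] = "site #$index: invalid start URL";
            }
        }
    }
    return $problems;
}

$config = [
    "name" => "My External Teaser Indexer",
    "data" => [
        "sp_crawling_sites" => [
            [
                "sp_id" => "my_site",
                "sp_url" => "https://example.com",
                "sp_title_css" => ["h1", ".title"],
                "sp_start_urls" => [
                    ["sp_url" => "https://example.com/blog", "sp_extraction_depth" => 1],
                ],
            ],
        ],
    ],
];

$problems = findConfigProblems($config);
echo $problems === [] ? "config looks complete\n" : implode("\n", $problems) . "\n";
```

A check like this catches typos in key names early, before a crawl run fails on a missing field.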
Run the Crawler
php bin/console crawler:index -vvv
Set sp_start_urls to point to a blog page.
Set sp_title_css to target blog post titles (e.g., ".post-title").
Scheduled Execution
Use the atoolo.crawler.schedule parameter to run crawlers via Symfony's scheduler (e.g., 0 3 * * * for daily runs).
Example:
parameters:
    atoolo.crawler.schedule:
        - "0 0 * * 1" # Every Monday at midnight
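To make the two schedule strings easier to read, the sketch below splits a five-field cron expression into named fields. `describeCron` is a hypothetical helper for illustration only; it does not validate value ranges and it is not how the bundle or Symfony's scheduler evaluates schedules.

```php
<?php

/**
 * Hypothetical helper: maps the five cron fields to their conventional names.
 * It only splits the string; it does not check ranges or compute run times.
 */
function describeCron(string $expression): array
{
    $fields = preg_split('/\s+/', trim($expression));
    if ($fields === false || count($fields) !== 5) {
        throw new InvalidArgumentException('Expected exactly 5 cron fields');
    }
    return array_combine(
        ['minute', 'hour', 'day_of_month', 'month', 'day_of_week'],
        $fields
    );
}

print_r(describeCron('0 3 * * *')); // minute 0, hour 3, every day
print_r(describeCron('0 0 * * 1')); // minute 0, hour 0, day_of_week 1 (Monday)
```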
Dynamic Configuration
Store site-specific configurations in atooloTeaserCrawler.php and load them dynamically:
$config = include __DIR__.'/indexer/atooloTeaserCrawler.php';
$crawler = new Crawler($config['data']);
Solr Indexing
The crawler:index command crawls the configured sites and indexes the results into Solr. Make sure your Solr core's schema accepts the indexed fields before running it in production:
php bin/console crawler:index --env=prod
Error Handling
Use sp_max_retry and sp_delay_ms to handle transient failures:
"sp_max_retry" => 3, // Retry 3 times
"sp_delay_ms" => 1000, // 1-second delay between retries
Selector options accept several CSS selectors (e.g., ["h1", ".title", "#main-title"]).
Use sp_allow_prefixes and sp_deny_endings to refine crawling scope:
"sp_allow_prefixes" => ["https://example.com/blog/"],
"sp_deny_endings" => [".pdf", ".jpg"],
Use sp_content_scoring_active to filter low-quality content:
"sp_content_scoring_active" => true,
"sp_content_scoring_min_score" => 5,
Missing Required Fields
Check whether sp_title_css or sp_introText_css (if required) are empty.
Solr Schema Mismatch
If Solr expects a datetime field but receives a date, indexing fails.
Set "sp_datetime_only_date" => true to auto-convert dates to YYYY-MM-DD 00:00:00.
Rate Limiting
High sp_parallel_requests values (e.g., >5) may trigger 429 errors.
Start with sp_parallel_requests: 3 and increase gradually.
Caching Issues
Changes to atoolo_crawler_master.yaml require cache clearing:
php bin/console cache:clear
Use -vvv to debug crawler behavior:
php bin/console crawler:index -vvv
Custom Extractors
Extend the crawler by implementing a custom ExtractorInterface for non-standard HTML structures.
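What such an extractor might look like is sketched below. The interface declared here is a hypothetical stand-in: the bundle's real ExtractorInterface may use a different namespace, method name, and signature, so treat this only as an illustration of the idea. The `data-teaser-title` attribute is likewise an invented example of a non-standard HTML structure. The sketch relies on PHP's DOM extension.

```php
<?php

// Hypothetical stand-in; the bundle's actual interface may look different.
interface ExtractorInterface
{
    public function extract(string $html): ?string;
}

/**
 * Reads the title from a custom data attribute instead of a CSS selector.
 */
final class DataAttributeTitleExtractor implements ExtractorInterface
{
    public function extract(string $html): ?string
    {
        if (trim($html) === '') {
            return null; // DOMDocument::loadHTML rejects empty input
        }

        // Collect parser warnings internally; real-world markup is rarely valid.
        $previous = libxml_use_internal_errors(true);
        $document = new DOMDocument();
        $document->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $nodes = (new DOMXPath($document))->query('//*[@data-teaser-title]');
        if ($nodes === false || $nodes->length === 0) {
            return null;
        }

        $node = $nodes->item(0);
        if (!$node instanceof DOMElement) {
            return null;
        }
        $title = trim($node->getAttribute('data-teaser-title'));
        return $title === '' ? null : $title;
    }
}

$extractor = new DataAttributeTitleExtractor();
$html = '<article data-teaser-title="Spring Festival"><p>Details follow.</p></article>';
echo $extractor->extract($html) ?? 'no title', "\n";
```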
Post-Processing
Use Symfony's event system to modify extracted data before Solr indexing:
# In services.yaml
services:
    Atoolo\Crawler\EventListener\TeaserPostProcessor:
        tags:
            - { name: kernel.event_listener, event: crawler.teaser.extracted, method: onTeaserExtracted }
Dynamic Config Loading
Load configurations from a database or API instead of static files for dynamic sites:
$config = $database->fetchConfigForSite($siteId);
If sp_title_opengraph is empty, the crawler falls back to sp_title_css.
sp_strip_query_params_active removes params like ?page=2 to avoid duplicates.
Set sp_user_agent (e.g., "MyBot/1.0 (+contact@example.com)") to identify your crawler.
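Why stripping query parameters avoids duplicates can be shown with a small sketch. `stripQueryParams` is a hypothetical helper, not the bundle's code, and it makes a simplifying assumption: it drops the entire query string and fragment, whereas the real option may treat some parameters differently.

```php
<?php

/**
 * Hypothetical normalizer: keeps scheme, host, port and path,
 * and drops the query string and fragment.
 */
function stripQueryParams(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave anything that is not an absolute URL untouched
    }
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    return $parts['scheme'] . '://' . $parts['host'] . $port . ($parts['path'] ?? '');
}

$urls = [
    'https://example.com/blog?page=2',
    'https://example.com/blog?page=3',
    'https://example.com/blog',
];

// After normalization all three URLs collapse into one entry.
$unique = array_values(array_unique(array_map('stripQueryParams', $urls)));
print_r($unique);
```

The trade-off is that paginated listings are only visited once, so this suits sites where the query string does not select different content.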
### Laravel-Specific Adaptations
Since this is a Symfony bundle, integrate it into Laravel via:
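1. **Bridge Package**: Laravel cannot load Symfony bundles natively, so a bridge package such as `spatie/laravel-symfony-bundle` is needed; confirm that the package is available on Packagist before relying on it.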
2. **Console Command Proxy**: Create a Laravel Artisan command to call the Symfony crawler:
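```php
// app/Console/Commands/RunCrawler.php
namespace App\Console\Commands;

use Illuminate\Console\Command;
use Symfony\Component\Console\Application;
use Symfony\Component\Console\Input\ArrayInput;

class RunCrawler extends Command {
    protected $signature = 'crawler:run';

    public function handle(): int {
        $symfonyApp = new Application();
        // Return the exit code to Artisan instead of terminating the process.
        $symfonyApp->setAutoExit(false);
        // Assumes IndexCommand can be constructed without dependencies.
        $symfonyApp->add(new \Atoolo\Crawler\Command\IndexCommand());
        // Name the command explicitly; otherwise Artisan's own arguments are parsed.
        $input = new ArrayInput(['command' => 'crawler:index']);
        return $symfonyApp->run($input, $this->getOutput());
    }
}
```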
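3. **Service Provider**: Register the bundle in `AppServiceProvider`. This assumes the bridge package makes the bundle class registrable as a Laravel service provider, because `register()` does not accept a plain Symfony bundle:
public function boot(): void {
    if ($this->app->environment('local')) {
        $this->app->register(\Atoolo\Crawler\AtooloCrawlerTeaserIndexerBundle::class);
    }
}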