Technical Evaluation

Architecture Fit

Content Extraction Use Case: Graby excels at article/content extraction from web pages, aligning well with Laravel-based applications requiring web scraping, RSS aggregation, or content syndication (e.g., news platforms, aggregators, or SEO tools).
Decoupled Design: Leverages HTTPlug for HTTP clients (Guzzle/cURL), making it vendor-agnostic and compatible with Laravel’s HTTP stack (e.g., GuzzleHttp).
Site-Specific Configs: Supports custom site patterns (via site_config), enabling fine-grained control over extraction logic for known domains (e.g., WordPress, Blogger).
Laravel Integration Points:
- Queueable: Fetching content can be asynchronous (e.g., via Laravel Queues) to avoid timeouts.
- Caching: Results can be cached (e.g., Illuminate\Cache) to reduce redundant fetches.
- Logging: Built-in Monolog support for debugging extraction failures.
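
To illustrate the caching point, here is a minimal sketch of a read-through cache keyed by URL. It is plain PHP so it can run on its own: the in-memory array stands in for a Laravel cache store, and `CachedExtractor` and the `$fetch` callable are hypothetical names, not part of Graby or Laravel. In an application, the callable would wrap the Graby call.

```php
<?php
declare(strict_types=1);

/**
 * Read-through cache around an extraction callable (illustrative only).
 * The in-memory array stands in for a real cache store.
 */
final class CachedExtractor
{
    /** @var array<string, array{expires: int, value: array}> */
    private array $store = [];

    /** @var callable(string): array */
    private $fetch;

    public function __construct(callable $fetch, private int $ttlSeconds = 3600)
    {
        $this->fetch = $fetch;
    }

    public function extract(string $url, ?int $now = null): array
    {
        $now ??= time();
        $key = sha1($url);

        // Serve from cache while the entry is still fresh.
        if (isset($this->store[$key]) && $this->store[$key]['expires'] > $now) {
            return $this->store[$key]['value'];
        }

        // Otherwise fetch once and remember the result.
        $value = ($this->fetch)($url);
        $this->store[$key] = ['expires' => $now + $this->ttlSeconds, 'value' => $value];

        return $value;
    }
}

// Usage with a stand-in fetcher that counts how often it is called.
$calls = 0;
$extractor = new CachedExtractor(function (string $url) use (&$calls): array {
    $calls++;
    return ['url' => $url, 'title' => 'Example'];
}, 60);

$extractor->extract('https://example.com/a', 1000); // miss, fetched
$extractor->extract('https://example.com/a', 1030); // hit, served from cache
$extractor->extract('https://example.com/a', 1061); // expired, fetched again
echo $calls, PHP_EOL; // prints 2
```

Passing the current time explicitly keeps the expiry logic testable; a real implementation would more likely delegate expiry to the cache store.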

Integration Feasibility

Low Friction: Composer-based installation with zero Laravel-specific dependencies (beyond HTTP clients).
Prefetching: Supports pre-fetched HTML, so markup already obtained elsewhere (e.g., from an API response or a headless-browser render) can be processed without a second request.
HTML Cleanup: Can sanitize/clean extracted content (e.g., for storage or display).
Configuration Override: Laravel’s service container can inject custom configs (e.g., allowed_urls, blocked_urls) via bindings.
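
One way to picture the configuration override is a merge of package defaults with app-specific settings before the array is handed to Graby. The sketch below is an assumption about how such a merge could look, not Graby's own behavior: `mergeConfig` is a hypothetical helper, the values are invented, and `default_referer` is an assumed option name. The other option names are the ones mentioned in this document.

```php
<?php
declare(strict_types=1);

/**
 * Merge overrides into defaults: nested option groups are merged key by key,
 * while scalars and plain lists (e.g. URL lists) are replaced wholesale.
 */
function mergeConfig(array $defaults, array $overrides): array
{
    foreach ($overrides as $key => $value) {
        $current = $defaults[$key] ?? null;
        $bothGroups = is_array($value) && is_array($current)
            && !array_is_list($value) && !array_is_list($current);

        $defaults[$key] = $bothGroups ? mergeConfig($current, $value) : $value;
    }

    return $defaults;
}

// Illustrative defaults; values are made up for the example.
$defaults = [
    'debug' => false,
    'allowed_urls' => [],
    'blocked_urls' => ['ads.example', 'tracker.example'],
    'http_client' => [
        'ua_browser' => 'DefaultAgent/1.0',
        'default_referer' => 'https://example.com/', // assumed option name
    ],
];

// App-specific overrides, e.g. loaded from config/graby.php.
$overrides = [
    'debug' => true,
    'blocked_urls' => ['spam.example'],
    'http_client' => ['ua_browser' => 'MyAggregator/2.0'],
];

$config = mergeConfig($defaults, $overrides);
print_r($config);
```

Lists are replaced rather than merged by index so that an override with fewer entries does not silently keep stale default entries.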

Technical Risk

PHP 8.2 Requirement: Graby 3.x requires PHP 8.2+, which rules out legacy Laravel apps running on older PHP versions. Mitigation: Use Graby 2.x (stable branch) if needed.
Dependency Conflicts: php-http/guzzle7-adapter requires Guzzle 7, so it can clash with older Laravel apps still pinned to Guzzle 6. Solution: Upgrade Guzzle, or use php-http/curl-client, which does not depend on Guzzle.
Rate Limiting: No built-in throttling; risk of IP bans if scraping aggressively. Mitigation: Throttle on the Laravel side with job middleware (e.g., RateLimited) or delayed queue dispatches.
Dynamic Content: May fail on JavaScript-rendered pages (e.g., SPAs). Workaround: Render the page in a headless browser (e.g., Puppeteer) and pass the resulting HTML to Graby as pre-fetched content.
Maintenance: Fork is active (last release 2026), but no Laravel-specific docs. Risk: Undocumented edge cases (e.g., CSRF tokens, auth headers).
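
The rate-limiting mitigation can be sketched as a per-host scheduler that works out how long each request should wait. This is a plain-PHP illustration under stated assumptions: `HostThrottle` is a hypothetical class, and the clock is passed in explicitly so the example is deterministic.

```php
<?php
declare(strict_types=1);

/**
 * Enforces a minimum interval between requests to the same host by
 * handing out the delay each new request should wait (illustrative only).
 */
final class HostThrottle
{
    /** @var array<string, float> earliest time the next request per host may start */
    private array $nextSlot = [];

    public function __construct(private float $minIntervalSeconds)
    {
    }

    public function delayFor(string $url, float $now): float
    {
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host) || $host === '') {
            throw new InvalidArgumentException("URL has no host: {$url}");
        }
        $host = strtolower($host);

        // Start at the reserved slot if one lies in the future, else right away.
        $start = max($now, $this->nextSlot[$host] ?? $now);
        $this->nextSlot[$host] = $start + $this->minIntervalSeconds;

        return $start - $now;
    }
}

// Usage: at most one request every 2 seconds per host.
$throttle = new HostThrottle(2.0);
$first  = $throttle->delayFor('https://example.com/a', 100.0); // 0.0
$second = $throttle->delayFor('https://example.com/b', 100.5); // 1.5
$other  = $throttle->delayFor('https://example.org/a', 100.5); // 0.0
printf("%.1f %.1f %.1f\n", $first, $second, $other);
```

In a Laravel app the returned delay could be rounded up and passed to a job's `delay()` when dispatching, though the slot table would then need shared storage (for example Redis) rather than an in-process array.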

Key Questions

Use Case Clarity:
- Is extraction for internal processing (e.g., indexing) or public display (e.g., user-facing articles)?
- Are there legal/ethical constraints (e.g., robots.txt, copyright)?
Performance:
- Expected scale (e.g., 100 vs. 10,000 URLs/day)? Queue depth and caching strategy needed.
- Will parallel requests be required? (Use Laravel's job batching via Bus::batch, or spatie/async.)
Error Handling:
- How to handle failed extractions? (Log? Retry? Fallback to raw HTML?)
- Need custom error messages (e.g., for user notifications)?
Extensibility:
- Will custom site configs be needed? (Host on S3/DB for dynamic updates?)
- Require post-processing (e.g., NLP, entity extraction)? Run it as a separate queued step after extraction; Spatie packages such as spatie/array-to-xml or spatie/pdf-to-text cover format conversion only, not NLP.
Monitoring:
- Track extraction success rates? Use Laravel’s Horizon for queue metrics.
- Alert on high failure rates? Integrate with laravel-monitor or sentry.

Integration Approach

Stack Fit

Laravel Compatibility:

HTTP Clients: Native support for Guzzle 7 (Laravel’s default) via php-http/guzzle7-adapter.

Service Container: Bind Graby as a singleton with custom configs:

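use Graby\Graby;

// In a service provider's register() method:
$this->app->singleton(Graby::class, function ($app) {
    return new Graby([
        // Hosts Graby is allowed to fetch
        'allowed_urls' => ['example.com', 'trusted-site.org'],
        // Read from config/graby.php; env() returns null once config is cached
        'debug' => (bool) config('graby.debug', false),
    ]);
});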

Queue Jobs: Wrap extraction in a job (e.g., ExtractContentJob) for async processing.

Database:
- Store extracted content with Eloquent (e.g., an Article model) and index it with Laravel Scout for search.
- Cache responses with Redis or database cache.
Frontend:
- Serve cleaned HTML via Blade templates or API responses (e.g., GET /articles/{id}/content).

Migration Path

Pilot Phase:
- Test with 10–20 target URLs to validate extraction quality.
- Compare against manual checks or existing scrapers (e.g., symfony/dom-crawler).
Incremental Rollout:
- Start with non-critical endpoints (e.g., admin panels for content review).
- Gradually replace legacy scrapers (e.g., file_get_contents hacks).
Fallback Strategy:
- Cache raw HTML as fallback if extraction fails.
- Implement user-triggered retries (e.g., "Retry Extraction" button).
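
The fallback strategy above can be sketched as a small wrapper that returns extracted HTML when extraction succeeds and the cached raw HTML otherwise. Everything here is illustrative: `extractWithFallback`, the `$extract` callable, and the `html`/`source` result keys are hypothetical and not part of Graby.

```php
<?php
declare(strict_types=1);

/**
 * Try the extractor first; if it throws or returns no usable HTML,
 * fall back to the raw page that was cached earlier.
 *
 * @param callable(string, string): array $extract stand-in for the real extraction call
 * @return array{html: string, source: string}
 */
function extractWithFallback(callable $extract, string $url, string $rawHtml): array
{
    try {
        $result = $extract($url, $rawHtml);
        $html = is_array($result) ? trim((string) ($result['html'] ?? '')) : '';

        if ($html !== '') {
            return ['html' => $html, 'source' => 'extracted'];
        }
    } catch (Throwable $e) {
        // Extraction failed; a real app would log $e before falling back.
    }

    return ['html' => $rawHtml, 'source' => 'raw'];
}

// Usage with stand-in extractors.
$raw = '<html><body><p>Hello</p></body></html>';

$ok = extractWithFallback(
    fn (string $url, string $html): array => ['html' => '<p>Hello</p>'],
    'https://example.com/a',
    $raw
);

$failed = extractWithFallback(
    function (string $url, string $html): array {
        throw new RuntimeException('extraction failed');
    },
    'https://example.com/b',
    $raw
);

echo $ok['source'], ' ', $failed['source'], PHP_EOL; // prints "extracted raw"
```

Keeping the `source` flag alongside the HTML lets the frontend show a notice when raw content is being displayed, and gives a "Retry Extraction" action something to key on.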

Compatibility

Laravel Versions:
- Laravel 10+: Use Graby 3.x (PHP 8.2+).
- Laravel 8/9: Use Graby 2.x (PHP 7.4+).
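
If both Graby lines have to be supported during a migration, the calling code may need to cope with two result shapes. The sketch below assumes that 2.x returns an associative array with `title` and `html` keys, while 3.x returns an object exposing `getTitle()` and `getHtml()` as used elsewhere in this document. Both shapes should be verified against the installed version, and `normalizeResult` is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Reduce either result shape to a plain ['title' => ..., 'html' => ...] array.
 * Array input is the assumed 2.x shape; object input is the assumed 3.x shape.
 *
 * @return array{title: string, html: string}
 */
function normalizeResult(array|object $result): array
{
    if (is_array($result)) {
        return [
            'title' => (string) ($result['title'] ?? ''),
            'html' => (string) ($result['html'] ?? ''),
        ];
    }

    return [
        'title' => (string) $result->getTitle(),
        'html' => (string) $result->getHtml(),
    ];
}

// Stand-ins for the two shapes; neither is a real Graby result.
$arrayStyle = ['title' => 'Hello', 'html' => '<p>Body</p>'];

$objectStyle = new class {
    public function getTitle(): string
    {
        return 'Hello';
    }

    public function getHtml(): string
    {
        return '<p>Body</p>';
    }
};

var_dump(normalizeResult($arrayStyle) === normalizeResult($objectStyle)); // bool(true)
```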

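HTTP Headers:

Laravel HTTP middleware only sees requests coming into the application, so it cannot change the headers Graby sends when it fetches a page. Set the outgoing User-Agent through Graby's http_client options instead; other headers (e.g., Accept-Language) can be added by configuring the HTTP client passed to Graby:

$graby = new Graby([
    'http_client' => [
        // User-Agent sent with outgoing fetches
        'ua_browser' => config('graby.http_client.ua_browser'),
    ],
]);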

CORS/Proxy:
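- CORS is enforced by browsers and does not apply to server-side fetches, so Graby needs no CORS workaround. Route requests through an outbound proxy or serverless functions (e.g., AWS Lambda) only if egress needs to be isolated or spread across IPs.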

Sequencing

Setup:
- Install dependencies:
```
composer require j0k3r/graby php-http/guzzle7-adapter
```
- Configure config/graby.php (merge defaults with app-specific settings).

Core Integration:

Create a service class to wrap Graby (e.g., app/Services/ContentExtractor.php):

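public function extract(string $url): Article {
    // Resolve the shared Graby instance from the container and fetch the page.
    // The getters below assume fetchContent() returns a content object.
    $result = app(Graby::class)->fetchContent($url);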
    return Article::create([
        'title' => $result->getTitle(),
        'content' => $result->getHtml(),
        // ...
    ]);
}

Async Processing:

Dispatch jobs for bulk extraction:

ExtractContentJob::dispatch($url)->onQueue('scraping');

Monitoring:
- Log extraction metrics (e.g., success rate, duration) via Laravel’s logging channels.
- Set up alerts for failures (e.g., >5% error rate).
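
The alerting rule can be sketched as a sliding window over recent extraction outcomes. This is an illustration only: `FailureRateMonitor` is a hypothetical class, the window size is arbitrary, and the 5% threshold simply mirrors the example figure above.

```php
<?php
declare(strict_types=1);

/**
 * Tracks the most recent extraction outcomes and reports whether the
 * failure rate within that window exceeds a threshold (illustrative only).
 */
final class FailureRateMonitor
{
    /** @var list<bool> true = success, false = failure */
    private array $outcomes = [];

    public function __construct(
        private int $windowSize = 100,
        private float $threshold = 0.05,
    ) {
    }

    public function record(bool $success): void
    {
        $this->outcomes[] = $success;

        // Drop the oldest outcome once the window is full.
        if (count($this->outcomes) > $this->windowSize) {
            array_shift($this->outcomes);
        }
    }

    public function failureRate(): float
    {
        $total = count($this->outcomes);
        if ($total === 0) {
            return 0.0;
        }

        $failures = count(array_filter($this->outcomes, fn (bool $ok): bool => !$ok));

        return $failures / $total;
    }

    public function shouldAlert(): bool
    {
        return $this->failureRate() > $this->threshold;
    }
}

// Usage: 18 successes followed by 2 failures gives a 10% failure rate.
$monitor = new FailureRateMonitor(windowSize: 20, threshold: 0.05);
for ($i = 0; $i < 18; $i++) {
    $monitor->record(true);
}
$monitor->record(false);
$monitor->record(false);

var_dump($monitor->shouldAlert()); // bool(true)
```

In production the outcomes would live in shared storage so that all queue workers contribute to the same window.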

Operational Impact

Maintenance

Configuration Management:
- Site-Specific Rules: Store site_config files in Laravel’s storage/app/site_configs/ or a database table for dynamic updates.
- Environment-Specific Configs: Use Laravel’s environment configs (e.g., .env) to toggle features like debug or xss_filter.
Dependency Updates:
- Monitor Graby releases and HTTPlug adapters for breaking changes.
- Test upgrades in a staging environment with a subset of URLs.

Logging:

Route Graby logs to Laravel’s Monolog (e.g., single or daily handlers).

Example channel definition (under channels in config/logging.php):

'graby' => [
    'driver' => 'single',
    'path' => storage_path('logs/graby.log'),
    'level' => 'debug',
],

Support

Common Issues:
- Failed Extractions: Debug with log_level: debug to inspect HTML at each step.
- Relative URLs: Ensure rewrite_relative_urls: true is set.
- Dynamic Content: Pre-fetch the rendered HTML with a headless browser (e.g., Puppeteer) and pass it to Graby.
User Support:
- Provide fallback content (e.g., "Extraction failed; showing raw page").
- Offer manual override for critical articles (e.g., admin panel to force-retrieve).

Scaling

Horizontal Scaling:
- Use Laravel Queues (Redis/Database) to distribute
