Graby Laravel Package

j0k3r/graby

Graby extracts clean article content from web pages. Built on php-readability and FiveFilters site_config patterns, it’s a composer-friendly, decoupled, fully tested fork of Full-Text RSS. Requires PHP 8.2+, Tidy and cURL.

Technical Evaluation

Architecture Fit

  • Content Extraction Use Case: Graby excels at article/content extraction from web pages, aligning well with Laravel-based applications requiring web scraping, RSS aggregation, or content syndication (e.g., news platforms, aggregators, or SEO tools).
  • Decoupled Design: Leverages HTTPlug for HTTP clients (Guzzle/cURL), making it vendor-agnostic and compatible with Laravel’s HTTP stack (e.g., GuzzleHttp).
  • Site-Specific Configs: Supports custom site patterns (via site_config), enabling fine-grained control over extraction logic for known domains (e.g., WordPress, Blogger).
  • Laravel Integration Points:
    • Queueable: Fetching content can be asynchronous (e.g., via Laravel Queues) to avoid timeouts.
    • Caching: Results can be cached (e.g., Illuminate\Cache) to reduce redundant fetches.
    • Logging: Built-in Monolog support for debugging extraction failures.
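The caching point above can be sketched in plain PHP, outside Laravel. Everything named here is a hypothetical stand-in: `CachedExtractor` is not part of Graby, the `$fetch` callable represents whatever code calls Graby, and the in-memory array represents a real cache store such as a Laravel cache driver.

```php
<?php

// Hypothetical sketch: memoize extraction results per URL with a TTL.
// $fetch stands in for a call into Graby; the array stands in for a cache store.
final class CachedExtractor
{
    /** @var array<string, array{expires: int, value: array}> */
    private array $store = [];

    /** @var callable(string): array */
    private $fetch;

    /** @var callable(): int */
    private $clock;

    public function __construct(callable $fetch, private int $ttlSeconds = 3600, ?callable $clock = null)
    {
        $this->fetch = $fetch;
        // The clock is injectable so expiry can be exercised without sleeping.
        $this->clock = $clock ?? static fn (): int => time();
    }

    public function extract(string $url): array
    {
        // Hash the URL so the key length stays bounded.
        $key = 'graby:' . sha1($url);
        $now = ($this->clock)();

        if (isset($this->store[$key]) && $this->store[$key]['expires'] > $now) {
            return $this->store[$key]['value'];
        }

        $value = ($this->fetch)($url);
        $this->store[$key] = ['expires' => $now + $this->ttlSeconds, 'value' => $value];

        return $value;
    }
}
```

Keying on a hash of the full URL means two URLs that differ only in tracking parameters are cached separately; normalizing the URL first is a design choice left open here.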

Integration Feasibility

  • Low Friction: Composer-based installation with zero Laravel-specific dependencies (beyond HTTP clients).
  • Prefetching: Supports pre-fetched HTML (useful for Laravel’s Blade/Views or API responses).
  • HTML Cleanup: Can sanitize/clean extracted content (e.g., for storage or display).
  • Configuration Override: Laravel’s service container can inject custom configs (e.g., allowed_urls, blocked_urls) via bindings.
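As a rough illustration of the configuration override, the sketch below overlays app-specific settings on a set of defaults before they would be handed to Graby. `mergeExtractorConfig` is a hypothetical helper, the values are illustrative rather than Graby's shipped defaults, and `default_referer` is an assumed key name used only to show nested merging.

```php
<?php

// Hypothetical helper: overlay app-specific settings on a set of defaults.
// Associative arrays are merged key by key; lists (e.g. allowed_urls) are
// replaced wholesale so a shorter override does not inherit stale entries.
function mergeExtractorConfig(array $defaults, array $overrides): array
{
    foreach ($overrides as $key => $value) {
        $bothMaps = is_array($value)
            && isset($defaults[$key])
            && is_array($defaults[$key])
            && !array_is_list($value)
            && !array_is_list($defaults[$key]);

        $defaults[$key] = $bothMaps
            ? mergeExtractorConfig($defaults[$key], $value)
            : $value;
    }

    return $defaults;
}

// Illustrative values only; these are not Graby's shipped defaults.
$defaults = [
    'debug' => false,
    'allowed_urls' => [],
    'blocked_urls' => ['ads.example', 'tracker.example'],
    'http_client' => [
        'ua_browser' => 'Mozilla/5.0',
        'default_referer' => 'https://www.google.com/', // assumed key name
    ],
];

$config = mergeExtractorConfig($defaults, [
    'blocked_urls' => ['spam.example'],
    'http_client' => ['ua_browser' => 'MyAggregator/1.0'],
]);
```

A custom merge is used instead of `array_replace_recursive` because that function merges lists by index, which would leave `tracker.example` in `blocked_urls` after the override above.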

Technical Risk

  • PHP 8.2 Requirement: The current release will not install on applications running a PHP version older than 8.2. Mitigation: Use Graby 2.x (stable branch) if needed.
  • Dependency Conflicts: php-http/guzzle7-adapter could clash with the Guzzle version the application already requires. Solution: Check the constraint with composer why-not before installing, or use php-http/curl-client instead.
  • Rate Limiting: No built-in throttling; risk of IP bans if scraping aggressively. Mitigation: Implement Laravel middleware or queue delays.
  • Dynamic Content: May fail on JavaScript-rendered pages (e.g., SPAs). Workaround: Render the page in a headless browser (e.g., Puppeteer or Laravel Dusk) and hand the resulting HTML to Graby as pre-fetched content.
  • Maintenance: Fork is active (last release 2026), but no Laravel-specific docs. Risk: Undocumented edge cases (e.g., CSRF tokens, auth headers).
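The queue-delay mitigation for rate limiting could look roughly like this. `HostThrottle` is a hypothetical class, not something Graby or Laravel provides, and the 5-second interval is an arbitrary example value. It only computes how long each URL should wait; actually delaying the work is left to the caller.

```php
<?php

// Hypothetical sketch: space out requests to the same host.
// nextDelay() returns how many seconds a job for $url should wait so that
// consecutive requests to one host are at least $minInterval seconds apart.
final class HostThrottle
{
    /** @var array<string, int> host => timestamp of the next free slot */
    private array $nextSlot = [];

    public function __construct(private int $minInterval = 5)
    {
    }

    public function nextDelay(string $url, int $now): int
    {
        $host = strtolower((string) parse_url($url, PHP_URL_HOST));

        // Take the later of "now" and the host's next free slot, then reserve it.
        $slot = max($now, $this->nextSlot[$host] ?? $now);
        $this->nextSlot[$host] = $slot + $this->minInterval;

        return $slot - $now;
    }
}

$throttle = new HostThrottle(5);
$now = 1_700_000_000;
$delays = [];

foreach ([
    'https://example.com/a',
    'https://example.com/b',
    'https://other.example/x',
    'https://example.com/c',
] as $url) {
    $delays[$url] = $throttle->nextDelay($url, $now);
}
// example.com URLs get 0, 5 and 10 seconds; other.example gets 0.
```

In a Laravel application the returned value could be passed to a queued job's delay when dispatching, e.g. ExtractContentJob::dispatch($url)->delay($seconds). State kept in a PHP array is per-process, so a shared store would be needed once several workers dispatch jobs.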

Key Questions

  1. Use Case Clarity:
    • Is extraction for internal processing (e.g., indexing) or public display (e.g., user-facing articles)?
    • Are there legal/ethical constraints (e.g., robots.txt, copyright)?
  2. Performance:
    • Expected scale (e.g., 100 vs. 10,000 URLs/day)? Queue depth and caching strategy needed.
    • Will parallel requests be required? (Use Laravel's job batching via Bus::batch, or spatie/async.)
  3. Error Handling:
    • How to handle failed extractions? (Log? Retry? Fallback to raw HTML?)
    • Need custom error messages (e.g., for user notifications)?
  4. Extensibility:
    • Will custom site configs be needed? (Host on S3/DB for dynamic updates?)
    • Require post-processing (e.g., NLP, entity extraction)? Run it as a separate step after extraction (e.g., a follow-up queued job) so a slow or failing processor does not block fetching.
  5. Monitoring:
    • Track extraction success rates? Use Laravel’s Horizon for queue metrics.
    • Alert on high failure rates? Integrate with laravel-monitor or sentry.

Integration Approach

Stack Fit

  • Laravel Compatibility:
    • HTTP Clients: Native support for Guzzle 7 (Laravel’s default) via php-http/guzzle7-adapter.
    • Service Container: Bind Graby as a singleton with custom configs:
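      // In a service provider's register() method; requires "use Graby\Graby;".
      $this->app->singleton(Graby::class, function ($app) {
          return new Graby([
              // Restrict fetching to these sites.
              'allowed_urls' => ['example.com', 'trusted-site.org'],
              // Read via config() so the value survives config caching;
              // assumes config/graby.php maps 'debug' to env('GRABY_DEBUG', false).
              'debug' => config('graby.debug', false),
          ]);
      });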
      
    • Queue Jobs: Wrap extraction in a job (e.g., ExtractContentJob) for async processing.
    
  • Database:
    • Store extracted content in Laravel Eloquent (e.g., Article model) or Laravel Scout for search.
    • Cache responses with Redis or database cache.
  • Frontend:
    • Serve cleaned HTML via Blade templates or API responses (e.g., GET /articles/{id}/content).

Migration Path

  1. Pilot Phase:
    • Test with 10–20 target URLs to validate extraction quality.
    • Compare against manual checks or existing scrapers (e.g., symfony/dom-crawler).
  2. Incremental Rollout:
    • Start with non-critical endpoints (e.g., admin panels for content review).
    • Gradually replace legacy scrapers (e.g., file_get_contents hacks).
  3. Fallback Strategy:
    • Cache raw HTML as fallback if extraction fails.
    • Implement user-triggered retries (e.g., "Retry Extraction" button).
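The fallback strategy above might be sketched as follows. `extractWithFallback` and both callables are hypothetical: `$extract` stands in for the code that calls Graby and returns extracted HTML, and `$rawHtml` stands in for a lookup of the cached raw page.

```php
<?php

// Hypothetical sketch of the fallback: try extraction, and if it throws or
// returns empty content, serve the cached raw HTML flagged as a fallback.
function extractWithFallback(string $url, callable $extract, callable $rawHtml): array
{
    try {
        $html = trim((string) $extract($url));
        if ($html !== '') {
            return ['content' => $html, 'fallback' => false];
        }
    } catch (Throwable $e) {
        // Fall through to the raw copy; a real job would also log $e here.
    }

    return ['content' => (string) $rawHtml($url), 'fallback' => true];
}

$result = extractWithFallback(
    'https://example.com/article',
    static function (string $url): string {
        throw new RuntimeException('extraction failed');
    },
    static fn (string $url): string => '<html><body>raw page</body></html>'
);
// $result['fallback'] is true and $result['content'] holds the raw page.
```

Returning the `fallback` flag alongside the content lets the UI decide whether to show a "Retry Extraction" control. Raw HTML served this way has not been cleaned, so it would need sanitizing before display.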

Compatibility

  • Laravel Versions:
    • Laravel 10+: Use Graby 3.x (PHP 8.2+).
    • Laravel 8/9: Use Graby 2.x (PHP 7.4+).
  • HTTP Headers:
    • Laravel HTTP middleware only sees requests coming into the application, so it cannot change what Graby sends. Set outgoing headers (e.g., User-Agent) through Graby's http_client config instead:
      $graby = new Graby([
          'http_client' => [
              'ua_browser' => config('graby.http_client.ua_browser'),
          ],
      ]);
      
  • CORS/Proxy:
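    • CORS is enforced by browsers and does not apply to server-side fetching, so Graby needs no workaround for cross-domain URLs. An outbound proxy or serverless function (e.g., AWS Lambda) is only worth considering to work around IP blocks or regional restrictions.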

Sequencing

  1. Setup:
    • Install dependencies:
      composer require j0k3r/graby php-http/guzzle7-adapter
      
    • Configure config/graby.php (merge defaults with app-specific settings).
  2. Core Integration:
    • Create a service class to wrap Graby (e.g., app/Services/ContentExtractor.php):
      public function extract(string $url): Article {
          $result = app(Graby::class)->fetchContent($url);
          return Article::create([
              'title' => $result->getTitle(),
              'content' => $result->getHtml(),
              // ...
          ]);
      }
      
  3. Async Processing:
    • Dispatch jobs for bulk extraction:
      ExtractContentJob::dispatch($url)->onQueue('scraping');
      
  4. Monitoring:
    • Log extraction metrics (e.g., success rate, duration) via Laravel’s logging channels.
    • Set up alerts for failures (e.g., >5% error rate).
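The alerting rule in the monitoring step (for example, more than 5% failures) can be sketched with a sliding window over recent outcomes. `FailureRateMonitor` is a hypothetical class; the window size of 100 and the 0.05 threshold are example values, and how the alert is delivered is left out.

```php
<?php

// Hypothetical sketch: track outcomes over a sliding window of the most
// recent N extractions and report when the failure rate passes a threshold.
final class FailureRateMonitor
{
    /** @var list<bool> true = success, false = failure */
    private array $window = [];

    public function __construct(private int $size = 100, private float $threshold = 0.05)
    {
    }

    public function record(bool $success): void
    {
        $this->window[] = $success;

        // Drop the oldest outcome once the window is full.
        if (count($this->window) > $this->size) {
            array_shift($this->window);
        }
    }

    public function failureRate(): float
    {
        if ($this->window === []) {
            return 0.0;
        }

        $failures = count(array_filter($this->window, static fn (bool $ok): bool => !$ok));

        return $failures / count($this->window);
    }

    public function shouldAlert(): bool
    {
        return $this->failureRate() > $this->threshold;
    }
}

$monitor = new FailureRateMonitor(100, 0.05);
for ($i = 0; $i < 96; $i++) {
    $monitor->record(true);
}
for ($i = 0; $i < 4; $i++) {
    $monitor->record(false);
}
$alertAtFourPercent = $monitor->shouldAlert(); // 4 failures in 100: below threshold
```

A count-based window reacts the same way at 100 URLs/day and 10,000 URLs/day; a time-based window would be the alternative if alerts should reflect a fixed period instead.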

Operational Impact

Maintenance

  • Configuration Management:
    • Site-Specific Rules: Store site_config files in Laravel’s storage/app/site_configs/ or a database table for dynamic updates.
    • Environment-Specific Configs: Use Laravel’s environment configs (e.g., .env) to toggle features like debug or xss_filter.
  • Dependency Updates:
    • Monitor Graby releases and HTTPlug adapters for breaking changes.
    • Test upgrades in a staging environment with a subset of URLs.
  • Logging:
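    • Route Graby logs to a dedicated Laravel log channel (e.g., a single or daily channel) and pass that channel's logger to Graby with setLogger().
    • Example channel definition in config/logging.php: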
      'graby' => [
          'driver' => 'single',
          'path' => storage_path('logs/graby.log'),
          'level' => 'debug',
      ],
      

Support

  • Common Issues:
    • Failed Extractions: Enable debug: true and set log_level: debug to inspect the HTML at each step.
    • Relative URLs: Ensure rewrite_relative_urls: true is set.
    • Dynamic Content: Pre-fetch with a headless browser (e.g., Puppeteer or Laravel Dusk).
  • User Support:
    • Provide fallback content (e.g., "Extraction failed; showing raw page").
    • Offer manual override for critical articles (e.g., admin panel to force-retrieve).

Scaling

  • Horizontal Scaling:
    • Use Laravel Queues (Redis/Database) to distribute