j0k3r/graby
Graby extracts clean article content from web pages. Built on php-readability and FiveFilters site_config patterns, it’s a composer-friendly, decoupled, fully tested fork of Full-Text RSS. Requires PHP 8.2+, Tidy and cURL.
allowed_urls/blocked_urls to avoid scraping restricted sites.
xss_filter) and error handling to ensure clean, usable output.
GrabyHandler or tweaking extraction rules via site_config).
Adopt if:
Look elsewhere if:
readability-lxml (Python) or cheerio (Node.js).
*"Graby is a battle-tested, MIT-licensed PHP package that solves a critical pain point for our [content aggregation/SEO analysis/read-later] product: extracting clean, structured article content from the web at scale. Instead of building a custom scraper (which would require months of dev effort and ongoing maintenance), we can leverage this open-source, well-documented tool to:
Risk: Minimal. We can start with a pilot (e.g., extracting 10K articles/month) and scale from there. Competitors like [Product X] use similar tools, so we’re not at a disadvantage."*
*"Graby is a drop-in PHP library that replaces manual HTML parsing or fragile regex-based extraction. Here’s why it’s a win:
site_config (e.g., handle WordPress/Blogger sites differently) or override defaults (timeouts, allowed domains).
Monolog support), customize output, or pre-process HTML before extraction.
Trade-offs:
Proposal:
site_config rules.
Alternatives considered: