j0k3r/graby
Graby extracts clean article content from web pages. Built on php-readability and FiveFilters site_config patterns, it’s a Composer-friendly, decoupled, fully tested fork of Full-Text RSS. Requires PHP 8.2+ with the Tidy and cURL extensions.
Installation:
composer require j0k3r/graby dev-master php-http/guzzle7-adapter
Ensure the Tidy and cURL extensions are enabled in your PHP environment.
Basic Usage:
use Graby\Graby;
$graby = new Graby();
$result = $graby->fetchContent('https://example.com/article-url');
echo $result->getHtml(); // Extracted article content
echo $result->getTitle(); // Article title
First Use Case: Extract article content from a URL for a news aggregator or content scraping tool. Example:
$articleUrl = 'https://bbc.com/news/entertainment-arts-32547474';
$graby = new Graby();
$content = $graby->fetchContent($articleUrl);
// storeArticle() stands in for your application's own persistence method.
$this->storeArticle($content->getHtml(), $content->getTitle());
Fetching and Storing Articles:
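$graby = new Graby();
// Placeholder values; replace with real article URLs.
$urls = ['url1', 'url2', 'url3'];
foreach ($urls as $url) {
    $result = $graby->fetchContent($url);
    // saveToDatabase() stands in for your application's own persistence method.
    $this->saveToDatabase($result);
}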
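A long-running loop is easier to operate when one failing URL does not abort the rest. The sketch below is one way to arrange that and is not part of Graby: `fetchAll` is a hypothetical helper that accepts any callable and collects results and failures separately. It assumes failures surface as exceptions, which may not hold for every error Graby encounters.

```php
<?php

/**
 * Hypothetical helper (not part of Graby): run $fetch for each URL and keep
 * going when a single URL fails.
 *
 * @param callable(string): mixed $fetch e.g. fn (string $url) => $graby->fetchContent($url)
 * @param string[]                $urls
 *
 * @return array{results: array<string, mixed>, errors: array<string, string>}
 */
function fetchAll(callable $fetch, array $urls): array
{
    $results = [];
    $errors = [];

    // Skip duplicates so the same page is not fetched twice.
    foreach (array_unique($urls) as $url) {
        try {
            $results[$url] = $fetch($url);
        } catch (\Throwable $e) {
            // Record the failure and move on to the next URL.
            $errors[$url] = $e->getMessage();
        }
    }

    return ['results' => $results, 'errors' => $errors];
}
```

With Graby, the callable could be `fn (string $url) => $graby->fetchContent($url)`; the `errors` map can then be logged or retried later.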
Handling Prefetched Content:
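$graby = new Graby();
// fetchHtmlFromCache() stands in for however your application obtained the HTML.
$preFetchedHtml = $this->fetchHtmlFromCache('url');
// Hand the HTML to Graby so the next fetchContent() call extracts from it.
$graby->setContentAsPrefetched($preFetchedHtml);
$result = $graby->fetchContent('url');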
Cleaning Up HTML:
$graby = new Graby();
$rawHtml = $this->getRawHtmlFromSource();
$cleanedHtml = $graby->cleanupHtml($rawHtml, 'https://example.com');
Queue Jobs for Large-Scale Scraping: Use Laravel's queue system to process multiple URLs asynchronously:
foreach ($urls as $url) {
    ScrapeArticleJob::dispatch($url)->onQueue('scrape');
}
Logging and Debugging: Configure Graby to log detailed information for debugging:
$graby = new Graby([
    'debug' => true,
    'log_level' => 'debug',
]);
Custom HTTP Client Configuration: For advanced use cases, configure the HTTP client manually:
use GuzzleHttp\Client as GuzzleClient;
use Http\Adapter\Guzzle7\Client as GuzzleAdapter;
$guzzle = new GuzzleClient(['timeout' => 5]);
$graby = new Graby([], new GuzzleAdapter($guzzle));
Handling Errors Gracefully: Check the response status and handle errors appropriately:
$result = $graby->fetchContent($url);
if ($result->getResponse()->getStatus() !== 200) {
    $this->handleError($result);
}
URL Filtering:
Use allowed_urls and blocked_urls to restrict which URLs Graby will fetch. Each entry is a URL or part of a URL; when allowed_urls is empty, every URL is permitted except those matching blocked_urls:
$graby = new Graby([
    'allowed_urls' => ['example.com', 'trusted-site.org'],
    'blocked_urls' => ['blocked-site.com'],
]);
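To reason about which URLs a given pair of lists would let through, it can help to model the matching in isolation. The function below is a simplified model and not Graby's code: `isUrlPermitted` is a hypothetical name, and both the case-insensitive substring matching and the precedence of a non-empty allow list are assumptions to verify against the version you run.

```php
<?php

/**
 * Simplified model of allow/block matching (an assumption, not Graby's code):
 * entries are matched as case-insensitive substrings of the URL.
 *
 * @param string[] $allowed
 * @param string[] $blocked
 */
function isUrlPermitted(string $url, array $allowed, array $blocked): bool
{
    $matchesAny = static function (array $needles) use ($url): bool {
        foreach ($needles as $needle) {
            if ($needle !== '' && stripos($url, $needle) !== false) {
                return true;
            }
        }

        return false;
    };

    // Assumption: a non-empty allow list takes precedence over the block list.
    if ($allowed !== []) {
        return $matchesAny($allowed);
    }

    // Otherwise everything is permitted except URLs matching the block list.
    return !$matchesAny($blocked);
}
```

Under this model, an entry such as `example.com` also matches `notexample.com`, so more specific entries (including the scheme or a leading dot) are safer when the lists act as a security boundary.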
Relative URLs:
If rewrite_relative_urls is set to false, ensure your application handles relative URLs in the extracted content:
$graby = new Graby(['rewrite_relative_urls' => false]);
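If the application has to resolve relative links itself, a small resolver along these lines may be enough for common cases. This is a sketch, not part of Graby: `resolveUrl` is a hypothetical helper that handles absolute, protocol-relative, root-relative, and path-relative references, and it does not cover every corner of RFC 3986 (for example, a trailing `..` segment loses its final slash).

```php
<?php

/**
 * Hypothetical helper: resolve $href against $base (simplified RFC 3986 handling).
 */
function resolveUrl(string $base, string $href): string
{
    if ($href === '') {
        return $base;
    }

    // Anything with a scheme (https:, mailto:, data:) is already absolute.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $href) === 1) {
        return $href;
    }

    $parts = parse_url($base);
    $scheme = $parts['scheme'] ?? 'https';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';

    if (str_starts_with($href, '//')) {
        return $scheme . ':' . $href; // protocol-relative
    }
    if (str_starts_with($href, '/')) {
        return $origin . $href; // root-relative
    }
    if ($href[0] === '?') {
        return $origin . $basePath . $href; // query-only
    }
    if ($href[0] === '#') {
        $query = isset($parts['query']) ? '?' . $parts['query'] : '';

        return $origin . $basePath . $query . $href; // fragment-only
    }

    // Path-relative: start from the directory of the base path.
    $dir = substr($basePath, 0, (int) strrpos($basePath, '/') + 1);
    $cut = strcspn($href, '?#'); // length of the path part of $href

    $segments = [];
    foreach (explode('/', $dir . substr($href, 0, $cut)) as $segment) {
        if ($segment === '..') {
            // Never pop the leading empty segment that represents the root.
            if (count($segments) > 1) {
                array_pop($segments);
            }
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }

    return $origin . implode('/', $segments) . substr($href, $cut);
}
```

Applying it means walking the `href` and `src` attributes of the extracted HTML and passing each value through the resolver together with the article URL.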
XSS Filtering:
When xss_filter is enabled, Graby sanitizes the extracted HTML, which strips elements such as iframes. Disable it only if you need to preserve them and sanitize the output yourself:
$graby = new Graby(['xss_filter' => false]);
Timeouts: Unless you configure the HTTP client yourself, requests use that client's default timeout, which may not suit every site. Set one explicitly:
$guzzle = new GuzzleClient(['timeout' => 10]);
$graby = new Graby([], new GuzzleAdapter($guzzle));
Multi-page Articles:
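Two options control multi-page articles: singlepage follows single-page links (such as a print view) and multipage follows next-page links. Keep both enabled if you expect multi-page articles:
$graby = new Graby([
    'multipage' => true,
    'singlepage' => true,
]);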
Log Files:
Enable debugging to generate detailed logs in log/graby.log and log/html.log:
$graby = new Graby(['debug' => true, 'log_level' => 'debug']);
Monolog Integration: Use the GrabyHandler to log output in Symfony projects:
services:
    graby.log_handler:
        class: Graby\Monolog\Handler\GrabyHandler
Check Response Status: Always verify the response status to handle errors:
if ($result->getResponse()->getStatus() !== 200) {
    Log::error('Failed to fetch URL: ' . $result->getResponse()->getEffectiveUri());
}
Custom Site Configs: Add custom site configurations to handle specific websites:
$graby = new Graby([
    'extractor' => [
        'config_builder' => [
            'site_config' => [__DIR__ . '/custom-site-configs'],
        ],
    ],
]);
Custom Filters:
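Use pre_filters and post_filters to clean or modify HTML content. Each is a map of regular expression to replacement; pre_filters apply to the HTML before extraction and post_filters to the extracted content:
$graby = new Graby([
    'extractor' => [
        'readability' => [
            'pre_filters' => ['/<script.*?>.*?<\/script>/is' => ''],
            'post_filters' => ['/<style.*?>.*?<\/style>/is' => ''],
        ],
    ],
]);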
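Filter patterns are easier to get right when tried against sample HTML before they go into the configuration. The sketch below assumes the pairs behave like successive `preg_replace` calls; `applyFilters` is a hypothetical helper for experimentation, not Graby's internal code.

```php
<?php

/**
 * Apply a map of regex pattern => replacement to $html, in order.
 * Assumption: this mirrors preg_replace semantics; it is not Graby's code.
 *
 * @param array<string, string> $filters
 */
function applyFilters(string $html, array $filters): string
{
    foreach ($filters as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $html);
        if ($result === null) {
            // preg_replace returns null when the pattern is invalid or matching fails.
            throw new InvalidArgumentException("Filter failed for pattern: {$pattern}");
        }
        $html = $result;
    }

    return $html;
}
```

Running the two patterns from the configuration above over a small fragment shows that both the `<script>` and `<style>` blocks are removed while the surrounding markup is left alone.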
Custom HTTP Headers: Add custom headers to the HTTP client for specific use cases:
$guzzle = new GuzzleClient([
    'headers' => [
        'User-Agent' => 'CustomUserAgent/1.0',
        'Accept-Language' => 'en-US',
    ],
]);
$graby = new Graby([], new GuzzleAdapter($guzzle));