Getting Started

Minimal Steps

Installation:
```
composer require j0k3r/graby:dev-master php-http/guzzle7-adapter
```
Make sure the Tidy and cURL PHP extensions are enabled.

Basic Usage:

```
use Graby\Graby;

$graby = new Graby();
$result = $graby->fetchContent('https://example.com/article-url');

echo $result->getHtml(); // Extracted article content
echo $result->getTitle(); // Article title
```

First Use Case: Extract article content from a URL for a news aggregator or content scraping tool. Example:

```
$articleUrl = 'https://bbc.com/news/entertainment-arts-32547474';
$graby = new Graby();
$content = $graby->fetchContent($articleUrl);
$this->storeArticle($content->getHtml(), $content->getTitle());
```

Implementation Patterns

Common Workflows

Fetching and Storing Articles:

```
$graby = new Graby();
$urls = ['url1', 'url2', 'url3'];

foreach ($urls as $url) {
    $result = $graby->fetchContent($url);
    $this->saveToDatabase($result);
}
```
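
In a batch like this, one failing URL should not abort the remaining ones. The following is a minimal sketch, assuming the fetch step is passed in as a callable so the loop can be tried without network access. `fetchAll` and the stub fetcher are hypothetical names and not part of Graby; in real use the callable would wrap `$graby->fetchContent($url)`.

```php
<?php

/**
 * Run $fetch over each URL, collecting results and errors separately.
 * Hypothetical helper: $fetch stands in for a call such as
 * fn (string $url) => $graby->fetchContent($url).
 *
 * @param string[] $urls
 * @return array{results: array<string, mixed>, errors: array<string, string>}
 */
function fetchAll(array $urls, callable $fetch): array
{
    $results = [];
    $errors = [];

    foreach ($urls as $url) {
        try {
            $results[$url] = $fetch($url);
        } catch (\Throwable $e) {
            // Record the failure and continue with the remaining URLs.
            $errors[$url] = $e->getMessage();
        }
    }

    return ['results' => $results, 'errors' => $errors];
}

// Stub fetcher used only for illustration: it fails on one URL.
$stub = function (string $url): string {
    if ($url === 'https://example.com/broken') {
        throw new \RuntimeException('fetch failed');
    }

    return 'content of ' . $url;
};

$batch = fetchAll(['https://example.com/a', 'https://example.com/broken'], $stub);
```

Keeping errors keyed by URL makes it straightforward to retry or report only the failed entries afterwards.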

Handling Prefetched Content:

```
$graby = new Graby();
$preFetchedHtml = $this->fetchHtmlFromCache('url');
$graby->setContentAsPrefetched($preFetchedHtml);
$result = $graby->fetchContent('url');
```

Cleaning Up HTML:

```
$graby = new Graby();
$rawHtml = $this->getRawHtmlFromSource();
$cleanedHtml = $graby->cleanupHtml($rawHtml, 'https://example.com');
```

Integration Tips

Queue Jobs for Large-Scale Scraping: Use Laravel's queue system to process multiple URLs asynchronously:
```
foreach ($urls as $url) {
    ScrapeArticleJob::dispatch($url)->onQueue('scrape');
}
```
Logging and Debugging: Configure Graby to log detailed information for debugging:
```
$graby = new Graby([
    'debug' => true,
    'log_level' => 'debug',
]);
```

Custom HTTP Client Configuration: For advanced use cases, configure the HTTP client manually:

```
use Graby\Graby;
use GuzzleHttp\Client as GuzzleClient;
use Http\Adapter\Guzzle7\Client as GuzzleAdapter;

// Guzzle timeouts are expressed in seconds.
$guzzle = new GuzzleClient(['timeout' => 5]);
$graby = new Graby([], new GuzzleAdapter($guzzle));
```

Handling Errors Gracefully: Check the response status and handle errors appropriately:

```
$result = $graby->fetchContent($url);
if ($result->getResponse()->getStatus() !== 200) {
    $this->handleError($result);
}
```

Gotchas and Tips

Pitfalls

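URL Filtering: allowed_urls and blocked_urls accept URLs or URL fragments. When allowed_urls is non-empty, only matching URLs are fetched and blocked_urls is ignored; blocked_urls applies only while allowed_urls is empty. Set one list or the other, because in a configuration like the following the blocked_urls entry has no effect: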

```
$graby = new Graby([
    'allowed_urls' => ['example.com', 'trusted-site.org'],
    'blocked_urls' => ['blocked-site.com'],
]);
```
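
If an application wants its own exact host check before calling `fetchContent`, a minimal sketch could look like the following. `hostMatches` and `isHostAllowed` are hypothetical helpers, and the rule that a block entry wins over an allow entry is an assumption of this sketch, not a description of how Graby evaluates its own options.

```php
<?php

/**
 * True if $host equals $domain or is a subdomain of it.
 */
function hostMatches(string $host, string $domain): bool
{
    $host = strtolower($host);
    $suffix = '.' . strtolower($domain);

    return $host === strtolower($domain)
        || substr($host, -strlen($suffix)) === $suffix;
}

/**
 * Application-side check (assumption: blocked entries take precedence).
 *
 * @param string[] $allowed
 * @param string[] $blocked
 */
function isHostAllowed(string $url, array $allowed, array $blocked): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // Not an absolute URL with a host.
    }

    foreach ($blocked as $domain) {
        if (hostMatches($host, $domain)) {
            return false;
        }
    }

    if ($allowed === []) {
        return true; // Empty allow list: anything not blocked passes.
    }

    foreach ($allowed as $domain) {
        if (hostMatches($host, $domain)) {
            return true;
        }
    }

    return false;
}

$allowed = ['example.com', 'trusted-site.org'];
$blocked = ['blocked-site.com'];
```

Matching on the parsed host rather than on the whole URL string avoids accepting a URL merely because an allowed name appears somewhere in its path or query.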

Relative URLs: If rewrite_relative_urls is set to false, ensure your application handles relative URLs in the extracted content:
```
$graby = new Graby(['rewrite_relative_urls' => false]);
```
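
With rewriting disabled, the application has to resolve links against the article URL itself. The following is a minimal sketch of such a resolver, assuming simple `href`/`src` values; `resolveUrl` is a hypothetical helper, not part of Graby. It does not treat query strings or fragments specially, and it drops a trailing slash on path-relative links, so a full URI library is the safer choice in production.

```php
<?php

/**
 * Resolve $href against the absolute URL $base (simplified sketch).
 */
function resolveUrl(string $base, string $href): string
{
    // Already absolute (has a scheme such as https: or mailto:).
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href) === 1) {
        return $href;
    }

    $parts = parse_url($base);
    if (!is_array($parts)) {
        $parts = [];
    }
    $scheme = $parts['scheme'] ?? 'https';
    $host = $parts['host'] ?? '';
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    $origin = $scheme . '://' . $host . $port;

    // Protocol-relative: //cdn.example.com/x.js
    if (strncmp($href, '//', 2) === 0) {
        return $scheme . ':' . $href;
    }

    // Root-relative: /static/x.css
    if (strncmp($href, '/', 1) === 0) {
        return $origin . $href;
    }

    // Path-relative: resolve against the directory of the base path.
    $basePath = $parts['path'] ?? '/';
    $dir = substr($basePath, 0, (int) strrpos($basePath, '/') + 1);

    $segments = [];
    foreach (explode('/', $dir . $href) as $segment) {
        if ($segment === '..') {
            array_pop($segments); // Step up one directory.
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }

    return $origin . '/' . implode('/', $segments);
}

$base = 'https://example.com/news/2024/story.html';
```

For example, `resolveUrl($base, '../img/a.png')` walks up one directory from `/news/2024/` before appending `img/a.png`.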
XSS Filtering: The xss_filter option strips certain elements such as iframes. Disable it only if you need to preserve them and sanitize the output yourself before rendering it:
```
$graby = new Graby(['xss_filter' => false]);
```
Timeouts: Without a manually configured HTTP client, requests rely on the client's own defaults. Guzzle's default timeout is 0 (no limit), so a slow or unresponsive site can stall a request indefinitely. Set an explicit timeout in seconds:
```
$guzzle = new GuzzleClient(['timeout' => 10]);
$graby = new Graby([], new GuzzleAdapter($guzzle));
```
Multi-page Articles: Ensure multipage and singlepage are enabled if you expect multi-page articles:
```
$graby = new Graby([
    'multipage' => true,
    'singlepage' => true,
]);
```

Debugging

Log Files: Enable debugging to generate detailed logs in log/graby.log and log/html.log:
```
$graby = new Graby(['debug' => true, 'log_level' => 'debug']);
```

Monolog Integration: Use the GrabyHandler to log output in Symfony projects:

```
services:
    graby.log_handler:
        class: Graby\Monolog\Handler\GrabyHandler
```

Check Response Status: Always verify the response status to handle errors:

```
if ($result->getResponse()->getStatus() !== 200) {
    Log::error('Failed to fetch URL: ' . $result->getResponse()->getEffectiveUri());
}
```

Extension Points

Custom Site Configs: Add custom site configurations to handle specific websites:

```
$graby = new Graby([
    'extractor' => [
        'config_builder' => [
            'site_config' => [__DIR__ . '/custom-site-configs'],
        ],
    ],
]);
```

Custom Filters: Use pre_filters and post_filters to clean or modify HTML content:

```
$graby = new Graby([
    'extractor' => [
        'readability' => [
            'pre_filters' => ['/<script.*?>.*?<\/script>/is' => ''],
            'post_filters' => ['/<style.*?>.*?<\/style>/is' => ''],
        ],
    ],
]);
```
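
Each filter entry above is a regular expression mapped to its replacement. To try patterns out before putting them in the configuration, a minimal sketch such as the following can help. `applyFilters` is a hypothetical helper that only mimics a regex-to-replacement pass with `preg_replace`; it does not call Graby, and it assumes the pairs are applied in order.

```php
<?php

/**
 * Apply each pattern => replacement pair in order.
 * Hypothetical helper for testing filter patterns outside Graby.
 *
 * @param array<string, string> $filters
 */
function applyFilters(string $html, array $filters): string
{
    foreach ($filters as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $html);
        if ($result === null) {
            // preg_replace returns null when the pattern is invalid.
            throw new \InvalidArgumentException('Invalid pattern: ' . $pattern);
        }
        $html = $result;
    }

    return $html;
}

$filters = [
    '/<script.*?>.*?<\/script>/is' => '',
    '/<style.*?>.*?<\/style>/is' => '',
];

$html = '<p>Hello</p><script>alert(1)</script><style>p { color: red; }</style>';
$clean = applyFilters($html, $filters);
```

The `s` modifier lets `.` match newlines, so multi-line script and style blocks are removed as well; the lazy `.*?` keeps each match from running past the first closing tag.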

Custom HTTP Headers: Add custom headers to the HTTP client for specific use cases:

```
$guzzle = new GuzzleClient([
    'headers' => [
        'User-Agent' => 'CustomUserAgent/1.0',
        'Accept-Language' => 'en-US',
    ],
]);
$graby = new Graby([], new GuzzleAdapter($guzzle));
```

Graby Laravel Package