Graby Laravel Package

j0k3r/graby

Graby extracts clean article content from web pages. Built on php-readability and FiveFilters site_config patterns, it’s a Composer-friendly, decoupled, fully tested fork of Full-Text RSS. Requires PHP 8.2+ with the Tidy and cURL extensions.


Getting Started

Minimal Steps

  1. Installation:

    composer require j0k3r/graby:dev-master php-http/guzzle7-adapter
    

    Ensure Tidy and cURL extensions are enabled in your PHP environment.

  2. Basic Usage:

    use Graby\Graby;
    
    $graby = new Graby();
    $result = $graby->fetchContent('https://example.com/article-url');
    
    echo $result->getHtml(); // Extracted article content
    echo $result->getTitle(); // Article title
    
  3. First Use Case: Extract article content from a URL for a news aggregator or content scraping tool. Example:

    $articleUrl = 'https://bbc.com/news/entertainment-arts-32547474';
    $graby = new Graby();
    $content = $graby->fetchContent($articleUrl);
    $this->storeArticle($content->getHtml(), $content->getTitle());
    

Implementation Patterns

Common Workflows

  1. Fetching and Storing Articles:

    $graby = new Graby();
    $urls = ['url1', 'url2', 'url3'];
    
    foreach ($urls as $url) {
        $result = $graby->fetchContent($url);
        $this->saveToDatabase($result);
    }
    
  2. Handling Prefetched Content:

    $graby = new Graby();
    $preFetchedHtml = $this->fetchHtmlFromCache('url');
    $graby->setContentAsPrefetched($preFetchedHtml);
    $result = $graby->fetchContent('url');
    
  3. Cleaning Up HTML:

    $graby = new Graby();
    $rawHtml = $this->getRawHtmlFromSource();
    $cleanedHtml = $graby->cleanupHtml($rawHtml, 'https://example.com');
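
The fetch-and-store loop in workflow 1 stops at the first URL that throws. A minimal sketch of one way to isolate failures per URL is shown below. The helper name fetchAll and the fake fetcher are hypothetical and not part of Graby; in real use the callable would presumably wrap $graby->fetchContent($url).

```php
<?php

/**
 * Run $fetch for every URL and keep going when one of them throws.
 *
 * $fetch is a stand-in for the real call, for example
 * fn (string $url) => $graby->fetchContent($url).
 *
 * @param iterable<string> $urls
 * @return array{results: array<string, mixed>, errors: array<string, string>}
 */
function fetchAll(iterable $urls, callable $fetch): array
{
    $results = [];
    $errors = [];

    foreach ($urls as $url) {
        try {
            $results[$url] = $fetch($url);
        } catch (\Throwable $e) {
            // Record the failure and move on to the next URL.
            $errors[$url] = $e->getMessage();
        }
    }

    return ['results' => $results, 'errors' => $errors];
}

// Fake fetcher (assumption) so the sketch runs without network access.
$fakeFetch = function (string $url): string {
    if (str_contains($url, 'broken')) {
        throw new \RuntimeException('fetch failed');
    }

    return 'content of ' . $url;
};

$outcome = fetchAll(
    ['https://example.com/a', 'https://example.com/broken'],
    $fakeFetch
);
```

Collecting errors by URL keeps the batch running and leaves the decision about retries or logging to the caller.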
    

Integration Tips

  • Queue Jobs for Large-Scale Scraping: Use Laravel's queue system to process multiple URLs asynchronously:

    foreach ($urls as $url) {
        ScrapeArticleJob::dispatch($url)->onQueue('scrape');
    }
    
  • Logging and Debugging: Configure Graby to log detailed information for debugging:

    $graby = new Graby([
        'debug' => true,
        'log_level' => 'debug',
    ]);
    
  • Custom HTTP Client Configuration: For advanced use cases, configure the HTTP client manually:

    use GuzzleHttp\Client as GuzzleClient;
    use Http\Adapter\Guzzle7\Client as GuzzleAdapter;
    
    $guzzle = new GuzzleClient(['timeout' => 5]);
    $graby = new Graby([], new GuzzleAdapter($guzzle));
    
  • Handling Errors Gracefully: Check the HTTP status on the result before using the content, and route anything other than 200 to your own error handling:

    $result = $graby->fetchContent($url);
    if ($result->getResponse()->getStatus() !== 200) {
        $this->handleError($result);
    }
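
Before dispatching one queue job per URL, it may help to drop malformed, non-HTTP, and duplicate entries so workers are not spent on input that cannot succeed. The following is a sketch under that assumption. The function name normalizeUrlBatch is hypothetical, and the check is independent of Graby's own allowed_urls and blocked_urls handling.

```php
<?php

/**
 * Keep only well-formed http(s) URLs, trimmed and de-duplicated,
 * in their original order.
 *
 * @param array<int, string> $urls
 * @return array<int, string>
 */
function normalizeUrlBatch(array $urls): array
{
    $seen = [];

    foreach ($urls as $url) {
        $url = trim($url);

        // Reject anything PHP does not consider a syntactically valid URL.
        if (filter_var($url, FILTER_VALIDATE_URL) === false) {
            continue;
        }

        // Only http and https make sense for article scraping.
        $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
        if (!in_array($scheme, ['http', 'https'], true)) {
            continue;
        }

        // Using the URL as an array key removes duplicates.
        $seen[$url] = true;
    }

    return array_keys($seen);
}

$batch = normalizeUrlBatch([
    'https://example.com/a',
    'not a url',
    'ftp://example.com/x',
    'https://example.com/a',
    ' https://example.org/b ',
]);
```

The cleaned list can then be passed to the dispatch loop shown in the queue tip above.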
    

Gotchas and Tips

Pitfalls

  1. URL Filtering: Ensure allowed_urls and blocked_urls configurations are set correctly to avoid unintended scraping:

    $graby = new Graby([
        'allowed_urls' => ['example.com', 'trusted-site.org'],
        'blocked_urls' => ['blocked-site.com'],
    ]);
    
  2. Relative URLs: If rewrite_relative_urls is set to false, ensure your application handles relative URLs in the extracted content:

    $graby = new Graby(['rewrite_relative_urls' => false]);
    
  3. XSS Filtering: Enabling xss_filter will strip certain elements like iframes. Disable it if you need to preserve these:

    $graby = new Graby(['xss_filter' => false]);
    
  4. Timeouts: Without a manually configured HTTP client, Graby relies on the client's own default timeout (Guzzle's default is to wait indefinitely), so a slow or unresponsive site can stall a request. Set an explicit timeout:

    $guzzle = new GuzzleClient(['timeout' => 10]);
    $graby = new Graby([], new GuzzleAdapter($guzzle));
    
  5. Multi-page Articles: Ensure multipage and singlepage are enabled if you expect multi-page articles:

    $graby = new Graby([
        'multipage' => true,
        'singlepage' => true,
    ]);
    

Debugging

  • Log Files: Enable debugging to generate detailed logs in log/graby.log and log/html.log:

    $graby = new Graby(['debug' => true, 'log_level' => 'debug']);
    
  • Monolog Integration: Use the GrabyHandler to log output in Symfony projects:

    services:
        graby.log_handler:
            class: Graby\Monolog\Handler\GrabyHandler
    
  • Check Response Status: Always verify the response status to handle errors:

    if ($result->getResponse()->getStatus() !== 200) {
        Log::error('Failed to fetch URL: ' . $result->getResponse()->getEffectiveUri());
    }
    

Extension Points

  1. Custom Site Configs: Add custom site configurations to handle specific websites:

    $graby = new Graby([
        'extractor' => [
            'config_builder' => [
                'site_config' => [__DIR__ . '/custom-site-configs'],
            ],
        ],
    ]);
    
  2. Custom Filters: Use pre_filters and post_filters to clean or modify HTML content:

    $graby = new Graby([
        'extractor' => [
            'readability' => [
                'pre_filters' => ['/<script.*?>.*?<\/script>/is' => ''],
                'post_filters' => ['/<style.*?>.*?<\/style>/is' => ''],
            ],
        ],
    ]);
    
  3. Custom HTTP Headers: Add custom headers to the HTTP client for specific use cases:

    $guzzle = new GuzzleClient([
        'headers' => [
            'User-Agent' => 'CustomUserAgent/1.0',
            'Accept-Language' => 'en-US',
        ],
    ]);
    $graby = new Graby([], new GuzzleAdapter($guzzle));
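
To check what the pre_filters and post_filters patterns from the Custom Filters item actually match, they can be tried in isolation with plain preg_replace. This is only a standalone check of the two regular expressions. The applyFilters helper is hypothetical, and how Graby applies the filters internally is not shown here.

```php
<?php

// Patterns copied from the Custom Filters example.
$preFilters = ['/<script.*?>.*?<\/script>/is' => ''];
$postFilters = ['/<style.*?>.*?<\/style>/is' => ''];

/**
 * Apply each pattern => replacement pair in order (hypothetical helper).
 *
 * @param array<string, string> $filters
 */
function applyFilters(string $html, array $filters): string
{
    foreach ($filters as $pattern => $replacement) {
        // preg_replace returns null on a regex error; keep the input then.
        $html = preg_replace($pattern, $replacement, $html) ?? $html;
    }

    return $html;
}

$html = "<p>Hello</p>"
    . "<script type=\"text/javascript\">\nalert(1);\n</script>"
    . "<style>p { color: red; }</style>";

$withoutScripts = applyFilters($html, $preFilters);
$clean = applyFilters($withoutScripts, $postFilters);
```

The s modifier lets the dot match newlines, which is why the multi-line script block is removed in one pass; the i modifier makes the tag names case-insensitive.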
    