Laracrawler Laravel Package

anassrojea/laracrawler


Getting Started

Minimal Setup

  1. Installation:
    composer require anassrojea/laracrawler
    php artisan vendor:publish --provider="AnassRojea\Laracrawler\LaracrawlerServiceProvider"
    
  2. Configure:
    • Open the published config (config/laracrawler.php) and update:
      'crawler' => [
          'base_url' => 'https://yourdomain.com',
          'depth' => 3, // Default crawl depth
          'excludes' => [
              'regex' => ['/admin/', '/login/'],
              'extensions' => ['pdf', 'docx'],
          ],
      ],
      
  3. First Crawl:
    php artisan laracrawler:crawl
    
    • Generates a sitemap.xml in storage/app/sitemaps/.
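
The excludes settings in step 2 can be read as two filters: regular expressions matched against the URL, and file extensions matched against the path. Below is a minimal standalone sketch of that idea in plain PHP. isExcluded() is a hypothetical helper written for illustration, not the package's API, and how the package actually evaluates these rules is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decide whether a URL would be skipped under
 * exclusion rules shaped like the 'excludes' config array.
 * It does not call the package; it only illustrates the two rule types.
 */
function isExcluded(string $url, array $excludes): bool
{
    // Rule type 1: any regex that matches the URL excludes it.
    foreach ($excludes['regex'] ?? [] as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }

    // Rule type 2: compare the path's file extension, case-insensitively.
    // The cast turns a missing or unparsable path into an empty string.
    $path = (string) parse_url($url, PHP_URL_PATH);
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return $extension !== ''
        && in_array($extension, $excludes['extensions'] ?? [], true);
}

$excludes = [
    'regex' => ['/admin/', '/login/'],
    'extensions' => ['pdf', 'docx'],
];

foreach ([
    'https://yourdomain.com/blog/post-1',
    'https://yourdomain.com/admin/users',
    'https://yourdomain.com/files/report.PDF',
] as $url) {
    echo $url, ' => ', isExcluded($url, $excludes) ? 'excluded' : 'kept', PHP_EOL;
}
```

If the patterns are treated as PCRE, as this sketch assumes, the outer slashes are delimiters, so '/admin/' matches the substring "admin" anywhere in the URL rather than only the path segment /admin/.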

First Use Case: Basic Sitemap Generation

use AnassRojea\Laracrawler\Facades\Laracrawler;

// Trigger a crawl and generate sitemap
Laracrawler::crawl()->generate();
  • Outputs a standard sitemap.xml with all crawlable URLs.
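
For reference, a standard sitemap follows the sitemaps.org 0.9 protocol: a urlset root element containing one url entry per page, each with a required loc and optional lastmod and priority. The sketch below builds such a document with PHP's built-in XMLWriter. buildSitemap() and the sample URLs are illustrative assumptions and do not reflect how the package generates its file.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: render a list of URL entries as sitemap XML.
 * Each entry needs 'loc'; 'lastmod' and 'priority' are optional.
 */
function buildSitemap(array $entries): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->setIndent(true);
    $xml->startDocument('1.0', 'UTF-8');

    $xml->startElement('urlset');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    foreach ($entries as $entry) {
        $xml->startElement('url');
        // XMLWriter escapes characters such as & and < in element content.
        $xml->writeElement('loc', $entry['loc']);
        if (isset($entry['lastmod'])) {
            $xml->writeElement('lastmod', $entry['lastmod']); // e.g. 2024-01-31
        }
        if (isset($entry['priority'])) {
            $xml->writeElement('priority', number_format($entry['priority'], 1, '.', ''));
        }
        $xml->endElement(); // closes <url>
    }

    $xml->endElement(); // closes <urlset>
    $xml->endDocument();

    return $xml->outputMemory();
}

echo buildSitemap([
    ['loc' => 'https://yourdomain.com/', 'lastmod' => '2024-01-31', 'priority' => 1.0],
    ['loc' => 'https://yourdomain.com/blog?page=2&sort=new'],
]);
```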

Implementation Patterns

Core Workflows

  1. Crawling & Indexing:

    • Recursive Crawl:
      Laracrawler::crawl()->depth(5)->exclude('regex', '/private/')->run();
      
    • Dynamic Exclusions:
      Laracrawler::crawl()->exclude('extensions', ['zip', 'exe'])->run();
      
  2. Multilingual Support:

    • Define hreflang in config:
      'hreflang' => [
          'en' => 'https://en.yourdomain.com',
          'fr' => 'https://fr.yourdomain.com',
      ],
      
    • Auto-generate alternates during crawl:
      Laracrawler::crawl()->withHreflang()->run();
      
  3. Priority & Lastmod:

    • Override priority for specific routes:
      Route::get('/important', function () {
          return Laracrawler::priority(0.9)->response(...);
      });
      
    • Use database-driven lastmod:
      Laracrawler::crawl()->lastmodStrategy('db', 'posts.updated_at')->run();
      
  4. Image/Video Sitemaps:

    • Extract metadata from views:
      <!-- Auto-parsed for image sitemap -->
      <img src="/image.jpg" alt="Example" title="Example Title">
      
    • Customize video defaults:
      'video' => [
          'default_title' => 'Video Content',
          'default_description' => 'Default description',
      ],
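
For the multilingual workflow in item 2: in a sitemap, language alternates are expressed as xhtml:link elements with rel="alternate" and an hreflang attribute, and each URL entry lists every language version including itself. A minimal sketch, assuming the hreflang config maps a language code to a base URL and that the same path exists on every host (both are assumptions, and alternatesFor() is a hypothetical helper, not part of the package):

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build the alternate-language links for one path
 * from a map of language code => base URL (shaped like the hreflang config).
 *
 * @return array<int, array{hreflang: string, href: string}>
 */
function alternatesFor(string $path, array $hreflang): array
{
    $links = [];
    foreach ($hreflang as $lang => $baseUrl) {
        $links[] = [
            'hreflang' => (string) $lang,
            // Join base URL and path with exactly one slash between them.
            'href' => rtrim($baseUrl, '/') . '/' . ltrim($path, '/'),
        ];
    }

    return $links;
}

$hreflang = [
    'en' => 'https://en.yourdomain.com',
    'fr' => 'https://fr.yourdomain.com',
];

// Print the elements as they would appear inside one <url> entry.
foreach (alternatesFor('/pricing', $hreflang) as $link) {
    printf(
        '<xhtml:link rel="alternate" hreflang="%s" href="%s"/>' . PHP_EOL,
        htmlspecialchars($link['hreflang'], ENT_XML1 | ENT_QUOTES),
        htmlspecialchars($link['href'], ENT_XML1 | ENT_QUOTES)
    );
}
```

For these elements to be valid, the enclosing urlset element also has to declare the xhtml namespace (xmlns:xhtml="http://www.w3.org/1999/xhtml").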
      

Integration Tips

  • Route Middleware:
    Laracrawler::middleware(function ($request) {
        if ($request->ip() === '123.123.123.123') {
            return Laracrawler::exclude();
        }
    });
    
  • Event Listeners:
    use Illuminate\Support\Facades\Log;

    // Listen for crawl completion and log how many URLs were crawled
    Laracrawler::listen('crawl.completed', function ($urls) {
        Log::info("Crawled URLs: " . count($urls));
    });
    
  • Scheduled Crawls:
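    // app/Console/Kernel.php, inside schedule() (Laravel 10 and earlier).
    // On Laravel 11+, schedules live in routes/console.php via the Schedule facade.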
    $schedule->command('laracrawler:crawl')->daily();
    

Gotchas and Tips

Pitfalls

  1. Crawl Depth Limits:

    • Deep crawls (depth > 3) may hit memory limits. Use chunking:
      Laracrawler::crawl()->chunk(100)->run();
      
    • Monitor with:
      php artisan laracrawler:stats
      
  2. URL Normalization Conflicts:

    • Ensure base_url in the config matches the production URL. A mismatch causes duplicate entries.
    • Debug with:
      Laracrawler::normalizeUrl('https://example.com/Page?query=1');
      // Output: "https://example.com/page"
      
  3. Dynamic Content Exclusions:

    • Regex exclusions are case-sensitive by default. Add the i modifier to match regardless of case:
      Laracrawler::crawl()->exclude('regex', '/private/i')->run();
      
  4. Database lastmod Strategies:

    • Ensure the column exists and is indexed. Slow queries block the crawler.
    • Fall back to the file strategy if the database strategy fails:
      Laracrawler::crawl()->lastmodStrategy('file')->run();
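
The normalization shown in pitfall 2 (lowercased path, query string dropped) can be approximated in plain PHP to check by hand which URLs would collapse into the same entry. This is a rough sketch that mirrors the example output only. normalizeForComparison() is a hypothetical helper, the trailing-slash handling is an added assumption, and the package's actual rules may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: lowercase scheme, host and path, and drop the
 * query string and fragment. Stripping a trailing slash is an extra
 * assumption that the documented example does not cover.
 */
function normalizeForComparison(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: {$url}");
    }

    $normalized = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $normalized .= ':' . $parts['port'];
    }

    // Treat "/page/" and "/page" as the same resource; an empty path becomes "/".
    $path = rtrim(strtolower($parts['path'] ?? ''), '/');
    if ($path === '') {
        $path = '/';
    }

    return $normalized . $path;
}

// URLs that differ only by case, query string, or trailing slash end up
// under one key, which makes potential duplicates easy to spot.
$groups = [];
foreach ([
    'https://example.com/Page?query=1',
    'https://EXAMPLE.com/page/',
    'https://example.com/other',
] as $url) {
    $groups[normalizeForComparison($url)][] = $url;
}
print_r($groups);
```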
      

Debugging

  • Log Crawl Details:
    Laracrawler::crawl()->log()->run();
    // Logs to `storage/logs/laracrawler.log`
    
  • Validate URLs:
    php artisan laracrawler:validate
    # Checks for broken links and noindex tags
    
  • Inspect Excluded URLs:
    Laracrawler::getExcludedUrls();
    // Returns array of filtered URLs
    

Extension Points

  1. Custom Crawlers:

    • Extend AnassRojea\Laracrawler\Crawler:
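      namespace App;

      use AnassRojea\Laracrawler\Crawler;

      class CustomCrawler extends Crawler {
          // Exclude any URL containing "special" (str_contains requires PHP 8+)
          public function customRule($url) {
              if (str_contains($url, 'special')) {
                  return $this->exclude();
              }
          }
      }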
      
    • Register in config/laracrawler.php:
      'crawler' => [
          'class' => \App\CustomCrawler::class,
      ],
      
  2. Sitemap Transformers:

    • Override XML generation:
      Laracrawler::transformer(function ($urls) {
          return new CustomSitemapTransformer($urls);
      });
      
  3. Priority Algorithms:

    • Replace default scoring:
      Laracrawler::priorityStrategy(function ($url, $depth, $links) {
          return $depth === 1 ? 1.0 : 0.5;
      });
      
  4. Asset Indexing:

    • Add custom asset parsers (e.g., for PDFs):
      Laracrawler::assetParser('pdf', function ($content) {
          return ['title' => 'PDF Title', 'caption' => 'PDF Description'];
      });
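
For the priority algorithm in item 3: sitemap priorities are values between 0.0 and 1.0, so a replacement scoring function should stay within that range. The closure below is a standalone sketch that takes the same ($url, $depth, $links) parameters as the callback in item 3. The decay formula and the reading of $links as an inbound-link count are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative scoring rule: start at 1.0 for depth 1, lose 0.2 per extra
 * level, add a small bonus for well-linked pages, and clamp to [0.1, 1.0].
 * $links is assumed here to be the number of inbound links.
 */
$priorityByDepth = function (string $url, int $depth, int $links): float {
    $score = 1.0 - 0.2 * max(0, $depth - 1);
    $score += min(0.1, $links / 100); // at most +0.1, reached at 10 or more links

    return round(max(0.1, min(1.0, $score)), 1);
};

foreach ([
    ['https://yourdomain.com/', 1, 25],
    ['https://yourdomain.com/blog', 2, 4],
    ['https://yourdomain.com/blog/2019/old-post', 4, 0],
] as [$url, $depth, $links]) {
    printf('%-45s %.1f' . PHP_EOL, $url, $priorityByDepth($url, $depth, $links));
}
```

A closure of this shape could then be passed to Laracrawler::priorityStrategy() in place of the two-value rule, assuming the callback receives its arguments as typed here.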
      