Crawler Laravel Package

spatie/crawler

PHP web crawler that discovers links concurrently via Guzzle, with optional JavaScript rendering powered by Chrome/Puppeteer. Configure depth, internal-only rules, and callbacks for per-page handling, plus a fake mode to test crawl logic without real HTTP requests.

Configuring requests

User agent

By default, the crawler identifies itself as * (a literal asterisk sent in the User-Agent header). You can set a custom user agent using the userAgent method.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->userAgent('MyBot/1.0')
    ->start();

The user agent is also used when checking robots.txt rules, so make sure it matches any user-agent-specific rules you want to respect.

Extra headers

You can add extra headers to every request the crawler makes using the headers method.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->headers([
        'Accept-Language' => 'en-US',
        'X-Custom-Header' => 'value',
    ])
    ->start();

The headers will be merged with the default headers. You can call headers multiple times. Each call will merge the new headers with the previously set ones.

Timeouts

By default, the crawler uses a 10-second timeout for both connecting and receiving a response. You can change these values using the connectTimeout and requestTimeout methods. Both accept a value in seconds.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->connectTimeout(5)
    ->requestTimeout(30)
    ->start();

The connectTimeout method sets the maximum number of seconds to wait while trying to connect to the server. The requestTimeout method sets the maximum number of seconds to wait for the entire request (including the response) to complete.

Authentication

When crawling sites that require authentication, you can use the basicAuth or token methods.

The basicAuth method configures HTTP Basic authentication for all requests.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->basicAuth('username', 'password')
    ->start();

The token method sets an Authorization header. It defaults to the Bearer type.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->token('your-api-token')
    ->start();

You can pass a second argument to change the token type.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->token('your-api-token', 'Token')
    ->start();
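
As a rough illustration of what the two token examples above amount to on the wire, the sketch below builds the header value in plain PHP. The helper name authorizationHeader is hypothetical and not part of the crawler. The "type, space, token" layout is an assumption based on the usual Authorization header convention, not something this page states.

```php
<?php

// Hypothetical helper, not part of spatie/crawler: builds the header that a
// token with the given type is assumed to produce ("<type> <token>").
function authorizationHeader(string $token, string $type = 'Bearer'): array
{
    return ['Authorization' => $type . ' ' . $token];
}

// Default type, as in token('your-api-token').
$default = authorizationHeader('your-api-token');

// Custom type, as in token('your-api-token', 'Token').
$custom = authorizationHeader('your-api-token', 'Token');
```

If the server expects a different scheme name, only the second argument should need to change.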

SSL verification

When crawling sites with self-signed or invalid SSL certificates (for example, a staging environment), you can disable certificate verification using the withoutVerifying method.

use Spatie\Crawler\Crawler;

Crawler::create('https://staging.example.com')
    ->withoutVerifying()
    ->start();

You should only use this for trusted environments. In production, always keep SSL verification enabled.

Proxy

You can route all crawler requests through a proxy server using the proxy method.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->proxy('http://proxy-server:8080')
    ->start();

This accepts any proxy string supported by Guzzle, including authenticated proxies.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->proxy('http://username:password@proxy-server:8080')
    ->start();

Cookies

You can send cookies with every request using the cookies method. This is useful when you need to crawl a site that requires a session cookie or other cookie-based authentication.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->cookies(['session_id' => 'abc123', 'token' => 'xyz'], 'example.com')
    ->start();

The first argument is an array of cookie names and values. The second argument is the domain the cookies belong to.

Query parameters

You can append query parameters to every request the crawler makes using the queryParameters method. This is useful for passing API keys or other parameters that need to be present on every request.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->queryParameters(['api_key' => 'your-key'])
    ->start();

You can call queryParameters multiple times. Each call will merge the new parameters with the previously set ones.
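
To make "appending" and "merging" concrete, here is a minimal plain-PHP sketch of the idea. It is not the crawler's implementation: the helper appendQueryParameters is hypothetical, and it assumes that parameters already present on a URL are kept and that later values win when a name repeats.

```php
<?php

// Hypothetical helper, not part of spatie/crawler: append extra query
// parameters to a URL, keeping any parameters the URL already has.
// For brevity it ignores user info and fragments.
function appendQueryParameters(string $url, array $extra): string
{
    $parts = parse_url($url);
    parse_str($parts['query'] ?? '', $existing);

    // Later values replace earlier ones with the same name.
    $query = http_build_query(array_replace($existing, $extra));

    $base = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '')
        . ($parts['path'] ?? '');

    return $query === '' ? $base : $base . '?' . $query;
}

// Two successive sets of parameters, merged the way repeated calls are described.
$parameters = array_replace(['api_key' => 'your-key'], ['lang' => 'en']);

$url = appendQueryParameters('https://example.com/page?foo=1', $parameters);
```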

Retrying failed requests

Some servers occasionally return 5xx errors or drop connections. You can configure the crawler to automatically retry failed requests using the retry method.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->retry(times: 2, delayInMs: 500)
    ->start();

The first argument is the maximum number of retries per request. The second argument is the base delay between retries in milliseconds. The delay increases linearly with each attempt (500ms, 1000ms, 1500ms, ...).

A request will be retried when it results in a connection error or a 5xx response status code.
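
The linear delay described above can be sketched in plain PHP. The helper retryDelaysInMs is hypothetical and only models the schedule, assuming each wait is the base delay multiplied by the attempt number. It does not perform any requests.

```php
<?php

// Hypothetical helper, not part of spatie/crawler: lists the wait before each
// retry, assuming the delay is the base delay multiplied by the attempt number.
function retryDelaysInMs(int $times, int $delayInMs): array
{
    $delays = [];
    for ($attempt = 1; $attempt <= $times; $attempt++) {
        $delays[] = $attempt * $delayInMs;
    }

    return $delays;
}

// Mirrors retry(times: 2, delayInMs: 500) from the example above.
$delays = retryDelaysInMs(2, 500);

// Total time spent waiting if every attempt fails (request time excluded).
$worstCaseWaitInMs = array_sum($delays);
```

Under that assumption, a request that keeps failing with these settings waits 500 ms and then 1000 ms, so the delays add up to 1500 ms before the crawler gives up on it.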

Guzzle middleware

You can add custom Guzzle middleware to the underlying HTTP client using the middleware method. This lets you hook into the request/response lifecycle for logging, caching, modifying headers, or any other purpose.

use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->middleware(Middleware::mapRequest(function (RequestInterface $request) {
        return $request->withHeader('X-Custom-Header', 'value');
    }), 'add-custom-header')
    ->start();

The first argument is a callable that follows Guzzle's middleware signature. The optional second argument is a name for the middleware, which can be useful for debugging.

You can call middleware multiple times to add multiple middlewares. They will be pushed onto the handler stack in the order they are added.

use GuzzleHttp\Middleware;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->middleware($loggingMiddleware, 'logging')
    ->middleware($cachingMiddleware, 'caching')
    ->start();
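
The $loggingMiddleware and $cachingMiddleware variables in the example above are not defined on this page. As one possible shape for the first, the sketch below writes a logging middleware as a plain closure following Guzzle's middleware signature: it receives the next handler and returns a new handler. The factory name makeLoggingMiddleware and its $log callback are hypothetical. The type hints for the request are left out so the snippet has no dependencies.

```php
<?php

// Hypothetical factory, not part of spatie/crawler or Guzzle. It returns a
// closure in Guzzle's middleware shape: take the next handler, return a new
// handler that receives the request and its options.
function makeLoggingMiddleware(callable $log): callable
{
    return function (callable $handler) use ($log): callable {
        return function ($request, array $options) use ($handler, $log) {
            // Record the request line, then hand off to the next handler.
            $log($request->getMethod() . ' ' . $request->getUri());

            return $handler($request, $options);
        };
    };
}

// Write each request line to the PHP error log.
$loggingMiddleware = makeLoggingMiddleware('error_log');
```

Taking the logger as a callback keeps the middleware easy to exercise without a real HTTP client, because any callable that accepts a string will do.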

Custom Guzzle client options

The second argument to Crawler::create() accepts an array of Guzzle request options. These are merged with the crawler's defaults, so you only need to specify the options you want to change.

use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com', [
    RequestOptions::STREAM => true,
    RequestOptions::TIMEOUT => 30,
])->start();

The defaults are:

[
    RequestOptions::COOKIES => true,
    RequestOptions::CONNECT_TIMEOUT => 10,
    RequestOptions::TIMEOUT => 10,
    RequestOptions::ALLOW_REDIRECTS => ['track_redirects' => true],
    RequestOptions::HEADERS => ['User-Agent' => '*'],
]

To explicitly remove a default option, set it to null:

use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com', [
    RequestOptions::COOKIES => null, // removes the default COOKIES option entirely
])->start();
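
The merge-and-remove behaviour described in this section can be modelled in a few lines of plain PHP. This is an illustrative sketch, not the crawler's actual code: mergeClientOptions is a hypothetical helper, string keys stand in for the RequestOptions constants, and only top-level options are merged because this page does not say how nested arrays such as headers are combined.

```php
<?php

// Hypothetical helper, not the crawler's actual code: overlay user options on
// the defaults, then drop every option that was explicitly set to null.
function mergeClientOptions(array $defaults, array $overrides): array
{
    $merged = array_replace($defaults, $overrides);

    return array_filter($merged, fn ($value) => $value !== null);
}

// String keys stand in for the GuzzleHttp\RequestOptions constants.
$defaults = [
    'cookies' => true,
    'connect_timeout' => 10,
    'timeout' => 10,
    'allow_redirects' => ['track_redirects' => true],
    'headers' => ['User-Agent' => '*'],
];

$options = mergeClientOptions($defaults, [
    'timeout' => 30,   // overrides the default
    'stream' => true,  // adds an option that has no default
    'cookies' => null, // removes the default
]);
```

In this model, options that are not mentioned keep their default values, which matches the statement that you only need to specify what you want to change.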

Redirects

By default, the crawler follows redirects and tracks the redirect chain. This means that when a URL redirects to another location, the crawler will follow the redirect and use the final URL as the base for extracting links.

If you need to disable redirect following, you can pass custom client options:

use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com', [
    RequestOptions::ALLOW_REDIRECTS => false,
])->start();

Streaming responses

For sites with large responses, you can enable streaming to reduce memory usage. When streaming is enabled, response bodies are read in chunks rather than loaded entirely into memory.

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->stream()
    ->start();