spatie/crawler
Fast, concurrent web crawler for PHP. Crawl sites, collect internal URLs with depth limits, and hook into crawl events. Can execute JavaScript via Chrome/Puppeteer for rendered pages. Includes fakes for testing crawl logic without real HTTP requests.
Full Changelog: https://github.com/spatie/crawler/compare/9.0.1...9.2.1
Full Changelog: https://github.com/spatie/crawler/compare/9.1.0...9.2.0
- allow_redirects default: changed from false to ['track_redirects' => true] so redirects are followed and the redirect history header is populated correctly
- allowedMimeTypes) now notify observers via crawled() with an empty body instead of being silently skipped
- /../, /./) in extracted URLs are now normalized per RFC 3986
- Crawler::create() now merge with defaults instead of replacing them (pass null to remove a default)
- CrawlRequestFailed now wraps non-RequestException errors so observers always receive a RequestException
- Url, ResponseWithCachedBody, InvalidUrl
- stream() method to opt in to streaming HTTP responses for reduced memory usage
- matchWww() method to treat www.example.com and example.com as equivalent when using internalOnly()
- includeSubdomains() now works as a flag on internalOnly() and composes with matchWww()
- CrawlResponse::redirectHistory() and CrawlResponse::wasRedirected() for inspecting redirect chains
- CrawlObserver::crawlFailed() now receives a ?TransferStatistics parameter for detecting timeouts
- addObserver() now accepts variadic arguments: addObserver($obs1, $obs2)
- CrawlRequestFailed now preserves the original request from ConnectException (retaining custom headers like X-Started-At)
Major rewrite. See UPGRADING.md for a full list of breaking changes.
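One of the notes above says that dot segments (/../, /./) in extracted URLs are normalized per RFC 3986. As a rough illustration of what that normalization means for a URL path, here is a minimal sketch of the "remove dot segments" procedure from RFC 3986, section 5.2.4. It is not the package's own code; the function names removeDotSegments() and removeLastSegment() are hypothetical, and the sketch only handles the path component, not a full URL.

```php
<?php

// Hypothetical helper: drops the last segment (and its preceding "/") from the output buffer.
function removeLastSegment(string $output): string
{
    $pos = strrpos($output, '/');

    return $pos === false ? '' : substr($output, 0, $pos);
}

// Sketch of RFC 3986 section 5.2.4 applied to a URL path (assumption: path only, no query or fragment).
function removeDotSegments(string $path): string
{
    $input = $path;
    $output = '';

    while ($input !== '') {
        if (str_starts_with($input, '../')) {
            // Rule A: strip a leading "../"
            $input = substr($input, 3);
        } elseif (str_starts_with($input, './')) {
            // Rule A: strip a leading "./"
            $input = substr($input, 2);
        } elseif (str_starts_with($input, '/./')) {
            // Rule B: replace a leading "/./" with "/"
            $input = substr($input, 2);
        } elseif ($input === '/.') {
            $input = '/';
        } elseif (str_starts_with($input, '/../')) {
            // Rule C: replace a leading "/../" with "/" and drop the last output segment
            $input = substr($input, 3);
            $output = removeLastSegment($output);
        } elseif ($input === '/..') {
            $input = '/';
            $output = removeLastSegment($output);
        } elseif ($input === '.' || $input === '..') {
            // Rule D: a lone "." or ".." disappears
            $input = '';
        } else {
            // Rule E: move the first segment (with its leading "/", if any) to the output
            $next = strpos($input, '/', 1);
            if ($next === false) {
                $output .= $input;
                $input = '';
            } else {
                $output .= substr($input, 0, $next);
                $input = substr($input, $next);
            }
        }
    }

    return $output;
}
```

With this sketch, a path such as /a/b/c/./../../g collapses to /a/g, which matches the worked example given in the RFC.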
- UriInterface with plain string URLs throughout the API
- ResponseInterface with CrawlResponse in observer callbacks
- CrawlProfile is now an interface instead of an abstract class
- CrawlObserverCollection no longer implements ArrayAccess or Iterator
- http to https
- suggest)
- UrlParser interface redesigned to return ExtractedUrl[] instead of adding to queue directly
- CrawlQueue::has() now accepts string instead of CrawlUrl|UriInterface
- start() now returns a FinishReason enum
- Crawler::create()
- CrawlResponse object with status(), body(), dom(), header(), transferStats(), and more
- CrawlProgress tracking with urlsCrawled, urlsFailed, urlsFound, urlsPending
- FinishReason enum: Completed, CrawlLimitReached, TimeLimitReached, Interrupted
- onCrawled(), onFailed(), onFinished(), onWillCrawl()
- foundUrls() to collect all URLs as CrawledUrl objects
- fake() for testing without HTTP requests
- internalOnly(), includeSubdomains(), shouldCrawl()
- depth(), concurrency(), delay(), limit(), userAgent()
- FixedDelayThrottle and AdaptiveThrottle
- alsoExtract(), extractAll(), ResourceType enum
- ArrayCrawlQueue
- alwaysCrawl() and neverCrawl() pattern overrides
- retry() for automatic retries on connection errors and 5xx responses
- TransferStatistics with typed timing accessors
- CloudflareRenderer for JavaScript rendering
- JavaScriptRenderer interface for custom renderers
- basicAuth(), token(), withoutVerifying(), proxy(), cookies(), queryParameters(), middleware()
- CrawlUrl::create() static factory (use new CrawlUrl(...) instead)
- Spatie\Crawler\Url class
- ResponseWithCachedBody (replaced by CrawlResponse)
- nicmart/tree dependency
- spatie/browsershot as a required dependency (moved to suggest)
- setBrowsershot() and getBrowsershot() methods
- startCrawling() method (use start())
- setUrlParserClass() (use parseSitemaps() or pass a UrlParser directly)
Add Laravel 13 support
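The notes above state that start() now returns a FinishReason enum with the cases Completed, CrawlLimitReached, TimeLimitReached, and Interrupted. The sketch below shows one way calling code might branch on such a value. To stay self-contained it declares a local stand-in enum with the same four case names; the real enum ships with the package, and its namespace is not given in these notes. The helper names describeFinishReason() and stoppedEarly() and the message strings are hypothetical.

```php
<?php

// Local stand-in mirroring the four case names listed in the release notes.
// Assumption: in real code you would import the package's own FinishReason instead.
enum FinishReason
{
    case Completed;
    case CrawlLimitReached;
    case TimeLimitReached;
    case Interrupted;
}

// Hypothetical helper: turns a finish reason into a short log label.
// match() is exhaustive here, so a newly added case would surface as an UnhandledMatchError.
function describeFinishReason(FinishReason $reason): string
{
    return match ($reason) {
        FinishReason::Completed => 'completed',
        FinishReason::CrawlLimitReached => 'stopped: crawl limit reached',
        FinishReason::TimeLimitReached => 'stopped: time limit reached',
        FinishReason::Interrupted => 'stopped: interrupted',
    };
}

// Hypothetical helper: true for every reason other than Completed.
function stoppedEarly(FinishReason $reason): bool
{
    return $reason !== FinishReason::Completed;
}
```

In application code, the value passed to these helpers would be whatever start() returns, for example to decide whether a crawl should be scheduled again.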
Full Changelog: https://github.com/spatie/crawler/compare/8.4.6...8.4.7
Full Changelog: https://github.com/spatie/crawler/compare/8.4.4...8.4.5
Full Changelog: https://github.com/spatie/crawler/compare/8.4.3...8.4.4
Full Changelog: https://github.com/spatie/crawler/compare/8.4.2...8.4.3
Full Changelog: https://github.com/spatie/crawler/compare/8.4.1...8.4.2
Full Changelog: https://github.com/spatie/crawler/compare/8.4.0...8.4.1
Full Changelog: https://github.com/spatie/crawler/compare/8.3.1...8.4.0
Full Changelog: https://github.com/spatie/crawler/compare/8.3.0...8.3.1
Full Changelog: https://github.com/spatie/crawler/compare/8.2.3...8.3.0
Full Changelog: https://github.com/spatie/crawler/compare/8.2.2...8.2.3
Full Changelog: https://github.com/spatie/crawler/compare/8.2.0...8.2.1
Full Changelog: https://github.com/spatie/crawler/compare/8.1.0...8.2.0
Full Changelog: https://github.com/spatie/crawler/compare/8.0.4...8.1.0
Full Changelog: https://github.com/spatie/crawler/compare/8.0.2...8.0.3
Full Changelog: https://github.com/spatie/crawler/compare/8.0.1...8.0.2
Full Changelog: https://github.com/spatie/crawler/compare/8.0.0...8.0.1
Full Changelog: https://github.com/spatie/crawler/compare/7.1.1...7.1.2
- setCurrentCrawlLimit and setTotalCrawlLimit
- ArrayCrawlQueue (#326)
- setParseableMimeTypes() (#293)
- CrawlRequestFailed receives an exception other than RequestException
- hasAlreadyBeenProcessed
THIS VERSION CONTAINS A CRITICAL BUG, DO NOT USE
- ArrayCrawlQueue; this is now the default queue
- CollectionCrawlQueue
- delayBetweenRequests now uses int instead of float everywhere
- getUrls and getPendingUrls
- noindex,follow urls.
- setDelayBetweenRequests
- $defaultClientOptions
- spatie/robots-txt to 1.0.1.
- rel set to nofollow
- Illuminate's and Tighten's Collection.
- CrawlObserver and CrawlProfile are upgraded from interfaces to abstract classes
- tel: links
- setCrawlObservers, addCrawlObserver
- setMaximumResponseSize (someday we'll get this right)
CONTAINS BUGS, DO NOT USE THIS VERSION
- setMaximumResponseSize
CONTAINS BUGS, DO NOT USE THIS VERSION
- setMaximumResponseSize
CONTAINS BUGS, DO NOT USE THIS VERSION
- setMaximumResponseSize
- \Psr\Http\Message\UriInterface for all urls
- CrawlSubdomains profile
- EmptyCrawlObserver
- link function
- CrawlInternalUrls
- tel: links when crawling
- path, segment and segments functions to Url
Full Changelog: https://github.com/spatie/crawler/compare/7.0.4...7.0.5