artryazanov/laravel-wikipedia-games-db
Laravel package that builds a normalized video games database by scraping Wikipedia. Queue-driven and resumable, traverses categories, parses infoboxes via MediaWiki API + HTML, stores many-to-many relations with wikipedia_* tables, configurable via .env.
Installation
composer require artryazanov/laravel-wikipedia-games-db
php artisan vendor:publish --provider="ArtRyazanov\WikipediaGamesDb\WikipediaGamesDbServiceProvider" --tag="config"
Configure
Edit config/wikipedia-games-db.php to set:
api_endpoint (default: https://en.wikipedia.org/w/api.php)
queue_connection (e.g., database, redis)
game_model (default: App\Models\Game)
Run Migrations
php artisan migrate
First Use Case: Scrape a Category
Dispatch a job to scrape a Wikipedia category (e.g., "Action video games"):
use ArtRyazanov\WikipediaGamesDb\Jobs\ScrapeCategory;
ScrapeCategory::dispatch('Action video games');
Process Queue
php artisan queue:work
Category Traversal
Use ScrapeCategory to recursively scrape all games in a Wikipedia category tree.
ScrapeCategory::dispatch('Video game genres');
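The recursive traversal described above can be pictured as a breadth-first walk that remembers which categories it has already seen. The following is a minimal plain-PHP sketch, not the package's implementation: `collectPages`, the `$fetchMembers` callable, and the `$maxDepth` limit are hypothetical names introduced for illustration, and the in-memory `$graph` stands in for real API responses.

```php
<?php

/**
 * Breadth-first walk over a category tree, visiting each category once.
 *
 * $fetchMembers is a hypothetical callable: given a category name it returns
 * ['subcategories' => string[], 'pages' => string[]].
 */
function collectPages(string $root, callable $fetchMembers, int $maxDepth = 5): array
{
    $visited = [$root => true];
    $queue = [[$root, 0]];
    $pages = [];

    while ($queue !== []) {
        [$category, $depth] = array_shift($queue);
        $members = $fetchMembers($category);

        foreach ($members['pages'] as $page) {
            $pages[$page] = true; // de-duplicate pages listed in several categories
        }

        if ($depth >= $maxDepth) {
            continue; // do not descend past the depth limit
        }

        foreach ($members['subcategories'] as $sub) {
            if (!isset($visited[$sub])) { // skip categories already queued
                $visited[$sub] = true;
                $queue[] = [$sub, $depth + 1];
            }
        }
    }

    // Numeric-looking titles (e.g. "1942") become int keys, so cast back to strings.
    return array_map('strval', array_keys($pages));
}

// In-memory stand-in for the API, including a cycle (B links back to A).
$graph = [
    'A' => ['subcategories' => ['B'], 'pages' => ['Game 1']],
    'B' => ['subcategories' => ['A', 'C'], 'pages' => ['Game 2', 'Game 1']],
    'C' => ['subcategories' => [], 'pages' => ['Game 3']],
];
$result = collectPages('A', fn (string $c): array => $graph[$c]);
```

Because every category is marked as visited before it is queued, the cycle between A and B is walked only once and the loop terminates.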
Game Parsing
The package parses game pages automatically through the ParseGame job, which can also be dispatched by hand:
// Manually trigger parsing for a specific game title
use ArtRyazanov\WikipediaGamesDb\Jobs\ParseGame;
ParseGame::dispatch('The Legend of Zelda');
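For orientation, this is roughly what a request for a page's rendered HTML looks like at the MediaWiki API level. The sketch only builds the URL with the standard `action=parse` parameters; `buildParseUrl` is a hypothetical helper, and how the package itself assembles its requests is not shown in this document.

```php
<?php

/**
 * Build a MediaWiki action=parse URL that returns a page's rendered HTML as JSON.
 */
function buildParseUrl(string $endpoint, string $title): string
{
    $query = http_build_query([
        'action' => 'parse',
        'page' => $title,
        'prop' => 'text',
        'format' => 'json',
    ], '', '&');

    return $endpoint . '?' . $query;
}

$url = buildParseUrl('https://en.wikipedia.org/w/api.php', 'The Legend of Zelda');
```

Passing the separator explicitly keeps the output independent of the `arg_separator.output` ini setting.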
Data Normalization
Extend App\Models\Game to map Wikipedia fields to your schema:
// Example: Cast infobox fields to model attributes
protected $casts = [
'release_year' => 'integer',
'developer' => 'array',
];
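Casts only help once the raw infobox text has been reduced to clean values. A small sketch of that step follows, assuming the raw cell arrives as an HTML fragment or free-form string; `splitInfoboxList` and `extractYear` are hypothetical helpers, not part of the package.

```php
<?php

/** Split a multi-valued infobox cell such as "A<br>B" into a list of trimmed names. */
function splitInfoboxList(string $raw): array
{
    $parts = preg_split('/<br\s*\/?>|\n/i', $raw);
    $parts = array_map(fn (string $p): string => trim(strip_tags($p)), $parts);

    // Drop empty fragments and reindex from 0.
    return array_values(array_filter($parts, fn (string $p): bool => $p !== ''));
}

/** Pull the first four-digit year (19xx or 20xx) out of a release date, or null if none. */
function extractYear(string $raw): ?int
{
    return preg_match('/\b(19|20)\d{2}\b/', $raw, $m) === 1 ? (int) $m[0] : null;
}

$developers = splitInfoboxList('Nintendo EAD<br />Intelligent Systems');
$year = extractYear('September 13, 1985');
```

The resulting array and integer line up with the `array` and `integer` casts shown above.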
Queue Management
Use ScrapeCategory::dispatch('Category', ['batch_size' => 50]) to limit API calls.
Set the retry count in .env:
QUEUE_WORKER_RETRIES=3
Integration with Existing Data
// Sync parsed games with your DB
$game = Game::firstOrCreate(
['title' => 'Super Mario Bros.'],
[
'developer' => ['Nintendo'],
'release_year' => 1985,
]
);
Custom Field Mapping
Override the parser’s default mappings in config/wikipedia-games-db.php:
'field_mappings' => [
'infobox_developer' => 'developers',
'infobox_publisher' => 'publishers',
],
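Conceptually, a mapping like this renames raw infobox keys to the column or relation names used by your schema. The sketch below shows that idea in plain PHP; `applyFieldMappings` is a hypothetical function, and whether the package drops or keeps unmapped keys is an assumption made here for illustration.

```php
<?php

/**
 * Rename raw infobox keys according to a mapping; unmapped keys are dropped.
 */
function applyFieldMappings(array $raw, array $mappings): array
{
    $mapped = [];
    foreach ($mappings as $source => $target) {
        if (array_key_exists($source, $raw)) {
            $mapped[$target] = $raw[$source];
        }
    }

    return $mapped;
}

$mappings = ['infobox_developer' => 'developers', 'infobox_publisher' => 'publishers'];
$raw = ['infobox_developer' => ['Valve'], 'infobox_genre' => ['First-person shooter']];
$mapped = applyFieldMappings($raw, $mappings);
```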
API Rate Limiting
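Throttle requests with a named rate limiter, registered in a service provider's boot() method. Jobs that carry Laravel's RateLimited('wikipedia-api') job middleware are then held to this limit:
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Support\Facades\RateLimiter;
RateLimiter::for('wikipedia-api', function ($job) {
    return Limit::perMinute(60); // illustrative value, tune to your needs
});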
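The idea behind throttling outgoing requests can also be sketched without any framework support: keep the time of the last request and compute how long to wait before the next one. `MinIntervalThrottle` is a hypothetical class written for this example; the clock value is passed in so the logic can be checked without sleeping.

```php
<?php

/**
 * Enforces a minimum gap between requests. The current time is injected so
 * the logic can be exercised without real sleeping.
 */
final class MinIntervalThrottle
{
    private ?float $lastRequestAt = null;

    public function __construct(private float $minIntervalSeconds)
    {
    }

    /** Seconds the caller should wait before sending a request at time $now. */
    public function delayFor(float $now): float
    {
        if ($this->lastRequestAt === null) {
            $delay = 0.0; // first request goes out immediately
        } else {
            $delay = max(0.0, $this->lastRequestAt + $this->minIntervalSeconds - $now);
        }
        $this->lastRequestAt = $now + $delay; // when the request will actually go out

        return $delay;
    }
}

$throttle = new MinIntervalThrottle(1.0);
$first = $throttle->delayFor(100.0);   // no previous request
$second = $throttle->delayFor(100.25); // arrives before the one-second gap has passed
$third = $throttle->delayFor(105.0);   // long gap since the last request
```

In real use the caller would pass `microtime(true)` as `$now` and sleep for the returned number of seconds before calling the API.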
Webhooks for New Games
Listen for the GameParsed event in EventServiceProvider:
protected $listen = [
'ArtRyazanov\WikipediaGamesDb\Events\GameParsed' => [
'App\Listeners\NotifySlack',
],
];
Hybrid Scraping
Combine with manual data entry for high-value games:
// Skip API parsing for a game (e.g., "Half-Life")
Game::updateOrCreate(
['title' => 'Half-Life'],
['manually_verified' => true]
);
API Quotas
Exceeding Wikipedia's API quotas can result in 503 errors.
Circular Categories
Categories like "Video game genres" may reference themselves, causing infinite loops.
Track visited categories in a visited_categories table, or use Laravel’s queue:failed to monitor stuck jobs.
Infobox Inconsistency
Not all games have infoboxes, leading to partial data.
// In a listener for GameParsed
if (empty($game->developer)) {
logger()->warning("Missing developer for {$game->title}");
}
Title Ambiguity
Games with similar names (e.g., "Resident Evil" vs. "Resident Evil 2") may merge incorrectly.
Use title + release_year as a composite key.
Queue Stalling
Large categories (e.g., "List of video games") may stall the queue.
ScrapeCategory::dispatch('List of video games', ['chunk_size' => 100]);
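The exact semantics of the chunk_size option are not documented here, but the general idea of chunking is to split a long list of titles into bounded batches so that no single job has to process the whole category. A minimal sketch follows, with `chunkTitles` as a hypothetical helper:

```php
<?php

/**
 * Split a list of page titles into fixed-size batches so each queued job
 * handles a bounded amount of work.
 */
function chunkTitles(array $titles, int $chunkSize): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('chunkSize must be at least 1');
    }

    return array_chunk(array_values($titles), $chunkSize);
}

// 250 placeholder titles split into batches of at most 100.
$titles = array_map(fn (int $i): string => "Game {$i}", range(1, 250));
$batches = chunkTitles($titles, 100);
```

Each batch could then be handed to its own queued job, which keeps individual jobs short and makes retries cheaper.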
Job Logging
Enable Laravel’s queue logging:
QUEUE_LOG=true
Check storage/logs/laravel.log for failed jobs.
API Debugging
Inspect raw API responses by temporarily overriding the API client:
// In a service provider
$this->app->singleton(\ArtRyazanov\WikipediaGamesDb\Services\WikipediaApi::class, function () {
return new \ArtRyazanov\WikipediaGamesDb\Services\WikipediaApi(
new \ArtRyazanov\WikipediaGamesDb\Services\DebugApiClient
);
});
Database Constraints
Add indexes to speed up lookups:
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;
Schema::table('games', function (Blueprint $table) {
    $table->index('title');
    $table->index('release_year');
});
Custom Parsers
Extend ArtRyazanov\WikipediaGamesDb\Parsers\GameParser to handle niche fields:
namespace App\Parsers;
use ArtRyazanov\WikipediaGamesDb\Parsers\GameParser as BaseParser;
class CustomGameParser extends BaseParser {
protected function parsePlatforms($html) {
// Custom logic for parsing platform data
}
}
Register in config/wikipedia-games-db.php:
'parser' => \App\Parsers\CustomGameParser::class,
Web Scraping Fallback
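For games missing API data, implement a fallback parser using Symfony's DomCrawler component:
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($html);
// text('') returns an empty string instead of throwing when nothing matches
$platforms = $crawler->filter('.infobox-platform')->text('');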
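If adding a dependency is not desirable, the same kind of fallback can be sketched with PHP's built-in DOM extension. The example below assumes the infobox is a table whose class list contains `infobox` and whose rows pair a `th` label with a `td` value; `parseInfobox` is a hypothetical helper and the sample markup is illustrative, not copied from a real page.

```php
<?php

/**
 * Read label => value pairs from the first table whose class list contains "infobox".
 */
function parseInfobox(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep messy markup from emitting warnings
    $doc = new DOMDocument();
    // The XML declaration is a common hint so libxml treats the input as UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = $xpath->query(
        '(//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")])[1]//tr[th and td]'
    );

    $fields = [];
    foreach ($rows as $row) {
        $label = trim($xpath->query('th', $row)->item(0)->textContent);
        $value = trim($xpath->query('td', $row)->item(0)->textContent);
        $fields[$label] = $value;
    }

    return $fields;
}

$html = '<table class="infobox ib-video-game">'
    . '<tr><th>Developer(s)</th><td>Valve</td></tr>'
    . '<tr><th>Platform(s)</th><td>Windows, PlayStation 2</td></tr>'
    . '</table>';
$fields = parseInfobox($html);
```

Pages without an infobox simply yield an empty array, which the caller can treat as "no data" rather than an error.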
Data Validation
Add Laravel validation rules to sanitize scraped data:
use Illuminate\Support\Facades\Validator;
$validator = Validator::make($data, [
'release_year' => 'nullable|integer|min:1950|max:' . (date('Y') + 1),
]);
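To make the intent of that rule explicit, here is a plain-PHP check that roughly mirrors it: null is accepted, anything else must be an integer between 1950 and next year. `isValidReleaseYear` is a hypothetical function and is not a drop-in replacement for Laravel's validator, which has its own handling of edge cases.

```php
<?php

/**
 * Rough plain-PHP counterpart of 'nullable|integer|min:1950|max:<next year>'.
 * $currentYear can be injected to make the check deterministic.
 */
function isValidReleaseYear(mixed $value, ?int $currentYear = null): bool
{
    if ($value === null) {
        return true; // nullable: a missing year is allowed
    }

    $year = filter_var($value, FILTER_VALIDATE_INT);
    if ($year === false) {
        return false; // not an integer or integer-like string
    }

    $currentYear ??= (int) date('Y');

    return $year >= 1950 && $year <= $currentYear + 1;
}
```

Allowing one year beyond the current one leaves room for announced but unreleased games.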
Export/Import
Export parsed games to JSON so another database can be seeded from the file:
// Export parsed games to JSON
$games = Game::all()->toJson();
file_put_contents('games_export.json', $games);
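The import side is the mirror image of the export. The sketch below round-trips plain arrays through a JSON file to show the shape of both steps; `exportGames` and `importGames` are hypothetical helpers, and the sample row is made up for the example.

```php
<?php

/** Write rows to a JSON file; returns false if encoding or writing fails. */
function exportGames(array $games, string $path): bool
{
    $json = json_encode($games, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);

    return $json !== false && file_put_contents($path, $json) !== false;
}

/** Read rows back; throws on an unreadable file or malformed JSON. */
function importGames(string $path): array
{
    $json = file_get_contents($path);
    if ($json === false) {
        throw new RuntimeException("Cannot read {$path}");
    }

    return json_decode($json, true, 512, JSON_THROW_ON_ERROR);
}

// Round-trip a sample row through a temporary file.
$path = tempnam(sys_get_temp_dir(), 'games');
$games = [['title' => 'Pokémon Red', 'release_year' => 1996, 'developer' => ['Game Freak']]];
$written = exportGames($games, $path);
$restored = importGames($path);
unlink($path);
```

In a Laravel application, each restored row could then be passed to something like Game::updateOrCreate, keyed on whatever uniquely identifies a game in your schema.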