Getting Started

Begin by installing the package via Composer:

composer require yooper/php-text-analysis

The most approachable entry point is the package's helper functions (e.g., tokenize(), freq_dist(), stem()). Start with basic text preprocessing and analysis:

use TextAnalysis\Functions\AnalysisFunctions as text;

$text = "Laravel makes PHP development joyful!";
$tokens = text::tokenize($text);
$freqDist = text::freq_dist($tokens);
$keywords = array_keys($freqDist->top(3));

These functions require no setup and work out of the box, which makes them well suited to adding quick analysis (e.g., keyword frequency, n-gram trends) to artisan commands, jobs, or controllers.

Implementation Patterns

Preprocessing pipelines: Chain helper calls, feeding each step's output into the next, to build reusable analysis workflows. For example, to preprocess user reviews:

$tokens = text::tokenize($review);
$tokens = text::normalize_tokens($tokens, 'mb_strtolower');
$tokens = text::stem($tokens);
$bigrams = text::ngrams($tokens, 2);
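
To make a chain like the one above reusable, one option is to compose the stages into a single callable. The sketch below is plain PHP with stand-in stages: make_pipeline() is a hypothetical helper, not part of the package, and the whitespace tokenizer and bigram builder only stand in for the library's own functions.

```php
<?php
// Hypothetical helper: composes stages left to right into one callable.
function make_pipeline(callable ...$stages): callable
{
    return function ($input) use ($stages) {
        foreach ($stages as $stage) {
            $input = $stage($input);
        }
        return $input;
    };
}

// Stand-in stages (assumptions for illustration, not the library's implementations).
$tokenize  = fn(string $text): array => preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
$lowercase = fn(array $tokens): array => array_map('strtolower', $tokens); // use mb_strtolower for multibyte text
$bigrams   = function (array $tokens): array {
    $out = [];
    for ($i = 0; $i + 1 < count($tokens); $i++) {
        $out[] = $tokens[$i] . ' ' . $tokens[$i + 1];
    }
    return $out;
};

// Build the workflow once, then reuse it for every review.
$preprocess = make_pipeline($tokenize, $lowercase, $bigrams);
$result = $preprocess('Great Phone great battery');
```

In an application, the stand-in closures would be replaced by the package's helpers, so the same $preprocess callable can be shared by jobs and controllers.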

Sentiment & classification in batch: Process data asynchronously (e.g., in queued jobs) using vader() for sentiment or naive_bayes() for classification — both accept token arrays directly after normalization.
Laravel integration:
- Wrap helpers in services that can be resolved through dependency injection (e.g., wrap text::rake() in a KeywordExtractor service).
- Use text::naive_bayes() in a classifier service with persistent training: cache trained models via serialize() for reuse across requests.
Swappable stemmers/tokenizers: Choose the stemmer or tokenizer dynamically based on content type (e.g., use SentenceTokenizer for legal docs, PorterStemmer for generic content).
Corpus analysis: Aggregate documents into a Corpus object for TF-IDF or lexical diversity (e.g., measure uniqueness in product descriptions).
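
One way to keep helper calls behind a service boundary, as suggested under Laravel integration, is to inject the scoring step as a callable so the container can bind the real helper while tests bind a stub. The class below and its stub scorer are hypothetical names for illustration; the stub ranks by raw frequency and is not the package's RAKE implementation.

```php
<?php
// Hypothetical service: the scoring step is injected so it can be swapped or stubbed.
final class KeywordExtractor
{
    /** @var callable Takes a token array, returns [keyword => score]. */
    private $scorer;

    public function __construct(callable $scorer)
    {
        $this->scorer = $scorer;
    }

    /** Returns the $limit highest-scoring keywords. */
    public function extract(array $tokens, int $limit = 3): array
    {
        $scores = ($this->scorer)($tokens);
        arsort($scores); // highest score first, keys preserved
        return array_slice(array_keys($scores), 0, $limit);
    }
}

// Stub scorer (assumption): raw counts stand in for a real keyword scorer.
$frequencyScorer = fn(array $tokens): array => array_count_values($tokens);

$extractor = new KeywordExtractor($frequencyScorer);
$keywords = $extractor->extract(['php', 'laravel', 'php', 'queue', 'php', 'laravel'], 2);
```

In a Laravel container binding, the closure passed to the constructor could wrap the package's keyword helper instead of the stub; that wiring is omitted here because it depends on the application.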

Gotchas and Tips

VADER breaks on short inputs: Avoid calling vader() with fewer than ~3 tokens — validate or return neutral sentiment early to prevent errors (v1.4.1 fixed some edge cases but not all).
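
A guard along these lines can short-circuit short inputs before the analyzer is called. The function name safe_sentiment(), the default threshold of 3, and the shape of the neutral result are assumptions for illustration; check what the analyzer returns in your installed version before relying on these keys.

```php
<?php
// Hypothetical guard: returns a neutral result instead of calling the analyzer on very short input.
function safe_sentiment(array $tokens, callable $analyzer, int $minTokens = 3): array
{
    // Assumed neutral shape; adjust to match the analyzer's real output.
    $neutral = ['neg' => 0.0, 'neu' => 1.0, 'pos' => 0.0, 'compound' => 0.0];

    if (count($tokens) < $minTokens) {
        return $neutral;
    }

    try {
        return $analyzer($tokens);
    } catch (\Throwable $e) {
        // Fall back to neutral if the analyzer still fails on an edge case.
        return $neutral;
    }
}

// Stand-in analyzer so the sketch runs without the package installed.
$fakeAnalyzer = fn(array $tokens): array => ['neg' => 0.0, 'neu' => 0.4, 'pos' => 0.6, 'compound' => 0.7];

$short = safe_sentiment(['ok'], $fakeAnalyzer);
$long  = safe_sentiment(['really', 'great', 'phone'], $fakeAnalyzer);
```
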
Stemming assumes lowercase: Always normalize case before stemming; otherwise "RUNNING" and "run" may not reduce to the same stem.
Rake expects clean tokens: Stopwords, punctuation, and numbers must be removed before passing to rake() — otherwise, scores become unreliable. Use built-in StopWords classes or text::normalize_tokens() first.
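
The cleanup step can be sketched with a plain filter. The stop word list here is a tiny hypothetical sample and clean_tokens() is an illustrative name, not a package function; a real pipeline would use the package's StopWords classes or a fuller list.

```php
<?php
// Hypothetical helper: drops stop words, punctuation-only tokens, and numbers.
function clean_tokens(array $tokens, array $stopWords): array
{
    $stopLookup = array_flip($stopWords); // fast membership checks

    $kept = array_filter($tokens, function (string $token) use ($stopLookup): bool {
        if (isset($stopLookup[strtolower($token)])) {
            return false; // stop word
        }
        if (is_numeric($token)) {
            return false; // number
        }
        // Keep only tokens that contain at least one letter.
        return preg_match('/\p{L}/u', $token) === 1;
    });

    return array_values($kept); // reindex after filtering
}

$tokens = ['the', 'battery', 'lasts', '2', 'days', '!', 'and', 'charges', 'fast'];
$clean = clean_tokens($tokens, ['the', 'and', 'a', 'of']);
```
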
Naive Bayes training is stateful: Each naive_bayes() instance learns incrementally. Store the trained classifier (e.g., in cache) — don’t retrain on every request.
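
The train-once, reuse-later idea can be sketched with serialize() and a file. TinyClassifier is a deliberately simplified stand-in, not the package's classifier, and the temp file stands in for whatever cache store the application uses.

```php
<?php
// Stand-in classifier (assumption): counts tokens per label and predicts by overlap.
final class TinyClassifier
{
    private array $counts = [];

    public function train(string $label, array $tokens): void
    {
        foreach ($tokens as $token) {
            $this->counts[$label][$token] = ($this->counts[$label][$token] ?? 0) + 1;
        }
    }

    public function predict(array $tokens): ?string
    {
        $best = null;
        $bestScore = 0;
        foreach ($this->counts as $label => $seen) {
            $score = 0;
            foreach ($tokens as $token) {
                $score += $seen[$token] ?? 0;
            }
            if ($score > $bestScore) {
                [$best, $bestScore] = [$label, $score];
            }
        }
        return $best;
    }
}

// Train once, then persist the trained state.
$classifier = new TinyClassifier();
$classifier->train('mexican', ['taco', 'nacho', 'burrito']);
$classifier->train('italian', ['pasta', 'pizza', 'risotto']);

$path = tempnam(sys_get_temp_dir(), 'clf');
file_put_contents($path, serialize($classifier));

// Later request: restore the trained object instead of retraining.
$restored = unserialize(file_get_contents($path), ['allowed_classes' => [TinyClassifier::class]]);
$label = $restored->predict(['pizza', 'pasta']);
unlink($path);
```

Restricting allowed_classes during unserialize() limits which objects can be rebuilt from cached data, which is worth keeping if the cache store is shared.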
N-gram delimiters are literal: ngrams($tokens, 3, '_') produces token1_token2_token3 — ensure downstream logic (e.g., search) expects underscored forms.
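
To see why the delimiter matters downstream, here is a plain-PHP sketch of joining n-grams with a delimiter. join_ngrams() is a hypothetical stand-in that mimics the behavior described above; it is not the package's ngrams() implementation.

```php
<?php
// Hypothetical stand-in: builds n-grams and joins each with the given delimiter.
function join_ngrams(array $tokens, int $n, string $delimiter = ' '): array
{
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode($delimiter, array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

$trigrams = join_ngrams(['fast', 'free', 'shipping', 'today'], 3, '_');

// A search term typed with spaces must be converted before matching.
$needle = str_replace(' ', '_', 'free shipping today');
$found = in_array($needle, $trigrams, true);
```
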
Debug tip: Use print_r($freqDist) or var_dump($tokens) to inspect internal structures — many classes lack __toString() and rely on array access.
