BLEU Score
Evaluator: BleuScoreEvaluator
The BLEU score evaluates the similarity between an LLM-generated answer and a ground truth reference by computing the weighted geometric mean of clipped n-gram precisions (from unigrams up to 4-grams). It includes a brevity penalty to penalize overly short answers that might otherwise score high due to precision alone.
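Formally, writing p_n for the clipped n-gram precision, c for the answer length, and r for the ground-truth length, the standard BLEU definition is:

$$
\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\Big(\sum_{n=1}^{4} w_n \log p_n\Big)
$$

with weights w_n = 1/4 (uniform weighting is the conventional choice and is assumed here).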
Modified (clipped) precision measures how many of the answer's n-grams appear in the ground truth, but caps each n-gram's match count at its frequency in the ground truth so that repeated matches are not overcounted. For example, if the answer repeats "the" four times while the ground truth contains it only once, just one of those occurrences counts, giving a clipped unigram precision of 1/4.
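A minimal sketch of the clipping idea (the class and variable names here are illustrative only, not part of the evaluator's API):

```java
import java.util.List;
import java.util.Map;

public class ClippingDemo {
    public static void main(String[] args) {
        // The answer repeats "the" four times; the ground truth contains it once.
        // Clipping caps the match count at 1, so the clipped unigram precision
        // is 1/4 rather than the 4/4 that unclipped precision would report.
        List<String> answer = List.of("the", "the", "the", "the");
        Map<String, Long> refCounts = Map.of("the", 1L, "cat", 1L, "sat", 1L);

        long total = answer.size();
        long clipped = answer.stream().distinct()
                .mapToLong(t -> Math.min(
                        answer.stream().filter(t::equals).count(),
                        refCounts.getOrDefault(t, 0L)))
                .sum();
        System.out.println((double) clipped / total); // prints 0.25
    }
}
```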
Required Sample Fields
- answer – Required for n-gram overlap comparison.
- groundTruth – Required for n-gram overlap comparison.
How It Works
- Splits both the answer and the ground truth into tokens based on whitespace.
- Calculates modified n-gram overlap precision (with clipping) between the answer and the ground truth for n = 1 to 4.
- Computes the geometric mean of these n-gram precisions, with smoothing.
- Applies a brevity penalty to reduce the score if the answer is significantly shorter than the ground truth (all four steps are sketched in the code below).
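Taken together, the four steps above amount to something like the following. This is a sketch, not the evaluator's actual implementation: whitespace tokenization matches the description, but the add-one smoothing and uniform 1/4 weights are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BleuSketch {

    static double bleu(String answer, String groundTruth) {
        if (answer.isBlank()) return 0.0;                              // guard against empty answers
        List<String> ans = Arrays.asList(answer.trim().split("\\s+")); // step 1: whitespace tokens
        List<String> ref = Arrays.asList(groundTruth.trim().split("\\s+"));

        double logSum = 0.0;
        for (int n = 1; n <= 4; n++) {                                 // step 2: clipped precision per n
            Map<String, Integer> refCounts = ngramCounts(ref, n);
            int clipped = 0, total = 0;
            for (Map.Entry<String, Integer> e : ngramCounts(ans, n).entrySet()) {
                total += e.getValue();
                clipped += Math.min(e.getValue(), refCounts.getOrDefault(e.getKey(), 0));
            }
            double p = (clipped + 1.0) / (total + 1.0);                // add-one smoothing (assumed)
            logSum += 0.25 * Math.log(p);                              // step 3: geometric mean via logs
        }

        double brevityPenalty = ans.size() >= ref.size()               // step 4: brevity penalty
                ? 1.0
                : Math.exp(1.0 - (double) ref.size() / ans.size());
        return brevityPenalty * Math.exp(logSum);
    }

    static Map<String, Integer> ngramCounts(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(bleu("the cat sat on the mat", "the cat sat on the mat")); // 1.0
        System.out.println(bleu("the cat", "the cat sat on the mat"));                // ~0.135 (penalized)
    }
}
```

Note how the second call scores low despite perfect precision: the brevity penalty dominates because the answer covers only a third of the reference length.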
Score Interpretation
- Range: 0.0 (no overlap) to 1.0 (perfect match)
- Higher scores indicate closer word-level similarity between the answer and the reference, with better preservation of multi-word phrases.
- Lower scores may indicate missing content, incorrect phrasing, or overly short answers.
- BLEU is best suited for comparing structured or templated outputs where exact phrasing matters, but it may not capture semantic similarity well in more flexible, creative text.
Example Usage
```java
// 'sample' is assumed to already contain both required fields:
// answer and groundTruth.
BleuScoreEvaluator evaluator = new BleuScoreEvaluator();
Evaluation result = evaluator.evaluate(sample);
double bleuScore = result.getValue(); // in the range 0.0 to 1.0
```
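The construction of sample is not shown here. As a hypothetical sketch only, assuming a sample type with setters for the two required fields (EvaluationSample, setAnswer, and setGroundTruth are illustrative names, not the library's confirmed API):

```java
// Hypothetical construction, for illustration only; consult the library's
// documentation for the actual sample type and its builder/setters.
EvaluationSample sample = new EvaluationSample();
sample.setAnswer("The cat sat on the mat.");
sample.setGroundTruth("The cat sat on the mat.");
```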