Metrics overview

RAGE4j-Core provides six primary metrics for evaluating LLM responses. Each metric focuses on a different aspect of response quality and is implemented through a dedicated evaluator.

Understanding the metrics

  • Answer Relevance (0-1): Higher scores indicate the answer better addresses the original question.

  • Answer Correctness (0-1): F1 score comparing the answer with the ground truth; 1 indicates perfect alignment (a short F1 sketch follows this list).

  • Faithfulness (0-1): The proportion of claims in the answer that are supported by the supplied context.

  • Semantic Similarity (0-1): Cosine similarity between the embeddings of the answer and the ground truth (see the cosine-similarity sketch after this list).

  • BLEU score (0-1): A precision-based metric that measures exact n-gram overlap between the answer and the ground truth. Higher values indicate more literal matching, so paraphrased but correct answers can still score low.

  • ROUGE score (0-1): Measures overlap between the answer and the reference using several variants (a unigram-overlap sketch follows this list):

    • ROUGE-1 (unigrams): Measures coverage of key words.
    • ROUGE-2 (bigrams): Captures word order and local fluency.
    • ROUGE-L: Longest common subsequence (LCS) at the sentence level.
    • ROUGE-Lsum: LCS computed across sentence pairs (summary level).

    Each variant reports precision, recall, and F1 scores.
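
To make the claim-level scores concrete, the following minimal sketch shows the arithmetic behind answer correctness (an F1 score built from precision and recall) and faithfulness (a supported-claims ratio). The claim counts are hypothetical, and the code illustrates only the formulas; it does not use or reproduce the RAGE4j-Core API.

```java
public final class ClaimScores {

    // F1 over claims: precision = TP / (TP + FP), recall = TP / (TP + FN).
    // TP = claims present in both answer and ground truth,
    // FP = answer claims absent from the ground truth,
    // FN = ground-truth claims missing from the answer.
    static double answerCorrectnessF1(int truePositives, int falsePositives, int falseNegatives) {
        double precision = truePositives + falsePositives == 0
                ? 0.0 : (double) truePositives / (truePositives + falsePositives);
        double recall = truePositives + falseNegatives == 0
                ? 0.0 : (double) truePositives / (truePositives + falseNegatives);
        return precision + recall == 0 ? 0.0 : 2 * precision * recall / (precision + recall);
    }

    // Faithfulness: share of answer claims that the supplied context supports.
    static double faithfulness(int supportedClaims, int totalClaims) {
        return totalClaims == 0 ? 0.0 : (double) supportedClaims / totalClaims;
    }

    public static void main(String[] args) {
        System.out.println(answerCorrectnessF1(4, 1, 2)); // ~0.727
        System.out.println(faithfulness(4, 5));           // 0.8
    }
}
```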
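
Semantic similarity likewise reduces to a cosine similarity between two embedding vectors. The sketch below shows that computation with hard-coded placeholder vectors standing in for model-produced embeddings; it is independent of the library's evaluators.

```java
public final class CosineSimilarity {

    // Cosine similarity: dot(a, b) / (||a|| * ||b||).
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] answerEmbedding = {0.12, 0.80, 0.35};      // placeholder values
        double[] groundTruthEmbedding = {0.10, 0.75, 0.40}; // placeholder values
        System.out.println(cosine(answerEmbedding, groundTruthEmbedding));
    }
}
```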
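
The n-gram metrics (BLEU and the ROUGE family) are built on token-overlap counts. The simplified sketch below computes ROUGE-1 precision, recall, and F1 from unigram overlap using naive whitespace tokenization; the library's actual tokenization and scoring may differ.

```java
import java.util.HashMap;
import java.util.Map;

public final class Rouge1Sketch {

    // Count whitespace-separated, lower-cased tokens.
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // ROUGE-1: unigram overlap between the candidate (answer) and the reference (ground truth).
    static double[] rouge1(String candidate, String reference) {
        Map<String, Integer> candCounts = counts(candidate);
        Map<String, Integer> refCounts = counts(reference);

        int overlap = 0;
        for (Map.Entry<String, Integer> e : candCounts.entrySet()) {
            overlap += Math.min(e.getValue(), refCounts.getOrDefault(e.getKey(), 0));
        }
        int candTotal = candCounts.values().stream().mapToInt(Integer::intValue).sum();
        int refTotal = refCounts.values().stream().mapToInt(Integer::intValue).sum();

        double precision = candTotal == 0 ? 0.0 : (double) overlap / candTotal;
        double recall = refTotal == 0 ? 0.0 : (double) overlap / refTotal;
        double f1 = precision + recall == 0 ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        double[] scores = rouge1("the cat sat on the mat", "the cat lay on the mat");
        System.out.printf("precision=%.2f recall=%.2f f1=%.2f%n", scores[0], scores[1], scores[2]);
    }
}
```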