Examples
Example: Testing Answer Correctness
RageAssert rageAssert = new OpenAiLLMBuilder().fromApiKey(key);
rageAssert.given()
.question(QUESTION)
.groundTruth(GROUND_TRUTH)
.when()
.answer(model::generate)
.then()
.assertAnswerCorrectness(0.7);
This example demonstrates the assertAnswerCorrectness feature. It checks that the model's generated answer meets a correctness threshold of 0.7 compared to the defined ground truth.
Example: Testing Faithfulness
RageAssert rageAssert = new OpenAiLLMBuilder().fromApiKey(key);
rageAssert.given()
.question(QUESTION)
.groundTruth(GROUND_TRUTH)
.contextList(List.of(ANSWER))
.when()
.answer(model::generate)
.then()
.assertFaithfulness(0.7);
This example illustrates the use of assertFaithfulness, ensuring that the generated answer stays grounded in the provided context, with a faithfulness score of at least 0.7.
Example: Testing Semantic Similarity
RageAssert rageAssert = new OpenAiLLMBuilder().fromApiKey(key);
rageAssert.given()
.question(QUESTION)
.groundTruth(GROUND_TRUTH)
.when()
.answer(model::generate)
.then()
.assertSemanticSimilarity(0.7);
In this example, assertSemanticSimilarity is used to verify that the semantic similarity score between the model's answer and the ground truth is at least 0.7.
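Under the hood, semantic-similarity metrics typically embed both texts and compare the embedding vectors with cosine similarity. The sketch below shows only that final comparison step; the vectors, and the helper name cosine, are illustrative assumptions, not part of the RageAssert API.

```java
public class CosineSketch {
    // Cosine similarity between two embedding vectors: the dot product
    // divided by the product of the vector norms. Scores near 1.0 mean
    // the embeddings (and hence, roughly, the texts) are close.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Two toy 3-dimensional "embeddings" sharing one dimension.
        System.out.println(cosine(new double[]{1, 0, 1}, new double[]{1, 1, 0}));
    }
}
```

A threshold such as 0.7 in assertSemanticSimilarity is then just a lower bound on a score of this kind.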
Example: Testing Answer Relevance
RageAssert rageAssert = new OpenAiLLMBuilder().fromApiKey(key);
rageAssert.given()
.question(QUESTION)
.groundTruth(GROUND_TRUTH)
.when()
.answer(model::generate)
.then()
.assertAnswerRelevance(0.7);
This example uses the assertAnswerRelevance feature, checking that the model's answer is relevant to the question asked, with a relevance score of at least 0.7.
Example: Testing BLEU Score
RageAssert rageAssert = new OpenAiLLMBuilder().fromApiKey(key);
rageAssert.given()
.question(QUESTION)
.groundTruth(GROUND_TRUTH)
.when()
.answer(model::generate)
.then()
.assertBleuScore(0.7);
This example uses the assertBleuScore feature, testing that the exact n-gram overlap between the model's answer and the ground truth has a precision of at least 0.7.
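To make the "n-gram overlap precision" idea concrete, here is a minimal sketch of clipped unigram precision, the 1-gram building block of BLEU. This is an illustration of the metric, not RageAssert's internal implementation; the helper name unigramPrecision is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BleuSketch {
    // Clipped unigram precision: the fraction of candidate tokens that also
    // appear in the reference, where each token's match count is clipped to
    // its count in the reference (so repeating a word cannot inflate the score).
    static double unigramPrecision(String candidate, String reference) {
        List<String> cand = Arrays.asList(candidate.toLowerCase().split("\\s+"));
        List<String> ref = Arrays.asList(reference.toLowerCase().split("\\s+"));

        Map<String, Integer> refCounts = new HashMap<>();
        for (String t : ref) refCounts.merge(t, 1, Integer::sum);

        Map<String, Integer> candCounts = new HashMap<>();
        for (String t : cand) candCounts.merge(t, 1, Integer::sum);

        int clippedMatches = 0;
        for (Map.Entry<String, Integer> e : candCounts.entrySet()) {
            clippedMatches += Math.min(e.getValue(), refCounts.getOrDefault(e.getKey(), 0));
        }
        return cand.isEmpty() ? 0.0 : (double) clippedMatches / cand.size();
    }

    public static void main(String[] args) {
        // 5 of the 6 candidate tokens match the reference ("sat" does not).
        System.out.println(unigramPrecision("the cat sat on the mat", "the cat is on the mat"));
    }
}
```

Full BLEU combines such precisions over several n-gram sizes with a brevity penalty, but the clipping idea is the same.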
Example: Testing ROUGE Score
RageAssert rageAssert = new OpenAiLLMBuilder().fromApiKey(key);
rageAssert.given()
.question(QUESTION)
.groundTruth(GROUND_TRUTH)
.when()
.answer(model::generate)
.then()
.assertRougeScore(0.9, RougeScoreEvaluator.RougeType.ROUGE_L_SUM, RougeScoreEvaluator.MeasureType.PRECISION);
This example uses the assertRougeScore feature with the ROUGE_L_SUM metric, ensuring that the longest common subsequence (LCS) match, computed across multiple sentences, yields a precision of at least 0.9.
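As a rough illustration of what ROUGE-L precision measures, the sketch below computes the LCS of two token sequences with the classic dynamic-programming recurrence and divides its length by the candidate length. This is a simplified single-sequence sketch, not RageAssert's implementation; ROUGE_L_SUM additionally splits the texts by sentence and sums the LCS statistics.

```java
public class RougeLSketch {
    // Longest common subsequence length between two token arrays,
    // via the standard O(n*m) dynamic-programming table.
    static int lcsLength(String[] a, String[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                dp[i][j] = a[i - 1].equals(b[j - 1])
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length][b.length];
    }

    // ROUGE-L precision: LCS length divided by the candidate's token count.
    static double rougeLPrecision(String candidate, String reference) {
        String[] cand = candidate.toLowerCase().split("\\s+");
        String[] ref = reference.toLowerCase().split("\\s+");
        return cand.length == 0 ? 0.0 : (double) lcsLength(cand, ref) / cand.length;
    }

    public static void main(String[] args) {
        // LCS is "the cat on the mat" (5 tokens) out of 6 candidate tokens.
        System.out.println(rougeLPrecision("the cat sat on the mat", "the cat is on the mat"));
    }
}
```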
Example: Concatenation of multiple assertions
RageAssert rageAssert = new OpenAiLLMBuilder().fromApiKey(key);
rageAssert.given()
.question(QUESTION)
.groundTruth(GROUND_TRUTH)
.when()
.answer(model::generate)
.then()
.assertAnswerCorrectness(0.7)
.then()
.assertSemanticSimilarity(0.7);
This example demonstrates how to apply multiple assertions to a single LLM-generated answer. Assertions can be chained, allowing you to combine different evaluation metrics such as correctness and semantic similarity. This is the recommended approach for testing one answer against multiple metrics.