Spring AI with Llama · Chapter 17

Evaluation: Testing and Scoring AI Responses

⚠️ Draft — This chapter is a work in progress. Code snippets have not yet been validated against the running codebase and may need fixes before use.

What you will build: An automated QA pipeline — a test suite that sends HR questions to the SmartHR bot, uses a second AI call to score each answer on accuracy, relevance, and tone, and fails the build if quality drops below a threshold.


The Problem We Are Solving

Dev has been adding features for weeks. Sarah suddenly reports:

"The bot started giving weird answers this week. Did something change?"

Dev has no idea. There are no tests for AI output quality. A change to the system prompt or model version quietly degraded the answers and nobody caught it.

The fix: automated AI evaluation.


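What You Will Learn

- Why exact-match unit tests do not work for AI output
- The AI-as-judge pattern: using a second AI call to score an answer
- Spring AI's built-in evaluators, starting with RelevancyEvaluator
- How to build an evaluation test suite that fails the build when quality drops
- Which evaluation metrics to track, and with what thresholds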

Why Unit Tests Do Not Work for AI

// An exact-match assertion is the wrong tool for AI output
@Test
void shouldAnswerLeaveQuestion() {
    String answer = chatClient.prompt()
            .user("How many leave days do I get?")
            .call()
            .content();

    // ❌ Brittle: the wording changes between runs, so an exact match almost never holds
    assertEquals("You get 15 days", answer);
}

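AI output is non-deterministic: the same prompt can produce differently worded answers from run to run, and a change to the system prompt or model version shifts the wording again. An exact-string assertion therefore fails even when the answer is correct. You need to evaluate the quality of the answer, not its exact text.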
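One cheap step up from exact matching is to assert on the facts an answer must contain rather than on its wording. The sketch below is a minimal illustration of that idea; `KeyFactCheck` and `missingFacts` are hypothetical names invented for this example, not part of Spring AI or the SmartHR codebase, and plain substring matching is a deliberately crude stand-in for a real evaluator.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical helper: checks that an answer mentions required facts, whatever its wording. */
final class KeyFactCheck {

    private KeyFactCheck() {
    }

    /**
     * Returns the required facts that do NOT appear in the answer.
     * Matching is a case-insensitive substring test, so "15" also matches "150";
     * that is acceptable for a smoke check but not for a real quality score.
     */
    static List<String> missingFacts(String answer, List<String> requiredFacts) {
        String normalized = answer.toLowerCase(Locale.ROOT);
        return requiredFacts.stream()
                .filter(fact -> !normalized.contains(fact.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }
}
```

With the required facts `"15"` and `"vacation"`, differently worded answers such as "New employees receive 15 days of paid vacation." and "You get 15 vacation days in your first year." both produce an empty list, while "Please contact HR." is reported as missing both facts. A check like this catches missing facts but says nothing about tone or relevance, which is where a judge model comes in.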


The AI-as-Judge Pattern

Use a second AI call to evaluate the first:

Question: "How many vacation days do new employees get?"
Answer:   "At TechCorp, new employees receive 15 days of paid vacation..."

Judge prompt:
  "Rate this answer on a scale of 1-5 for:
   - Accuracy (is it factually correct?)
   - Relevance (is it on topic?)
   - Tone (is it professional?)
   Return only JSON: {"accuracy": X, "relevance": X, "tone": X, "feedback": "..."}"
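The judge's reply is only useful once it has been turned into numbers a test can assert on. The following is a minimal sketch of that step, assuming the judge is asked for strict JSON with quoted keys and integer scores from 1 to 5. `JudgeVerdict`, `promptFor`, `parse`, and `lowestScore` are hypothetical names chosen for illustration, and the prompt wording is an assumption; the regex parsing keeps the example dependency-free and is not a general JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value type for the judge's reply; the fields mirror the JSON keys requested in the prompt. */
record JudgeVerdict(int accuracy, int relevance, int tone, String feedback) {

    // Captures the feedback string, allowing backslash escapes inside it (escapes are kept as-is).
    private static final Pattern FEEDBACK =
            Pattern.compile("\"feedback\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Builds the judge prompt for one question/answer pair (wording is an assumption). */
    static String promptFor(String question, String answer) {
        return """
                You are grading an HR chatbot.
                Question: %s
                Answer: %s
                Rate the answer from 1 to 5 for accuracy, relevance, and tone.
                Return only JSON: {"accuracy": N, "relevance": N, "tone": N, "feedback": "one sentence"}
                """.formatted(question, answer);
    }

    /** Extracts the scores from the judge's reply, tolerating extra text around the JSON object. */
    static JudgeVerdict parse(String reply) {
        Matcher m = FEEDBACK.matcher(reply);
        String feedback = m.find() ? m.group(1) : "";
        return new JudgeVerdict(
                score(reply, "accuracy"),
                score(reply, "relevance"),
                score(reply, "tone"),
                feedback);
    }

    /** The weakest of the three scores; handy as a single number to gate on. */
    int lowestScore() {
        return Math.min(accuracy, Math.min(relevance, tone));
    }

    // Reads one integer score and rejects replies that omit it or leave the 1-5 scale.
    private static int score(String reply, String key) {
        Matcher m = Pattern.compile("\"" + key + "\"\\s*:\\s*(\\d+)").matcher(reply);
        if (!m.find()) {
            throw new IllegalArgumentException("Judge reply has no \"" + key + "\" score: " + reply);
        }
        int value = Integer.parseInt(m.group(1));
        if (value < 1 || value > 5) {
            throw new IllegalArgumentException("Score for \"" + key + "\" is outside 1-5: " + value);
        }
        return value;
    }
}
```

Rejecting malformed replies loudly is a deliberate choice here: a judge that silently defaults to a passing score hides exactly the regressions this chapter is trying to catch. In a Spring AI project, ChatClient's structured-output call (`.call().entity(JudgeVerdict.class)`) may make the manual parsing unnecessary.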

Spring AI's Built-In Evaluators

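// Relevancy evaluator: is the answer relevant to the question, given the context?
RelevancyEvaluator evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

// The context is passed as Spring AI Document objects, not raw strings
EvaluationResponse result = evaluator.evaluate(
        new EvaluationRequest(question, List.of(new Document(context)), answer)
);

boolean isRelevant = result.isPass();
// At the time of writing, RelevancyEvaluator asks the judge for a yes/no verdict,
// so expect the score to be 1.0 or 0.0 rather than a graded value
float score = result.getScore();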

What You Will Build — Evaluation Test Suite

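import java.util.List;
import java.util.stream.Stream;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
// Package names below follow Spring AI 1.0; earlier milestones place some classes elsewhere
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.evaluation.RelevancyEvaluator;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.document.Document;
import org.springframework.ai.evaluation.EvaluationRequest;
import org.springframework.ai.evaluation.EvaluationResponse;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest
class HrChatQualityTest {

    // Assumes the application defines a ChatClient bean (the SmartHR bot under test)
    @Autowired ChatClient chatClient;
    @Autowired ChatModel chatModel;

    RelevancyEvaluator evaluator;

    @BeforeEach
    void setup() {
        // The judge gets its own ChatClient built from the same model
        evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));
    }

    @ParameterizedTest
    @MethodSource("hrQuestions")
    void shouldAnswerHrQuestionsRelevantly(String question, String context) {
        String answer = chatClient.prompt().user(question).call().content();

        // Wrap the policy text in a Document; EvaluationRequest does not accept raw strings
        EvaluationResponse result = evaluator.evaluate(
                new EvaluationRequest(question, List.of(new Document(context)), answer));

        assertThat(result.getScore())
                .as("Answer quality too low for question: " + question)
                .isGreaterThanOrEqualTo(0.7f);
    }

    // Each case pairs a question with the policy text the answer should be grounded in
    static Stream<Arguments> hrQuestions() {
        return Stream.of(
                Arguments.of("How many vacation days do new employees get?",
                             "TechCorp policy: 15 days for first year"),
                Arguments.of("What is the notice period for resignation?",
                             "TechCorp policy: 30 days notice required"),
                Arguments.of("When does health insurance start?",
                             "TechCorp policy: Health insurance from day one")
        );
    }
}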

Evaluation Metrics to Track

| Metric | What it measures | Acceptable threshold |
|---|---|---|
| Relevancy | Is the answer on-topic? | ≥ 0.7 |
| Faithfulness | Does it stick to facts (not hallucinate)? | ≥ 0.8 |
| Completeness | Does it address the full question? | ≥ 0.65 |
| Tone | Is it professional? | ≥ 0.75 |
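The thresholds listed above can be enforced in one place so that every test applies the same bar. The sketch below shows one way to do that, assuming each metric has already been scored on a 0 to 1 scale by some evaluator. `QualityGate`, `THRESHOLDS`, and `violations` are hypothetical names for this example, and the lower-cased metric keys are a convention chosen here, not something Spring AI defines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical gate that compares per-metric scores against the chapter's thresholds. */
final class QualityGate {

    // Minimum acceptable score per metric, in the same order as the table.
    static final Map<String, Double> THRESHOLDS = new LinkedHashMap<>();

    static {
        THRESHOLDS.put("relevancy", 0.70);
        THRESHOLDS.put("faithfulness", 0.80);
        THRESHOLDS.put("completeness", 0.65);
        THRESHOLDS.put("tone", 0.75);
    }

    private QualityGate() {
    }

    /**
     * Returns one message per metric that is missing or below its threshold.
     * An empty list means the answer passes the gate.
     */
    static List<String> violations(Map<String, Double> scores) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Double> rule : THRESHOLDS.entrySet()) {
            Double score = scores.get(rule.getKey());
            if (score == null) {
                // A metric nobody scored is treated as a failure, not as a pass.
                problems.add(rule.getKey() + ": no score reported");
            } else if (score < rule.getValue()) {
                problems.add(String.format(Locale.ROOT, "%s: %.2f is below the %.2f threshold",
                        rule.getKey(), score, rule.getValue()));
            }
        }
        return problems;
    }
}
```

In a JUnit test this could be used as `assertThat(QualityGate.violations(scores)).isEmpty()`, so a failing build lists every metric that regressed instead of stopping at the first one.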

Summary

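In this chapter you will:

- See why exact-string assertions break on non-deterministic AI output
- Use the AI-as-judge pattern to score answers on accuracy, relevance, and tone
- Evaluate answers with Spring AI's RelevancyEvaluator
- Build a parameterized test suite that fails when answer quality drops below a threshold
- Track relevancy, faithfulness, completeness, and tone against target thresholds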

What's Next

In Chapter 18, we focus on performance — prompt caching, batch processing, and how to handle hundreds of AI requests efficiently without overwhelming Ollama.

Code for this chapter: code/chapter-17-evaluation/