Spring AI with Llama · Chapter 17

Evaluation: Testing and Scoring AI Responses

⚠️ Draft — This chapter is a work in progress. Code snippets have not yet been validated against the running codebase and may need fixes before use.

What you will build: An automated QA pipeline — a test suite that sends HR questions to the SmartHR bot, uses a second AI call to score each answer on accuracy, relevance, and tone, and fails the build if quality drops below a threshold.

The Problem We Are Solving

Dev has been adding features for weeks. Sarah suddenly reports:

"The bot started giving weird answers this week. Did something change?"

Dev has no idea. There are no tests for AI output quality. A change to the system prompt or model version quietly degraded the answers and nobody caught it.

The fix: automated AI evaluation.

What You Will Learn

Why traditional unit tests do not work for AI output
The "AI as judge" pattern
Spring AI's built-in evaluators
How to write an AI quality test suite
How to set score thresholds and fail the build

Why Unit Tests Do Not Work for AI

// This test is useless for AI
@Test
void shouldAnswerLeaveQuestion() {
    String answer = chatClient.prompt()
            .user("How many leave days do I get?")
            .call().content();

    assertEquals("You get 15 days", answer);  // ❌ brittle: the wording varies between runs, so this fails almost every time
}

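AI output is non-deterministic: the same prompt can produce differently worded answers on each run, even when every one of them is correct. An exact-string assertion therefore fails for reasons that have nothing to do with quality. You need to evaluate the quality of the answer, not its exact text.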
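One way to act on this is to assert a property of the answer rather than its wording. The sketch below is a deterministic pre-check, not a replacement for an AI judge. The names `AnswerChecks` and `mentionsDayCount` are hypothetical, and it assumes the expected figure appears in the answer as digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: checks a property of the answer instead of its exact text.
final class AnswerChecks {

    // Matches a number followed by "day" or "days", with at most one word in between,
    // e.g. "15 days" or "15 paid days".
    private static final Pattern DAY_COUNT =
            Pattern.compile("\\b(\\d{1,4})\\s+(?:\\w+\\s+)?days?\\b", Pattern.CASE_INSENSITIVE);

    private AnswerChecks() {}

    // Returns true if the answer states the expected number of days anywhere in its text.
    static boolean mentionsDayCount(String answer, int expectedDays) {
        Matcher m = DAY_COUNT.matcher(answer);
        while (m.find()) {
            if (Integer.parseInt(m.group(1)) == expectedDays) {
                return true;
            }
        }
        return false;
    }
}
```

Both "You get 15 days" and "New employees receive 15 paid days of leave each year." satisfy `mentionsDayCount(answer, 15)`, whereas `assertEquals` accepts only the first. A check like this catches a wrong number cheaply, but it says nothing about relevance or tone, which is where the judge comes in.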

The AI-as-Judge Pattern

Use a second AI call to evaluate the first:

Question: "How many vacation days do new employees get?"
Answer:   "At TechCorp, new employees receive 15 days of paid vacation..."

Judge prompt:
  "Rate this answer on a scale of 1-5 for:
   - Accuracy (is it factually correct?)
   - Relevance (does it address the question asked?)
   - Tone (is it professional?)
   Return JSON: {accuracy: X, relevance: X, tone: X, feedback: '...'}"
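The pattern can be sketched in plain Java as two steps: build the judge prompt, then pull the scores out of the judge's reply. Everything here is an assumption made for illustration: the `JudgeScore` record, the `JudgePrompts` class, and the regex-based parsing are hypothetical, and the call that actually sends the prompt to the model is left out. In a real project a JSON library, or Spring AI's structured output support, is probably a better fit than a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for the judge's scores, each on the 1-5 scale.
record JudgeScore(int accuracy, int relevance, int tone) {

    // Lowest of the three scores; a simple way to gate on the weakest dimension.
    int weakest() {
        return Math.min(accuracy, Math.min(relevance, tone));
    }
}

final class JudgePrompts {

    private JudgePrompts() {}

    // Builds the judge prompt for one question/answer pair.
    static String build(String question, String answer) {
        return """
                You are reviewing an HR assistant's answer.
                Question: %s
                Answer: %s

                Rate the answer from 1 to 5 for accuracy, relevance, and tone.
                Return only JSON: {"accuracy": X, "relevance": X, "tone": X, "feedback": "..."}
                """.formatted(question, answer);
    }

    // Extracts the three integer scores from the judge's reply.
    // Tolerates extra text around the JSON, which models sometimes add.
    static JudgeScore parse(String judgeReply) {
        return new JudgeScore(
                score(judgeReply, "accuracy"),
                score(judgeReply, "relevance"),
                score(judgeReply, "tone"));
    }

    // Finds the first occurrence of key: N where N is a single digit from 1 to 5.
    private static int score(String reply, String key) {
        Matcher m = Pattern.compile("\"?" + key + "\"?\\s*:\\s*([1-5])\\b").matcher(reply);
        if (!m.find()) {
            throw new IllegalArgumentException("Judge reply has no valid score for: " + key);
        }
        return Integer.parseInt(m.group(1));
    }
}
```

Throwing on a missing or out-of-range score is deliberate: a judge reply that cannot be parsed should fail the test loudly rather than count as a pass.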

Spring AI's Built-In Evaluators

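// Relevance evaluator: is the answer relevant to the question, given the context?
RelevancyEvaluator evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

// The context list holds Document objects, not raw strings, so wrap the policy text
EvaluationResponse result = evaluator.evaluate(
        new EvaluationRequest(question, List.of(new Document(context)), answer)
);

boolean isRelevant = result.isPass();
// As of Spring AI 1.0, RelevancyEvaluator asks the judge for a yes/no verdict,
// so the score is 1.0 on a pass and 0.0 otherwise
float score = result.getScore();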

What You Will Build — Evaluation Test Suite

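// Package names below assume Spring AI 1.0; milestone releases may place these classes elsewhere
import java.util.List;
import java.util.stream.Stream;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.evaluation.RelevancyEvaluator;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.document.Document;
import org.springframework.ai.evaluation.EvaluationRequest;
import org.springframework.ai.evaluation.EvaluationResponse;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest
class HrChatQualityTest {

    @Autowired ChatClient chatClient;   // the SmartHR bot under test
    @Autowired ChatModel chatModel;     // used to build the judge

    RelevancyEvaluator evaluator;

    @BeforeEach
    void setup() {
        evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));
    }

    // context is the reference policy text the judge compares the answer against
    @ParameterizedTest
    @MethodSource("hrQuestions")
    void shouldAnswerHrQuestionsRelevantly(String question, String context) {
        String answer = chatClient.prompt().user(question).call().content();

        // The context list holds Document objects, not raw strings
        EvaluationResponse result = evaluator.evaluate(
                new EvaluationRequest(question, List.of(new Document(context)), answer));

        // A failed assertion fails the test, and with it the build
        assertThat(result.getScore())
                .as("Answer quality too low for question: " + question)
                .isGreaterThanOrEqualTo(0.7f);
    }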

    static Stream<Arguments> hrQuestions() {
        return Stream.of(
                Arguments.of("How many vacation days do new employees get?",
                             "TechCorp policy: 15 days for first year"),
                Arguments.of("What is the notice period for resignation?",
                             "TechCorp policy: 30 days notice required"),
                Arguments.of("When does health insurance start?",
                             "TechCorp policy: Health insurance from day one")
        );
    }
}

Evaluation Metrics to Track

Metric	What it measures	Minimum acceptable score (0 to 1)
Relevancy	Is the answer on-topic?	≥ 0.7
Faithfulness	Does it stick to facts (not hallucinate)?	≥ 0.8
Completeness	Does it address the full question?	≥ 0.65
Tone	Is it professional?	≥ 0.75
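Thresholds like these can be enforced in one place. The following is a minimal sketch, assuming each metric has already been scored and normalised to the 0 to 1 range; the `QualityGate` class and its method names are hypothetical, and how each score is produced is outside its scope.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical quality gate: compares measured scores against per-metric minimums.
final class QualityGate {

    // Minimum acceptable score per metric, copied from the table above.
    static final Map<String, Double> THRESHOLDS = thresholds();

    private QualityGate() {}

    private static Map<String, Double> thresholds() {
        Map<String, Double> t = new LinkedHashMap<>();  // keeps the table's order
        t.put("relevancy", 0.70);
        t.put("faithfulness", 0.80);
        t.put("completeness", 0.65);
        t.put("tone", 0.75);
        return Collections.unmodifiableMap(t);
    }

    // Returns one message per metric that is missing or below its threshold.
    // An empty list means every metric passed.
    static List<String> failures(Map<String, Double> scores) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Double> e : THRESHOLDS.entrySet()) {
            Double actual = scores.get(e.getKey());
            if (actual == null) {
                failed.add(e.getKey() + ": no score reported");
            } else if (actual < e.getValue()) {
                failed.add(String.format(Locale.ROOT, "%s: %.2f is below %.2f",
                        e.getKey(), actual, e.getValue()));
            }
        }
        return failed;
    }
}
```

In a test, asserting that `QualityGate.failures(scores)` is empty fails the build when any metric slips, and the returned messages say which one. Treating a missing score as a failure avoids a silent pass when an evaluator was never run.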

Summary

In this chapter you will:

Understand why deterministic unit tests fail for AI and what to do instead
Use the AI-as-judge pattern to evaluate answer quality
Use Spring AI's RelevancyEvaluator
Build a parameterised test suite that fails the build if quality drops

What's Next

In Chapter 18, we focus on performance — prompt caching, batch processing, and how to handle hundreds of AI requests efficiently without overwhelming Ollama.

Code for this chapter: code/chapter-17-evaluation/
