Spring AI with Llama · Chapter 13

Streaming API: Real-Time Token-by-Token Responses

⚠️ Draft — This chapter is a work in progress. Code snippets have not yet been validated against the running codebase and may need fixes before use.

What you will build: A live-streaming HR chat interface — instead of waiting 10 seconds for a full response, employees see the answer appearing word-by-word as Llama generates it, just like ChatGPT.


The Problem We Are Solving

The SmartHR bot works well but feels slow. For long answers, the UI shows a spinner for 8-10 seconds, then the full text appears at once. Employees complain it feels unresponsive.

Dev tells Sarah:

"The model is actually generating the answer token-by-token the whole time. We're just not showing it until it finishes. We can stream it instead."


What You Will Learn


Blocking vs Streaming

// Blocking — waits for the full response (current approach)
String answer = chatClient.prompt().user(question).call().content();
// User waits 8 seconds → sees full text at once

// Streaming — emits each token as it is generated
Flux<String> stream = chatClient.prompt().user(question).stream().content();
// User sees text appear word-by-word immediately
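
The difference is easier to see in a toy model that has no Spring dependency at all. The sketch below is only an illustration: FakeModel and BlockingVsStreamingDemo are hypothetical names, and the token list stands in for whatever Llama would generate. Both methods produce the same text, but the streaming one hands over each piece as soon as it exists.

```java
import java.util.List;
import java.util.function.Consumer;

// Toy model only: FakeModel is a hypothetical stand-in for the LLM, not a Spring AI type.
class FakeModel {
    private final List<String> tokens;

    FakeModel(List<String> tokens) {
        this.tokens = tokens;
    }

    // Blocking style: the caller gets nothing until every token has been produced.
    String call() {
        StringBuilder full = new StringBuilder();
        for (String token : tokens) {
            full.append(token);
        }
        return full.toString();
    }

    // Streaming style: each token is handed to the caller as soon as it exists.
    void stream(Consumer<String> onToken) {
        for (String token : tokens) {
            onToken.accept(token);
        }
    }
}

class BlockingVsStreamingDemo {
    public static void main(String[] args) {
        FakeModel model = new FakeModel(List.of("Streaming", " feels", " faster", "."));

        // One large result at the end.
        System.out.println(model.call());

        // The same text, delivered in pieces the UI can render immediately.
        StringBuilder shown = new StringBuilder();
        model.stream(token -> {
            shown.append(token);
            System.out.println("UI now shows: " + shown);
        });
    }
}
```

In the real application, the Flux<String> returned by stream().content() plays the role of the callback: each subscriber is notified per chunk instead of once at the end.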

Building a Streaming Endpoint (SSE)

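Server-Sent Events (SSE) is a browser standard for pushing a one-way stream of text events from server to client over a single long-lived HTTP response. That makes it a natural fit for sending tokens from a Spring Boot controller to a browser.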

@GetMapping(value = "/hr/ask/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> streamAnswer(@RequestParam String question) {
    return chatClient
            .prompt()
            .user(question)
            .stream()
            .content();
}

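That is the entire endpoint. Returning a Reactor Flux<String> with the TEXT_EVENT_STREAM_VALUE media type tells Spring to write each emitted string to the response as a separate SSE event as soon as it arrives.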
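
To make the wire format concrete, the sketch below frames a payload the way the text/event-stream format describes: one data: line per line of payload, followed by a blank line that ends the event. SseEncoder is a hypothetical helper written for illustration, not a Spring class, and Spring's own writer may differ in detail.

```java
// Sketch of SSE framing as described by the text/event-stream format.
// SseEncoder is a hypothetical helper for illustration, not a Spring class.
class SseEncoder {
    // One event = one "data:" line per line of payload, closed by a blank line.
    static String encode(String payload) {
        StringBuilder frame = new StringBuilder();
        for (String line : payload.split("\n", -1)) {
            frame.append("data:").append(line).append('\n');
        }
        frame.append('\n'); // blank line marks the end of the event
        return frame.toString();
    }

    public static void main(String[] args) {
        // Two chunks become two separate events on the wire.
        System.out.print(encode("Hello"));
        System.out.print(encode("Line one\nLine two"));
    }
}
```

A payload containing a newline is split across several data: lines within the same event, so a multi-line chunk still arrives in the browser as a single message.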


Consuming the Stream in a Browser

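// EventSource only supports GET, so the question travels as a URL-encoded query parameter
const question = 'What is the leave policy?';
const source = new EventSource(
    '/hr/ask/stream?question=' + encodeURIComponent(question)
);

// Append each chunk to the answer element as it arrives
source.onmessage = (event) => {
    document.getElementById('answer').innerText += event.data;
};

// EventSource reconnects automatically, even after the server finishes the stream.
// Closing on error prevents the same question from being sent again.
source.onerror = () => source.close();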
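
EventSource does the parsing for you, but it helps to know the rules it applies. The Java sketch below imitates them for data fields only; SseParser is a hypothetical name, and it deliberately ignores event, id, retry, and comment lines. It could also serve as a starting point for reading the stream from a Java test client.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how an SSE client turns raw lines into message payloads.
// SseParser is a hypothetical name; only "data" fields are handled.
class SseParser {
    static List<String> parse(List<String> lines) {
        List<String> messages = new ArrayList<>();
        List<String> dataLines = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                // Blank line: dispatch the event if it carried any data.
                if (!dataLines.isEmpty()) {
                    messages.add(String.join("\n", dataLines));
                    dataLines.clear();
                }
            } else if (line.startsWith("data:")) {
                String value = line.substring(5);
                // The format strips exactly one leading space after the colon.
                if (value.startsWith(" ")) {
                    value = value.substring(1);
                }
                dataLines.add(value);
            }
        }
        return messages;
    }
}
```

Note the space-stripping rule: if a server writes a chunk such as " world" directly after data:, the client receives "world". If words run together in the browser, this is one possible cause worth checking; sending each chunk as JSON is one way to sidestep it.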

Streaming with Metadata

@GetMapping(value = "/hr/ask/stream/full", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<ChatResponse> streamFull(@RequestParam String question) {
    return chatClient
            .prompt()
            .user(question)
            .stream()
            .chatResponse();  // one ChatResponse per streamed chunk, including metadata
}

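Each ChatResponse in the stream carries one chunk of text plus metadata such as finish reason, token usage, and model details. Depending on the model provider, some of these fields may stay empty until the final chunk.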
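
A common follow-up is to rebuild the full answer while keeping whatever metadata eventually shows up. The sketch below shows that fold in plain Java. Chunk and Aggregate are hypothetical, simplified stand-ins for ChatResponse and are not Spring AI types; the sketch also assumes that metadata may be missing (null) on most chunks.

```java
import java.util.List;

// Chunk is a hypothetical, simplified stand-in for one streamed ChatResponse.
// finishReason and totalTokens are null on chunks that do not carry them.
record Chunk(String text, String finishReason, Integer totalTokens) {}

// Result of folding a whole stream into one value.
record Aggregate(String fullText, String finishReason, Integer totalTokens) {}

class ChunkAggregator {
    static Aggregate aggregate(List<Chunk> chunks) {
        StringBuilder text = new StringBuilder();
        String finishReason = null;
        Integer totalTokens = null;
        for (Chunk chunk : chunks) {
            text.append(chunk.text());
            // Keep the most recent non-null metadata values.
            if (chunk.finishReason() != null) finishReason = chunk.finishReason();
            if (chunk.totalTokens() != null) totalTokens = chunk.totalTokens();
        }
        return new Aggregate(text.toString(), finishReason, totalTokens);
    }
}
```

Keeping the latest non-null value, rather than reading metadata from the first chunk, means the result is still correct if the provider only reports usage at the end.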


Error Handling in Streams

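Errors in a reactive stream arrive as a terminal signal rather than a thrown exception, so a try/catch around the controller method will not see them. Use onErrorResume to replace the failure with a fallback message:

return chatClient.prompt().user(question)
        .stream()
        .content()
        // If the model call fails mid-stream, emit one fallback message and complete normally
        .onErrorResume(e -> Flux.just("[Error: " + e.getMessage() + "]"));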
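
Passing e.getMessage() straight to the browser can expose internal details such as hostnames or stack fragments. One option, sketched below, is to map exceptions to fixed messages instead. StreamErrors and its message texts are hypothetical and chosen only for illustration; the exception types your model client actually throws may differ.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: maps exceptions to fixed, user-safe messages so that
// internal details from e.getMessage() never reach the chat window.
class StreamErrors {
    static String userMessage(Throwable error) {
        if (error instanceof TimeoutException) {
            return "[The assistant took too long to respond. Please try again.]";
        }
        if (error instanceof IOException) {
            return "[The assistant is unreachable right now. Please try again later.]";
        }
        return "[Something went wrong while generating the answer.]";
    }
}
```

With a helper like this, the fallback could become Flux.just(StreamErrors.userMessage(e)), with the original exception logged on the server.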

When to Use Streaming

Use Streaming                     Use Blocking
--------------------------------  ---------------------------------------------
UI-facing chat interfaces         Backend-to-backend calls
Long answers (>3 seconds)         Short answers (<1 second)
When UX responsiveness matters    When you need the full text before processing
Real-time dashboards              Batch processing

Summary

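In this chapter you compared blocking and streaming calls on chatClient, built an SSE endpoint that returns Flux<String>, consumed it in the browser with EventSource, streamed full ChatResponse objects to access metadata, and added a fallback for stream errors with onErrorResume.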


What's Next

In Chapter 14, we add document intelligence — the ability to ingest PDFs, Word documents, and web pages, making the SmartHR assistant able to read and analyse any document Sarah uploads.

Code for this chapter: code/chapter-13-streaming-api/