Spring AI with Llama · Chapter 20

Production Deployment: Docker, Observability, and Going Live

⚠️ Draft — This chapter is a work in progress. Code snippets have not yet been validated against the running codebase and may need fixes before use.

What you will build: A production-ready SmartHR Assistant — Dockerised, health-checked, observable with Micrometer metrics, running Ollama in a container, and deployed behind a reverse proxy. The complete system Sarah can hand to TechCorp's IT team to run.


The Problem We Are Solving

The SmartHR Assistant works perfectly on Dev's laptop. Now TechCorp's IT team needs to run it on a server. They ask:

"How do we deploy this? How do we monitor it? What happens if Ollama crashes? How do we update the Llama model without downtime?"

This chapter answers all of it.


What You Will Learn


Dockerfile

FROM eclipse-temurin:21-jre-alpine

WORKDIR /app
COPY target/smarthr-assistant-*.jar app.jar

# Health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD wget -qO- http://localhost:8080/actuator/health || exit 1

EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]

Docker Compose — Full Stack

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama    # persist downloaded models
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 5

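  smarthr:
    build: .
    ports:
      - "8080:8080"
    environment:
      SPRING_PROFILES_ACTIVE: prod           # load application-prod.properties
      SPRING_AI_OLLAMA_BASE_URL: http://ollama:11434
      SPRING_AI_OLLAMA_CHAT_OPTIONS_MODEL: llama3.2
      DB_PASSWORD: ${DB_PASSWORD}            # read by spring.datasource.password
    depends_on:
      ollama:
        condition: service_healthy
      postgres:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:8080/actuator/health"]
      interval: 30s
      timeout: 5s
      retries: 3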

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: smarthr
      POSTGRES_USER: smarthr
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data

volumes:
  ollama-models:
  postgres-data:

Start the full stack:

docker compose up -d

# Pull the model into the Ollama container (one time)
docker compose exec ollama ollama pull llama3.2
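
After the pull, it can be useful to confirm from outside the container that Ollama really lists the model. The sketch below is one way to do that with only the JDK: it calls Ollama's GET /api/tags endpoint and looks for the model name in the response. The class name OllamaModelCheck and the regex-based matching are illustrative assumptions, not part of the SmartHR codebase; a production check would parse the JSON properly.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Pattern;

// Hypothetical helper: checks whether an Ollama server lists a given model.
final class OllamaModelCheck {

    /** True if the /api/tags JSON has a "name" equal to the model, with or without a tag suffix. */
    static boolean listsModel(String tagsJson, String model) {
        if (tagsJson == null) {
            return false;
        }
        // Matches "name":"llama3.2" and "name":"llama3.2:latest", but not "llama3.2-vision".
        Pattern p = Pattern.compile("\"name\"\\s*:\\s*\"" + Pattern.quote(model) + "(:[^\"]*)?\"");
        return p.matcher(tagsJson).find();
    }

    /** Calls GET {baseUrl}/api/tags; any connection problem counts as "not available". */
    static boolean modelAvailable(String baseUrl, String model) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/api/tags"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 && listsModel(response.body(), model);
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:11434";
        String model = args.length > 1 ? args[1] : "llama3.2";
        System.out.println(model + " available: " + modelAvailable(baseUrl, model));
    }
}
```

Run it from the host after docker compose up; the default URL matches the 11434 port mapping in the Compose file.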

Micrometer Metrics for AI Calls

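import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class MeteredChatService {

    private final MeterRegistry registry;
    private final ChatClient chatClient;

    // Spring Boot auto-configures both the MeterRegistry and the ChatClient.Builder
    public MeteredChatService(MeterRegistry registry, ChatClient.Builder chatClientBuilder) {
        this.registry = registry;
        this.chatClient = chatClientBuilder.build();
    }

    public String ask(String question) {
        // Timer.record times the supplier and returns the value it produces
        return Timer.builder("smarthr.ai.call")
                .tag("endpoint", "hr-ask")
                .register(registry)
                .record(() -> chatClient.prompt().user(question).call().content());
    }
}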

Add Actuator and Prometheus:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

# application.properties
management.endpoints.web.exposure.include=health,metrics,prometheus
management.endpoint.health.show-details=always

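Metrics to monitor:

- smarthr.ai.call: response time per endpoint (a Micrometer Timer)
- smarthr.ai.call count: request volume, read from the same timer (exported to Prometheus as smarthr_ai_call_seconds_count)
- jvm.memory.used: JVM memory in use, including heap (grows with chat memory)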
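
To make the timer metrics concrete: in the Prometheus text format a Micrometer Timer appears as three series with the suffixes _seconds_count, _seconds_sum, and _seconds_max, and the mean latency is the sum divided by the count. The sketch below works through that arithmetic on a hand-written scrape. The class name TimerMath and the sample numbers are invented for illustration, and the parser assumes sample lines carry no trailing timestamp.

```java
// Hypothetical helper: derives mean latency from a Prometheus text scrape.
final class TimerMath {

    // Illustrative scrape only; these values are made up, not measured.
    static final String SAMPLE = String.join("\n",
            "# TYPE smarthr_ai_call_seconds summary",
            "smarthr_ai_call_seconds_count{endpoint=\"hr-ask\"} 4",
            "smarthr_ai_call_seconds_sum{endpoint=\"hr-ask\"} 10.0",
            "smarthr_ai_call_seconds_max{endpoint=\"hr-ask\"} 4.2");

    /** Returns the value of the first sample line for the given metric name. */
    static double sample(String scrape, String metric) {
        for (String line : scrape.split("\n")) {
            if (line.startsWith(metric + "{") || line.startsWith(metric + " ")) {
                // The value is the last space-separated token on the line.
                return Double.parseDouble(line.substring(line.lastIndexOf(' ') + 1).trim());
            }
        }
        throw new IllegalArgumentException("metric not found: " + metric);
    }

    /** Mean seconds per call since the app started: total time divided by call count. */
    static double meanSeconds(String scrape, String timer) {
        double sum = sample(scrape, timer + "_seconds_sum");
        double count = sample(scrape, timer + "_seconds_count");
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        // 10.0 seconds over 4 calls
        System.out.println(meanSeconds(SAMPLE, "smarthr_ai_call")); // prints 2.5
    }
}
```

This is a lifetime average. On a dashboard you would more likely divide the rates of the two series over a recent window, so that a slow hour is not hidden by a fast week.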


Production Configuration

# application-prod.properties

# Ollama — point to container
spring.ai.ollama.base-url=http://ollama:11434

# Conservative defaults for production
spring.ai.ollama.chat.options.temperature=0.3
spring.ai.ollama.chat.options.num-predict=500

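# Connection timeouts
# Unverified: confirm that your Spring AI version binds a timeout under chat.options.
# Unknown keys are silently ignored, so test the limit with a slow request before relying on it.
spring.ai.ollama.chat.options.timeout=30s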

# Database
spring.datasource.url=jdbc:postgresql://postgres:5432/smarthr
spring.datasource.username=smarthr
spring.datasource.password=${DB_PASSWORD}

# Actuator
management.endpoints.web.exposure.include=health,metrics,prometheus

Zero-Downtime Model Updates

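# 1. Pull the new model while the old one is still serving requests
docker compose exec ollama ollama pull llama3.1:8b

# 2. Update the model name. The Compose environment variable overrides the properties file:
# SPRING_AI_OLLAMA_CHAT_OPTIONS_MODEL: llama3.1:8b   (in docker-compose.yml)

# 3. Recreate only the app container
docker compose up -d --no-deps smarthr

Ollama keeps serving the old model until the app restarts, so downloading the new model causes no outage. With a single smarthr container, step 3 still interrupts requests for the few seconds the app takes to start; run two or more replicas behind the reverse proxy if that gap matters.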
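
A deployment script usually should not report success until the restarted app is healthy again. The sketch below polls the Actuator health endpoint, which returns HTTP 200 when the application status is UP, with a bounded number of retries. The class name RolloutGate, the retry count, and the delay are illustrative assumptions; tune them to how long the app takes to start.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper: blocks until a health probe succeeds or retries run out.
final class RolloutGate {

    /** Calls the probe up to maxAttempts times, sleeping between failed attempts. */
    static boolean waitUntil(BooleanSupplier probe, int maxAttempts, Duration delay) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay.toMillis());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    /** True if the URL answers with HTTP 200; connection errors count as unhealthy. */
    static boolean healthOk(String url) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://localhost:8080/actuator/health";
        // Up to 30 attempts, 2 seconds apart: roughly one minute in total.
        boolean healthy = waitUntil(() -> healthOk(url), 30, Duration.ofSeconds(2));
        System.out.println(healthy ? "smarthr is healthy" : "smarthr did not become healthy");
        System.exit(healthy ? 0 : 1);
    }
}
```

Run it right after step 3; the non-zero exit code lets a deployment script stop or roll back when the new container never becomes healthy.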


Production Checklist

Item                                Chapter   Status
Prompt injection guard              16
PII scrubbing from logs             16
Rate limiting                       15
HTTPS (TLS termination at proxy)    17
Authentication on endpoints         17
Health checks                       17
Metrics and alerts                  17
Model pinned to specific version    17
Database backups                    17
Log aggregation                     17

Summary

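In this final chapter you:

- Packaged the SmartHR Assistant as a Docker image with a health check
- Ran the app, Ollama, and PostgreSQL with pgvector together under Docker Compose
- Timed AI calls with Micrometer and exposed them through Actuator and Prometheus
- Set production defaults in application-prod.properties
- Switched Llama models by pulling the new model before restarting the app
- Walked through the production checklist
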

The End — What You Built

Starting from a blank Spring Boot project, you built a production-ready AI-powered HR platform:

Chapter 1   → Basic HR Q&A endpoint
Chapter 2   → Controlled, mode-based responses
Chapter 3   → Multi-model support
Chapter 4   → Personalised, template-driven responses
Chapter 5   → Resume parsing to typed Java objects
Chapter 6   → Stateful onboarding chatbot
Chapter 7   → Policy document Q&A with RAG
Chapter 8   → Interview scheduling with function calling
Chapter 9   → Workplace safety analysis with vision
Chapter 10  → Real-time streaming responses
Chapter 11  → Contract intelligence from PDFs
Chapter 12  → Semantic candidate search
Chapter 13  → Autonomous monthly report agent
Chapter 14  → Automated AI quality evaluation
Chapter 15  → Bulk processing and caching
Chapter 16  → Security hardening
Chapter 17  → Production deployment ← You are here

Sarah's Monday mornings are no longer about answering the same questions. They are about reviewing what the SmartHR Assistant accomplished overnight.

Code for this chapter: code/chapter-20-production-deployment/