Running and Comparing Multiple Models with Ollama


The Problem We Are Solving

Sarah has heard that there are other AI models besides Llama. A colleague told her that mistral is faster, and codellama is better at technical questions. She asks Dev:

"Can we try different models without rewriting the whole app? And can we see how they compare?"

The answer is yes. Switching the app's default model takes one line of config, and comparing two models side by side takes one small endpoint.


What You Will Learn


Ollama Model Management

# Pull models
ollama pull llama3.2       # default — good all-rounder
ollama pull mistral        # fast, strong reasoning
ollama pull codellama      # optimised for code
ollama pull phi3:mini      # smallest here, suited to low-RAM machines
ollama pull gemma2         # Google's open model

# List what you have
ollama list

# Remove a model
ollama rm mistral
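`ollama list` reads from the local Ollama server, which exposes the same information over HTTP at `GET /api/tags`. The plain-Java sketch below lists pulled models from code. It assumes Ollama is running on its default address `http://localhost:11434`, uses a deliberately simple regex instead of a JSON library, and the class name `OllamaModelLister` is hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists locally pulled models via Ollama's /api/tags endpoint.
public class OllamaModelLister {

    // Assumption: Ollama is listening on its default port.
    static final String TAGS_URL = "http://localhost:11434/api/tags";

    // Matches "name":"<value>" pairs. Good enough for a sketch;
    // use a real JSON library (e.g. Jackson) in application code.
    private static final Pattern NAME = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]+)\"");

    static List<String> parseModelNames(String json) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(json);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(TAGS_URL)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Prints one model name per line, in Ollama's name:tag form.
        parseModelNames(response.body()).forEach(System.out::println);
    }
}
```

The names come back with their tags (for example `mistral:latest`), which is worth knowing when you compare them against the model names used in config.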

Switching Models via Config (No Code Changes)

# application.yml — change this one line to switch models
spring:
  ai:
    ollama:
      chat:
        options:
          model: mistral

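Restart the app, and every endpoint that relies on the default model now uses mistral, with no Java changes. The model must already be pulled (ollama pull mistral); otherwise Ollama cannot find it and the request fails.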
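Ollama treats a model name without an explicit tag as `:latest`, so `ollama list` shows `mistral` as `mistral:latest`. The plain-Java sketch below checks whether a configured model name has been pulled, normalising both sides first. The class and method names are hypothetical, and the sketch ignores names that include a registry host.

```java
import java.util.List;

// Hypothetical helper: checks that a configured model name has been pulled.
public class ModelCheck {

    // Ollama treats a name without a tag as ":latest",
    // so "mistral" and "mistral:latest" refer to the same model.
    static String withDefaultTag(String model) {
        return model.contains(":") ? model : model + ":latest";
    }

    // True if the configured model appears in the list of pulled models.
    static boolean isPulled(String configuredModel, List<String> pulledModels) {
        String wanted = withDefaultTag(configuredModel);
        return pulledModels.stream()
                .map(ModelCheck::withDefaultTag)
                .anyMatch(wanted::equals);
    }

    public static void main(String[] args) {
        List<String> pulled = List.of("llama3.2:latest", "mistral:latest", "phi3:mini");
        System.out.println(isPulled("mistral", pulled));   // true
        System.out.println(isPulled("phi3:mini", pulled)); // true
        System.out.println(isPulled("gemma2", pulled));    // false
    }
}
```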


Switching Models Per-Request

import org.springframework.ai.ollama.api.OllamaOptions;

// Override the default model for a single call using Ollama-specific options.
// Written against the Spring AI 1.0 API; pre-1.0 milestones used withModel(...).
String answer = chatClient
        .prompt()
        .user(request.question())
        .options(OllamaOptions.builder()
                .model("mistral")   // applies to this request only
                .build())
        .call()
        .content();
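Underneath, the Ollama integration talks to Ollama's `/api/chat` REST endpoint, and the per-request model ends up in the `model` field of that request. The plain-Java sketch below sends such a request directly, which can help when you want to confirm which model actually answered. It assumes Ollama's default address; the class and method names are hypothetical, and the JSON escaping covers only quotes, backslashes and common control characters.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: calls Ollama's /api/chat directly with an explicit model.
public class RawOllamaChat {

    static final String CHAT_URL = "http://localhost:11434/api/chat"; // assumed default address

    // Minimal JSON string escaping for this sketch (backslash must go first).
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t");
    }

    // "stream": false asks Ollama for one JSON object instead of a stream of chunks.
    static String buildChatBody(String model, String question) {
        return "{\"model\":\"" + escape(model) + "\","
                + "\"messages\":[{\"role\":\"user\",\"content\":\"" + escape(question) + "\"}],"
                + "\"stream\":false}";
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        String body = buildChatBody("mistral", "What is a probation period?");
        HttpRequest request = HttpRequest.newBuilder(URI.create(CHAT_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The raw JSON includes a "model" field naming the model that answered.
        System.out.println(response.body());
    }
}
```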

What You Will Build — Model Comparison Endpoint

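import org.springframework.ai.ollama.api.OllamaOptions;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;

// POST /hr/ask/compare
// Sends the same question to two models and returns both answers.
// The members below go inside the HR controller class, which is mapped to /hr
// and holds the chatClient field used throughout this chapter.

public record CompareRequest(String question, String modelA, String modelB) {}
public record CompareResponse(String question, String answerA, String answerB,
                               String modelA, String modelB) {}

@PostMapping("/ask/compare")
public CompareResponse compare(@RequestBody CompareRequest request) {
    // The calls run one after the other, so the response takes roughly as long as both combined.
    String answerA = askWithModel(request.question(), request.modelA());
    String answerB = askWithModel(request.question(), request.modelB());
    return new CompareResponse(request.question(), answerA, answerB,
                               request.modelA(), request.modelB());
}

// Sends one question to one model, overriding the default model from application.yml.
private String askWithModel(String question, String model) {
    return chatClient.prompt()
            .user(question)
            .options(OllamaOptions.builder()
                    .model(model)
                    .build())
            .call()
            .content();
}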
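The endpoint returns only the two answers. To check the speed claims as well, one option is to time each call. The plain-Java sketch below takes the "ask" step as a `BiFunction<String, String, String>` of (question, model) to answer, so a method such as askWithModel could be passed in; the `ModelTimer` class and `TimedAnswer` record are hypothetical. Ollama loads a model into memory on its first request, so run each model once before comparing timings.

```java
import java.util.function.BiFunction;

// Hypothetical helper: times one question against one model.
public class ModelTimer {

    public record TimedAnswer(String model, String answer, long elapsedMillis) {}

    static TimedAnswer timed(String question, String model,
                             BiFunction<String, String, String> ask) {
        long start = System.nanoTime();
        String answer = ask.apply(question, model);   // e.g. askWithModel(question, model)
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new TimedAnswer(model, answer, elapsedMillis);
    }

    public static void main(String[] args) {
        // Stand-in for a real model call, so the sketch runs without Ollama.
        BiFunction<String, String, String> fakeAsk = (q, m) -> m + " received " + q.length() + " chars";
        TimedAnswer a = timed("Define onboarding.", "llama3.2", fakeAsk);
        TimedAnswer b = timed("Define onboarding.", "mistral", fakeAsk);
        System.out.println(a);
        System.out.println(b);
    }
}
```

Adding `elapsedMillis` for each model to the response would let a single curl call show both quality and latency.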

Test it:

curl -s -X POST http://localhost:8080/hr/ask/compare \
  -H "Content-Type: application/json" \
  -d '{"question": "What is a good onboarding plan for a new software engineer?",
       "modelA": "llama3.2",
       "modelB": "mistral"}'

Model Cheat Sheet

Model           Approx. RAM   Best for                        Speed
llama3.2 (3B)   4GB           General Q&A                     Fast
llama3.1:8b     8GB           Better reasoning                Medium
mistral         5GB           Instruction following           Fast
codellama       4GB           Code, technical docs            Fast
phi3:mini       2GB           Lightweight, constrained envs   Very fast
gemma2          5GB           Multi-language                  Medium
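If a setup script or the app should suggest which models a machine can handle, the cheat sheet can be turned into data. The plain-Java sketch below filters models by free RAM, largest first. The class, record and method names are hypothetical, and the RAM figures are the table's rough guide values, not measured requirements.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: filters the cheat-sheet models by available RAM.
public class ModelPicker {

    record ModelInfo(String name, int approxRamGb, String bestFor) {}

    // Rough guide values from the cheat sheet.
    // Llama 3.2 has no 8B text model, so the 8B entry is tagged llama3.1:8b.
    static final List<ModelInfo> MODELS = List.of(
            new ModelInfo("llama3.2", 4, "General Q&A"),
            new ModelInfo("llama3.1:8b", 8, "Better reasoning"),
            new ModelInfo("mistral", 5, "Instruction following"),
            new ModelInfo("codellama", 4, "Code, technical docs"),
            new ModelInfo("phi3:mini", 2, "Lightweight, constrained envs"),
            new ModelInfo("gemma2", 5, "Multi-language"));

    // Models whose guide value fits in freeRamGb, largest first (ties keep table order).
    static List<String> fitting(int freeRamGb) {
        return MODELS.stream()
                .filter(m -> m.approxRamGb() <= freeRamGb)
                .sorted(Comparator.comparingInt(ModelInfo::approxRamGb).reversed())
                .map(ModelInfo::name)
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(fitting(4)); // [llama3.2, codellama, phi3:mini]
    }
}
```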

Summary

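In this chapter you pulled, listed, and removed models with the Ollama CLI, switched the app's default model with a single property in application.yml, overrode the model for individual requests with Ollama-specific options, and built a /hr/ask/compare endpoint that sends one question to two models and returns both answers.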


What's Next

In Chapter 4, we tackle prompt engineering — using PromptTemplate to inject dynamic data (employee name, department, role) into prompts so the HR assistant gives personalised, TechCorp-branded responses.

Code for this chapter: code/chapter-03-comparing-models/