Performance and Caching: Handling Scale Efficiently
⚠️ Draft — This chapter is a work in progress. Code snippets have not yet been validated against the running codebase and may need fixes before use.
What you will build: A bulk job description generator. Lisa uploads a spreadsheet of 50 open roles and the system generates polished job descriptions for all of them concurrently, with caching so that an identical prompt is never sent to the model twice.
The Problem We Are Solving
The SmartHR bot works great for one user at a time. But TechCorp is growing. Dev sees a new requirement:
"We need to generate job descriptions for 50 open roles before the recruitment drive next week. Can we do them all at once?"
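A naive sequential approach would take 50 × 8 seconds = 400 seconds, or about 7 minutes. Running five calls concurrently brings that down to roughly a minute and a half, and any prompt that repeats is answered from the cache in seconds.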
What You Will Learn
- Why AI calls are slow and where the time goes
- Concurrent AI calls with virtual threads (Java 21)
- Response caching for repeated or similar prompts
- Batch processing patterns for bulk AI tasks
- Rate limiting to protect Ollama from overload
Where the Time Goes
Total time for one AI call (~8 seconds):
├── Network to Ollama: ~10ms (negligible — it's localhost)
├── Token generation: ~7500ms ← this is the bottleneck
└── Response parsing: ~10ms (negligible)
The model generates tokens one at a time, so for a given model and machine there is little you can do to make a single call faster. What you can do is run several calls at the same time.
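A rough way to reason about the gain is a "wave" model: with a concurrency limit of k, the calls run in ceil(n / k) waves and each wave lasts one full call. The sketch below applies that to the chapter's figures of 50 roles at about 8 seconds each. The class and method names are hypothetical, and the model assumes per-call latency does not grow under load, which is optimistic when every call shares one machine.

```java
final class BulkTimeEstimate {

    /**
     * Idealised wall-clock estimate in seconds.
     * Assumption: calls run in waves of `concurrency`, and every wave
     * takes exactly `secondsPerCall`, regardless of load.
     */
    static long estimateSeconds(int calls, int concurrency, long secondsPerCall) {
        if (calls < 0 || concurrency < 1 || secondsPerCall < 0) {
            throw new IllegalArgumentException(
                    "calls >= 0, concurrency >= 1 and secondsPerCall >= 0 are required");
        }
        long waves = (calls + concurrency - 1L) / concurrency; // ceiling division
        return waves * secondsPerCall;
    }

    public static void main(String[] args) {
        for (int concurrency : new int[] {1, 3, 5}) {
            System.out.printf("concurrency=%d -> ~%d s%n",
                    concurrency, estimateSeconds(50, concurrency, 8));
        }
    }
}
```

For 50 calls at 8 seconds this gives 400 s sequentially, 136 s at a concurrency of 3, and 80 s at a concurrency of 5. Treat these as lower bounds: real runs are slower whenever concurrent calls compete for the same CPU, GPU, or memory.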
Concurrent Calls with Virtual Threads
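Java 21's virtual threads make concurrent AI calls straightforward. Note that parallelStream() runs on the common ForkJoinPool, which uses platform threads rather than virtual threads, so submit each call to a virtual-thread-per-task executor instead:
// Needs java.util.concurrent.CompletableFuture, ExecutorService and Executors.
@PostMapping("/hr/jobs/generate-bulk")
public List<JobDescription> generateBulk(@RequestBody List<JobRole> roles) {
    // One virtual thread per role; closing the executor waits for all tasks.
    try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
        List<CompletableFuture<JobDescription>> futures = roles.stream()
                .map(role -> CompletableFuture.supplyAsync(() -> generateJobDescription(role), executor))
                .toList();
        // join() blocks until each result is ready; results keep the input order.
        return futures.stream().map(CompletableFuture::join).toList();
    }
}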
private JobDescription generateJobDescription(JobRole role) {
    String content = chatClient.prompt()
            .user("""
                Write a professional job description for:
                Role: {title}
                Department: {department}
                Level: {level}
                Key skills: {skills}
                """)
            // ... params
            .call()
            .content();
    return new JobDescription(role, content);
}
Enable virtual threads in application.yml so that Spring Boot (3.2 or later) handles each incoming request on a virtual thread:
spring:
  threads:
    virtual:
      enabled: true
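To see the fan-out pattern in isolation, here is a JDK-only sketch that needs neither Spring nor Ollama. It starts one virtual thread per input and collects the results in input order. The class name, `mapConcurrently`, and `fakeModelCall` are hypothetical; `fakeModelCall` merely sleeps to stand in for a slow model call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class VirtualThreadFanOut {

    /** Runs `task` once per input on its own virtual thread; results keep input order. */
    static <T, R> List<R> mapConcurrently(List<T> inputs, Function<T, R> task) {
        // try-with-resources: close() waits for all submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<R> call = () -> task.apply(input);
                futures.add(executor.submit(call));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(await(future));
            }
            return results;
        }
    }

    /** Waits for one result, converting checked exceptions to unchecked ones. */
    private static <R> R await(Future<R> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for a task", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        }
    }

    /** Hypothetical stand-in for a model call: sleeps briefly, then returns text. */
    static String fakeModelCall(String title) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "Job description for " + title;
    }
}
```

Calling `VirtualThreadFanOut.mapConcurrently(titles, VirtualThreadFanOut::fakeModelCall)` with 20 titles takes about as long as a single simulated call, because each sleeping virtual thread releases its carrier thread while it waits.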
Caching Repeated Prompts
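If exactly the same prompt is sent more than once (for example, two roles in the spreadsheet with identical details, or the same employee question asked repeatedly), cache the response. The whole prompt must match: sharing a system prompt is not enough, because different user text produces a different answer.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

@Service
public class CachedChatService {
    private final ChatClient chatClient;
    // Holds up to 500 responses; each entry expires one hour after it is written.
    private final Cache<String, String> cache = Caffeine.newBuilder()
            .maximumSize(500)
            .expireAfterWrite(Duration.ofHours(1))
            .build();

    // Assumes a ChatClient bean is already configured in the application.
    public CachedChatService(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    public String ask(String question) {
        // The loader runs only on a cache miss; a hit skips the model call.
        return cache.get(question, q -> chatClient.prompt()
                .user(q)
                .call()
                .content());
    }
}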
Add Caffeine to pom.xml (Spring Boot's dependency management supplies the version):
<dependency>
    <groupId>com.github.ben-manes.caffeine</groupId>
    <artifactId>caffeine</artifactId>
</dependency>
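An exact-match cache misses when two prompts differ only in spacing or capitalisation. One way to catch those near-duplicates is to normalise the cache key while still sending the original prompt to the model. The sketch below uses a plain `ConcurrentHashMap` as a stand-in for Caffeine so it runs on the JDK alone. `NormalizedPromptCache`, `normalize`, and the `model` function are hypothetical names, and lower-casing the key is a judgment call that may be wrong for prompts where case carries meaning.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class NormalizedPromptCache {

    private final Map<String, String> responses = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the real model call (for example, a ChatClient).
    private final Function<String, String> model;

    NormalizedPromptCache(Function<String, String> model) {
        this.model = model;
    }

    /** Trims, collapses whitespace runs to one space, and lower-cases the prompt. */
    static String normalize(String prompt) {
        return prompt.strip().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Looks up by normalised key; calls the model with the original prompt on a miss. */
    String ask(String prompt) {
        return responses.computeIfAbsent(normalize(prompt), key -> model.apply(prompt));
    }

    /** Number of distinct normalised prompts currently cached. */
    int size() {
        return responses.size();
    }
}
```

With this in place, "What is the leave policy?" and "  what is the   LEAVE policy? " share one cache entry, so the model is called once for both. Unlike the Caffeine cache, this map has no size bound or expiry, so it is suitable for illustration only.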
Rate Limiting Ollama
Sending more concurrent requests than Ollama can serve makes it queue them or, on a memory-constrained machine, fail outright. Use a semaphore to cap the number of calls in flight:
private final Semaphore semaphore = new Semaphore(3); // at most 3 calls in flight

public String ask(String question) throws InterruptedException {
    semaphore.acquire(); // blocks until a permit is free
    try {
        return chatClient.prompt().user(question).call().content();
    } finally {
        semaphore.release(); // always return the permit, even if the call fails
    }
}
Three concurrent calls is a reasonable starting point for a machine with 16 GB of RAM running llama3.2; measure on your own hardware and adjust the limit.
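The semaphore matters most when it is combined with virtual threads: every task gets its own thread, yet only a fixed number hold a permit at any moment. The JDK-only sketch below shows that combination and records the peak number of tasks in flight so the limit can be checked. `ThrottledFanOut` and its methods are hypothetical names, and the tasks are whatever `Callable` you pass in rather than real Ollama calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class ThrottledFanOut {

    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peakInFlight = new AtomicInteger();

    ThrottledFanOut(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** One virtual thread per task; the semaphore caps how many run at once. */
    <R> List<R> runAll(List<Callable<R>> tasks) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<R>> futures = new ArrayList<>();
            for (Callable<R> task : tasks) {
                Callable<R> wrapped = () -> throttled(task);
                futures.add(executor.submit(wrapped));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get()); // results keep submission order
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("Task failed", e.getCause());
                }
            }
            return results;
        }
    }

    /** Acquires a permit, runs the task, and always releases the permit afterwards. */
    private <R> R throttled(Callable<R> task) throws Exception {
        permits.acquire();
        try {
            int now = inFlight.incrementAndGet();
            peakInFlight.accumulateAndGet(now, Math::max); // track the high-water mark
            return task.call();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    /** Highest number of tasks that held a permit at the same time. */
    int peakInFlight() {
        return peakInFlight.get();
    }
}
```

Submitting 12 tasks to `new ThrottledFanOut(3)` starts 12 virtual threads immediately, but `peakInFlight()` never exceeds 3. Waiting threads are cheap because a virtual thread blocked on the semaphore does not occupy a platform thread.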
Performance Benchmarks
| Strategy | 50 job descriptions | Notes |
|---|---|---|
| Sequential | ~7 minutes | One at a time |
| Parallel (3 threads) | ~2.5 minutes | 3× faster |
| Parallel (5 threads) | ~1.5 minutes | Diminishing returns |
| With caching (repeated) | ~5 seconds | Cache hit, no model call |
Summary
In this chapter you learned how to:
- Use Java 21 virtual threads for concurrent AI calls
- Cache repeated prompt responses with Caffeine
- Rate-limit Ollama to prevent overload
- Build a bulk job description generator that processes 50 roles in parallel
What's Next
In Chapter 19, we harden the SmartHR bot for production — handling prompt injection attacks, sanitising user input, protecting PII, and ensuring the bot cannot be misused by malicious employees.
Code for this chapter: code/chapter-18-performance-and-caching/