Spring AI with Llama · Chapter 12

Multimodality: Images and Text Together

⚠️ Draft — This chapter is a work in progress. Code snippets have not yet been validated against the running codebase and may need fixes before use.

What you will build: A workplace safety reporter — an employee uploads a photo of a potential workplace hazard and the AI analyses the image, identifies the risk, and auto-generates a formal safety incident report.


The Problem We Are Solving

TechCorp's office manager reports that employees spot hazards (a broken chair, a trailing cable, a blocked fire exit) but never bother filing the paperwork. Sarah wants a faster way.

"What if someone could just take a photo and the AI fills in the report for them?"

Multimodal AI — models that understand both text and images — makes this possible.


What You Will Learn


Which Models Support Vision?

Not all Ollama models can process images. Vision-capable models:

ollama pull llava          # LLaVA — the most popular vision model
ollama pull llava:13b      # larger, better quality
ollama pull moondream      # lightweight vision model
ollama pull llama3.2-vision  # Llama 3.2 with vision capability

# application.yml: switch to a vision model
spring:
  ai:
    ollama:
      chat:
        options:
          model: llava
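
Because a text-only model cannot accept images, it can help to check the configured model name before accepting uploads. The following is a minimal sketch in plain Java, not part of Spring AI or Ollama: the class name VisionModels and its hand-maintained allowlist are assumptions for illustration, built only from the models listed above.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: not part of Spring AI or Ollama. */
final class VisionModels {

    // Base names of the vision-capable models listed in this chapter.
    // Assumption: this allowlist is maintained by hand and extended as needed.
    private static final Set<String> VISION_MODELS =
            Set.of("llava", "moondream", "llama3.2-vision");

    private VisionModels() {}

    /** Returns true if the model name, with or without a tag such as ":13b", is on the allowlist. */
    static boolean isVisionCapable(String modelName) {
        if (modelName == null || modelName.isBlank()) {
            return false;
        }
        String normalized = modelName.trim().toLowerCase(Locale.ROOT);
        // Strip the optional tag so "llava:13b" is treated like "llava".
        int tagStart = normalized.indexOf(':');
        String baseName = tagStart >= 0 ? normalized.substring(0, tagStart) : normalized;
        return VISION_MODELS.contains(baseName);
    }
}
```

Calling VisionModels.isVisionCapable with the value of spring.ai.ollama.chat.options.model at startup would let the application fail early with a clear message rather than at the first image request.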

How to Send an Image in Spring AI

@PostMapping("/safety/analyse")
public SafetyReport analyseSafety(
        @RequestParam("image") MultipartFile image,
        @RequestParam("location") String location) throws IOException {

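    // Convert the uploaded file to a Spring AI media object.
    // Use the content type reported by the upload; fall back to JPEG if the client omitted it.
    var mimeType = image.getContentType() != null
            ? MimeTypeUtils.parseMimeType(image.getContentType())
            : MimeTypeUtils.IMAGE_JPEG;
    var media = new Media(mimeType, new ByteArrayResource(image.getBytes()));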

    String analysis = chatClient
            .prompt()
            .user(u -> u
                .text("""
                      Analyse this workplace photo for safety hazards.
                      Location: {location}

                      Identify:
                      1. Any visible hazards
                      2. Risk level (LOW / MEDIUM / HIGH)
                      3. Recommended immediate action
                      4. Whether this requires an official incident report
                      """)
                .param("location", location)
                .media(media))
            .call()
            .content();

    // buildReport is a helper (not shown here) that turns the raw analysis text into a SafetyReport
    return buildReport(location, analysis);
}
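
The Media object needs a MIME type that matches the actual image bytes, and neither a fixed value nor a client-supplied header is guaranteed to be right. One option is to inspect the leading signature bytes of the upload. The sketch below uses only the JDK; ImageTypeSniffer is a hypothetical helper name, and it recognises only JPEG and PNG as an assumption about which formats the reporter accepts.

```java
import java.util.Optional;

/** Hypothetical helper: detects the image type from the leading signature bytes. */
final class ImageTypeSniffer {

    private ImageTypeSniffer() {}

    /** Returns "image/jpeg" or "image/png" when the signature matches, otherwise empty. */
    static Optional<String> detectMimeType(byte[] data) {
        if (data == null) {
            return Optional.empty();
        }
        // JPEG files start with the bytes FF D8 FF.
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return Optional.of("image/jpeg");
        }
        // PNG files start with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) {
            return Optional.of("image/png");
        }
        return Optional.empty();
    }

    // Compares the first bytes of data against the expected unsigned values.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In the controller, the detected string could be passed to MimeTypeUtils.parseMimeType(...) before building the Media object, and an empty result could be rejected with a 400 response.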

Structured Safety Report Output

public record SafetyReport(
        String location,
        String hazardDescription,
        String riskLevel,         // LOW / MEDIUM / HIGH
        String recommendedAction,
        boolean requiresIncidentReport,
        LocalDateTime reportedAt
) {}

Combine this record with BeanOutputConverter from Chapter 5 so the model's reply is parsed into a fully typed SafetyReport object instead of raw text.
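
However the report is populated, the riskLevel field is a free-form string, and a model may return it in unexpected casing or with a value outside LOW / MEDIUM / HIGH. A possible way to tighten this is a small enum with lenient parsing. The RiskLevel enum and its fromModelText method below are hypothetical names, and treating unrecognised values as HIGH is an assumed design choice that errs on the side of human review.

```java
import java.util.Locale;

/** Hypothetical enum mirroring the LOW / MEDIUM / HIGH values of SafetyReport.riskLevel. */
enum RiskLevel {
    LOW, MEDIUM, HIGH;

    /**
     * Parses model output leniently (case and surrounding whitespace are ignored).
     * Assumption: anything unrecognised is escalated to HIGH so a person looks at it.
     */
    static RiskLevel fromModelText(String text) {
        if (text == null) {
            return HIGH;
        }
        String cleaned = text.trim().toUpperCase(Locale.ROOT);
        for (RiskLevel level : values()) {
            if (cleaned.equals(level.name())) {
                return level;
            }
        }
        return HIGH;
    }
}
```

The record could then store a RiskLevel instead of a String, which removes the need for the LOW / MEDIUM / HIGH comment on that field.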


Limitations to Know

Limitation                   | Detail
Model must be vision-capable | llava, llama3.2-vision; not all models accept images
Image size                   | Keep under 5 MB for reasonable speed
Accuracy                     | Vision models are good but not perfect; always have a human review HIGH-risk reports
Local only                   | Ollama vision models run locally, so image data never leaves your machine
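
Two of these limitations, the size ceiling and the human review of HIGH-risk reports, can be enforced in code rather than left to convention. The sketch below is one way to do that in plain Java. SafetyUploadPolicy is a hypothetical class, and reading "5 MB" as 5 × 1024 × 1024 bytes is an assumption.

```java
/** Hypothetical guard rails derived from the limitations listed above. */
final class SafetyUploadPolicy {

    // 5 MB ceiling from the limitations list; assumption: binary megabytes.
    static final long MAX_IMAGE_BYTES = 5L * 1024 * 1024;

    private SafetyUploadPolicy() {}

    /** True when the upload is non-empty and strictly under the size limit. */
    static boolean isAcceptableSize(long sizeInBytes) {
        return sizeInBytes > 0 && sizeInBytes < MAX_IMAGE_BYTES;
    }

    /** True when the report should be checked by a person before it is filed. */
    static boolean needsHumanReview(String riskLevel) {
        return riskLevel != null && riskLevel.trim().equalsIgnoreCase("HIGH");
    }
}
```

In the controller, isAcceptableSize(image.getSize()) could run before the image is sent to the model, and needsHumanReview(report.riskLevel()) could decide whether the report goes into a review queue.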

Summary

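In this chapter you pulled a vision-capable Ollama model, sent an uploaded photo alongside a text prompt through ChatClient, and defined a SafetyReport record for structured output.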


What's Next

In Chapter 13, we tackle streaming — instead of waiting for the full response, we stream tokens as they are generated, giving users the live-typing experience they expect from modern AI interfaces.

Code for this chapter: code/chapter-09-multimodality/