THE TECHNOLOGY STACK
Before we dive in, let's understand the tools we're using.
What is Ollama?
Ollama is a program that runs AI language models on your own computer. Think of it as a local server that speaks AI. Instead of sending your data to OpenAI or Google, everything stays on your machine.
Ollama runs in the background and listens for requests on port 11434. You send it a question (called a "prompt"), and it sends back an answer.
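Concretely, a request is just an HTTP POST. As a sketch, assuming the server is running and a model named "ministral" has already been pulled (both steps are covered below), asking a question looks like this:

```shell
# Ask a question via Ollama's /api/generate endpoint.
# Assumes `ollama serve` is running and the "ministral" model is pulled.
curl http://localhost:11434/api/generate -d '{
  "model": "ministral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

Setting "stream": false returns the whole answer as a single JSON object instead of a token-by-token stream, which is easier to read when testing by hand.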
Website: https://ollama.com
Installing Ollama
macOS or Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download the installer from https://ollama.com/download
After installation, start the Ollama server:
ollama serve
This runs in the foreground. Open a new terminal for other commands. You can also run it as a background service, but for learning, running it in a visible terminal helps you see what's happening.
Verifying Ollama is Running
In a new terminal, run:
curl http://localhost:11434/api/tags
If Ollama is running, you'll see a list of installed models (or an empty list if you haven't downloaded any yet). If you get "connection refused," the server isn't running.
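The raw response comes back as a single line of JSON. To pull out just the model names, you can pipe it through Python (jq works too, if you have it installed):

```shell
# List installed model names from the /api/tags response.
# Assumes the Ollama server is running on the default port.
curl -s http://localhost:11434/api/tags \
  | python3 -c 'import json, sys
for m in json.load(sys.stdin).get("models", []):
    print(m["name"])'
```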
What is a Vision Model?
A regular language model only understands text. You type words, and it responds with words.
A vision model can understand both text AND images. You can show it a picture and ask "What do you see?" and it will describe the image in words.
Under the hood, vision models are trained on millions of image-text pairs, learning to connect visual patterns with language. When you send an image, the model converts it into a numerical representation and processes it alongside your text prompt.
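In practice, "sending an image" to Ollama means base64-encoding the file and putting it in the request's "images" field. A sketch, assuming a file named photo.jpg exists, the server is running, and a vision model is already pulled (pulling is covered next):

```shell
# Sketch: send an image to a vision model via /api/generate.
# Assumes `ollama serve` is running, photo.jpg exists, and "ministral" is pulled.
IMG=$(base64 < photo.jpg | tr -d '\n')   # strip line wrapping some base64 tools add
curl http://localhost:11434/api/generate -d "{
  \"model\": \"ministral\",
  \"prompt\": \"What do you see?\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"
```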
Getting a Vision Model
Not all Ollama models can see images. You need one specifically trained for vision. Here are some options:
ollama pull ministral # Mistral's vision model (what we use)
ollama pull llava # Popular open-source option
ollama pull minicpm-v # Lightweight alternative
For this tutorial, we'll use "ministral", but any vision model will work.
The download might take a few minutes depending on your connection. Models are typically 4-14 GB in size.
Checking Your Models
See what models you have installed:
ollama list
You'll see something like:
NAME                SIZE
ministral:latest    8.6 GB
llama3:latest       4.7 GB
The vision-capable models will work with images. Regular text models will give an error if you try to send them an image.
Quick Test
Make sure everything is working:
# In one terminal
ollama serve
# In another terminal
curl http://localhost:11434/api/tags | head
If you see JSON output listing your models, you're ready to continue.
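The two-terminal check above can be wrapped into a small health-check snippet you can rerun any time (a sketch; adjust the port if you changed Ollama's default):

```shell
# Health check: is Ollama answering on its default port?
if curl -sf http://localhost:11434/api/tags > /dev/null; then
  echo "Ollama is running"
else
  echo "Ollama is not reachable on port 11434"
fi
```

The -f flag makes curl exit with an error on HTTP failures, so the check also catches a server that is up but misbehaving, not just one that is down.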