YOUR FIRST VISION REQUEST
Let's put it all together and ask a vision model to describe an image. We'll use curl, a command-line tool for making web requests.
The Ollama API
Ollama provides an HTTP API. You send a POST request to:
http://localhost:11434/api/generate
The request body is JSON containing:
model: which AI model to use
prompt: the question or instruction (text)
images: an array of Base64-encoded images
stream: whether to stream the response (we'll use false)
The JSON Format
Here's what a complete request looks like:
{
"model": "ministral",
"prompt": "What do you see?",
"images": ["base64_encoded_data_here..."],
"stream": false
}
The images field is an array (list) because you could theoretically send multiple images. For our purposes, we'll send one at a time.
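If you have jq installed (we'll use it later in this chapter anyway), you can see the array structure for yourself by letting jq build the request body. This also sidesteps shell quoting entirely. The "AAAA" placeholder stands in for real Base64 data:

```shell
# Build the same request body with jq; --arg safely injects the
# placeholder string, and [$img] wraps it in a one-element array.
jq -n --arg img "AAAA" \
  '{model: "ministral", prompt: "What do you see?", images: [$img], stream: false}'
```

Run it and you'll see the images field printed as a JSON array containing one string, exactly like the hand-written example above.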
Step 1: Encode the Image
First, encode the image and save the result in a shell variable (the perl one-liner strips the newlines that base64 wraps into its output):
IMAGE_B64=$(base64 -i yourimage.png | perl -pe 's~\s~~g')
You won't see any output, but the variable now contains the encoded image. Verify with:
echo ${#IMAGE_B64}
You should see a large number (the string length). If it's over 800,000, resize the image first with sips.
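Before pointing this at a real image, you can sanity-check the encode-and-strip step on a tiny known input. This sketch reads from stdin and strips newlines with tr, which behaves the same on macOS and Linux (unlike base64's flags, which differ between the two):

```shell
# Encode a five-byte file and check the result with ${#VAR},
# the same length check used above.
printf 'hello' > /tmp/demo.txt
B64=$(base64 < /tmp/demo.txt | tr -d '\n')
echo "$B64"      # the Base64 encoding of "hello"
echo "${#B64}"   # its length: 8 characters
```

Five input bytes become eight Base64 characters, which is why an image's encoded length is roughly a third larger than its file size.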
Step 2: Send the Request
Now we send a request to Ollama using curl:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "ministral",
"prompt": "Describe what you see in this image.",
"images": ["'"$IMAGE_B64"'"],
"stream": false
}'
Breaking down the curl command:
-s: silent mode (no progress bar)
http://localhost:11434: Ollama's address
/api/generate: the endpoint for generating responses
-H "Content-Type:...": tells the server we're sending JSON
-d '{...}': the JSON data to send
The tricky part is '"$IMAGE_B64"', which splices the shell variable into the single-quoted JSON. Reading left to right: the first ' ends the single-quoted string, "$IMAGE_B64" is an ordinary double-quoted variable expansion, and the final ' resumes the single-quoted string. The shell glues the three pieces into one argument, so the encoded data lands inside the JSON exactly where the quotes meet.
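You can watch the splice work with a harmless variable instead of a huge Base64 string:

```shell
# Three adjacent pieces: single-quoted literal, double-quoted
# expansion, single-quoted literal. The shell joins them into
# one argument before echo ever sees it.
NAME=world
echo 'before '"$NAME"' after'
# prints: before world after
```

The same mechanism is at work in the curl command; only the variable is much longer.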
Step 3: Parse the Response
The response comes back as JSON:
{
"model": "ministral",
"response": "The image shows a woman with long brown hair...",
"done": true,
"total_duration": 12345678
}
The part we care about is the "response" field. We extract it using jq:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "ministral",
"prompt": "Describe what you see in this image.",
"images": ["'"$IMAGE_B64"'"],
"stream": false
}' | jq -r '.response'
The jq -r '.response' part extracts just the response text. The -r flag gives "raw" output without quotes.
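You can test the extraction without a running server by piping a canned reply through jq. The reply below is invented, shaped like the real response shown earlier:

```shell
# A stand-in for Ollama's JSON reply; only .response survives jq.
REPLY='{"model":"ministral","response":"A cat on a windowsill.","done":true}'
echo "$REPLY" | jq -r '.response'
# prints: A cat on a windowsill.
```

Without -r you'd get the same text wrapped in double quotes, which is rarely what you want in a shell pipeline.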
If you don't have jq installed:
macOS: brew install jq
Ubuntu: sudo apt install jq
Try It Yourself
Complete sequence:
# 1. Encode an image (use any PNG you have)
IMAGE_B64=$(base64 -i yourimage.png | perl -pe 's~\s~~g')
# 2. Ask the vision model
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "ministral",
"prompt": "Describe what you see in this image in 2-3 sentences.",
"images": ["'"$IMAGE_B64"'"],
"stream": false
}' | jq -r '.response'
Try different images. Notice how the model describes various subjects, styles, and compositions.