YOUR FIRST VISION REQUEST
Let's put it all together and ask a vision model to describe an image. We'll use curl, a command-line tool for making web requests.
The Ollama API
Ollama provides an HTTP API. You send a POST request to:
http://localhost:11434/api/generate
The request body is JSON containing:
model: which AI model to use
prompt: the question or instruction (text)
images: an array of Base64-encoded images
stream: whether to stream the response (we'll use false)
The JSON Format
Here's what a complete request looks like:
{
"model": "ministral",
"prompt": "What do you see?",
"images": ["base64_encoded_data_here..."],
"stream": false
}
The images field is an array (list) because you could theoretically send multiple images. For our purposes, we'll send one at a time.
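If you have jq installed (we'll use it later in this chapter anyway), you can see the array structure for yourself by letting jq build the request body. This also sidesteps shell quoting entirely. The "AAAA" placeholder stands in for real Base64 data:

```shell
# Build the same request body with jq; --arg safely injects the
# placeholder string, and [$img] wraps it in a one-element array.
jq -n --arg img "AAAA" \
  '{model: "ministral", prompt: "What do you see?", images: [$img], stream: false}'
```

Run it and you'll see the images field printed as a JSON array containing one string, exactly like the hand-written example above.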
Step 1: Encode the Image
First, encode the image and save the result in a shell variable (the perl one-liner strips the newlines that base64 wraps into its output):
IMAGE_B64=$(base64 -i yourimage.png | perl -pe 's~\s~~g')
You won't see any output, but the variable now contains the encoded image. Verify with:
echo ${#IMAGE_B64}
You should see a large number (the string length). If it's over 800,000, resize the image first with sips.
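Before pointing this at a real image, you can sanity-check the encode-and-strip step on a tiny known input. This sketch reads from stdin and strips newlines with tr, which behaves the same on macOS and Linux (unlike base64's flags, which differ between the two):

```shell
# Encode a five-byte file and check the result with ${#VAR},
# the same length check used above.
printf 'hello' > /tmp/demo.txt
B64=$(base64 < /tmp/demo.txt | tr -d '\n')
echo "$B64"      # the Base64 encoding of "hello"
echo "${#B64}"   # its length: 8 characters
```

Five input bytes become eight Base64 characters, which is why an image's encoded length is roughly a third larger than its file size.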
Step 2: Send the Request
Now we send a request to Ollama using curl:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "ministral",
"prompt": "Describe what you see in this image.",
"images": ["'"$IMAGE_B64"'"],
"stream": false
}'
Breaking down the curl command:
-s: silent mode (no progress bar)
http://localhost:11434: Ollama's address
/api/generate: the endpoint for generating responses
-H "Content-Type:...": tells the server we're sending JSON
-d '{...}': the JSON data to send
The tricky part is '"$IMAGE_B64"', which splices the shell variable into the single-quoted JSON. Reading left to right: the first ' ends the single-quoted string, "$IMAGE_B64" is an ordinary double-quoted variable expansion, and the final ' resumes the single-quoted string. The shell glues the three pieces into one argument, so the encoded data lands inside the JSON exactly where the quotes meet.
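You can watch the splice work with a harmless variable instead of a huge Base64 string:

```shell
# Three adjacent pieces: single-quoted literal, double-quoted
# expansion, single-quoted literal. The shell joins them into
# one argument before echo ever sees it.
NAME=world
echo 'before '"$NAME"' after'
# prints: before world after
```

The same mechanism is at work in the curl command; only the variable is much longer.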
Step 3: Parse the Response
The response comes back as JSON:
{
"model": "ministral",
"response": "The image shows a woman with long brown hair...",
"done": true,
"total_duration": 12345678
}
The part we care about is the "response" field. We extract it using jq:
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "ministral",
"prompt": "Describe what you see in this image.",
"images": ["'"$IMAGE_B64"'"],
"stream": false
}' | jq -r '.response'
The jq -r '.response' part extracts just the response text. The -r flag gives "raw" output without quotes.
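You can test the extraction without a running server by piping a canned reply through jq. The reply below is invented, shaped like the real response shown earlier:

```shell
# A stand-in for Ollama's JSON reply; only .response survives jq.
REPLY='{"model":"ministral","response":"A cat on a windowsill.","done":true}'
echo "$REPLY" | jq -r '.response'
# prints: A cat on a windowsill.
```

Without -r you'd get the same text wrapped in double quotes, which is rarely what you want in a shell pipeline.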
If you don't have jq installed:
macOS: brew install jq
Ubuntu: sudo apt install jq
Try It Yourself
Complete sequence:
# 1. Encode an image (use any PNG you have)
IMAGE_B64=$(base64 -i yourimage.png | perl -pe 's~\s~~g')
# 2. Ask the vision model
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "ministral",
"prompt": "Describe what you see in this image in 2-3 sentences.",
"images": ["'"$IMAGE_B64"'"],
"stream": false
}' | jq -r '.response'
Try different images. Notice how the model describes various subjects, styles, and compositions.