STRUCTURED PROMPTING
Now here's the challenge. You can ask a vision model to describe an image, but how do you automate decisions based on the response?
The Problem with Vague Prompts
If you ask a vision model "Is this appropriate?" you'll get something like:
"The image appears to be generally appropriate, though it depends on
the context. There are no explicit elements visible, however some
viewers might consider the pose to be somewhat suggestive. In a
professional setting, this might not be ideal, but for artistic
purposes it could be acceptable..."
That's useless for automation. We can't easily parse "it depends" into a yes/no decision. We need structured output.
The Solution: Force Structure
Instead of vague questions, ask specific, concrete questions that require YES or NO answers. Then ask the model to format its response in a predictable way.
Bad prompt:
"Is this image appropriate?"
Good prompt:
"Answer YES or NO only.
NUDE_CHEST: Is bare chest visible?
NUDE_CHEST: YES or NO"
Now the response looks like:
"NUDE_CHEST: NO"
That's something we can parse programmatically.
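As a minimal sketch of what "parse programmatically" could mean here, the Python snippet below pulls a single labeled answer out of a response. The function name `parse_answer` and the regular expression are illustrative assumptions, not part of any library; the idea is only that a `KEY: YES` or `KEY: NO` line maps cleanly onto a boolean, while anything else maps onto "no answer".

```python
import re
from typing import Optional

# Matches a line such as "NUDE_CHEST: NO"; tolerant of case and stray whitespace.
ANSWER_RE = re.compile(r"^\s*([A-Z_]+)\s*:\s*(YES|NO)\b", re.IGNORECASE | re.MULTILINE)

def parse_answer(response: str, key: str) -> Optional[bool]:
    """Return True for YES, False for NO, or None if the key was not answered."""
    for found_key, value in ANSWER_RE.findall(response):
        if found_key.upper() == key:
            return value.upper() == "YES"
    return None  # free-form text such as "it depends" ends up here

if __name__ == "__main__":
    print(parse_answer("NUDE_CHEST: NO", "NUDE_CHEST"))
    print(parse_answer("It depends on the context.", "NUDE_CHEST"))
```

Returning `None` instead of guessing keeps the "model did not follow the format" case separate from a real NO.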
A Structured Description Prompt
First, let's see how to get organized descriptions. Instead of free-form text, we ask for specific categories:
Describe this image. Use plain text only. DO NOT use asterisks, bold,
or any markdown formatting.
List these aspects with a dash prefix:
- SUBJECT:
- SETTING:
- COLORS:
- CLOTHING:
- OBJECTS:
- MOOD:
Keep each line brief. No sub-bullets.
The response looks like:
- SUBJECT: A young woman with long dark hair
- SETTING: Urban rooftop at sunset
- COLORS: Orange, purple, and deep blue tones
- CLOTHING: Casual hoodie and jeans
- OBJECTS: Coffee mug in hand
- MOOD: Contemplative and peaceful
That's structured data we can parse. Each line starts with a known key.
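One way to turn a response like that into a lookup table is sketched below in Python. `parse_description` is a hypothetical helper, and it assumes the model kept to the dash-prefixed `- KEY: value` layout the prompt asked for; lines that don't fit are skipped rather than guessed at.

```python
def parse_description(response: str) -> dict:
    """Collect "- KEY: value" lines into a {KEY: value} dictionary."""
    fields = {}
    for line in response.splitlines():
        line = line.strip()
        if not line.startswith("-"):
            continue  # ignore anything that is not a dash-prefixed line
        key, sep, value = line.lstrip("- ").partition(":")
        if sep and key.strip():
            fields[key.strip().upper()] = value.strip()
    return fields

if __name__ == "__main__":
    sample = "- SUBJECT: A young woman with long dark hair\n- MOOD: Contemplative and peaceful"
    print(parse_description(sample))
```

Splitting on the first colon only means a value that itself contains a colon is kept whole.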
A Moderation Prompt
For content moderation, we need YES/NO answers we can act on:
Answer YES or NO only for each question.
1. NUDE_CHEST: Is bare chest, breasts, or nipples visible?
2. NUDE_LOWER: Is bare buttocks or genitals visible?
3. WEAPON: Is the person holding a weapon (gun, knife, etc)?
NUDE_CHEST: YES or NO
NUDE_LOWER: YES or NO
WEAPON: YES or NO
Notice the structure:
- Clear instructions ("Answer YES or NO only")
- Numbered questions for clarity
- Labeled output format (KEY: YES or NO)
- Concrete things to look for
The response will look like:
NUDE_CHEST: NO
NUDE_LOWER: NO
WEAPON: YES
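To act on a response like this, the labeled lines have to become values a program can branch on. The Python sketch below is one possible approach, not a prescribed one: `parse_moderation` and `decide` are hypothetical names, and the three outcomes ("flag", "review", "allow") are an assumed policy chosen for illustration. A key the model failed to answer is kept as `None` so it can be routed to review instead of being silently treated as NO.

```python
EXPECTED_KEYS = ("NUDE_CHEST", "NUDE_LOWER", "WEAPON")

def parse_moderation(response: str) -> dict:
    """Map each expected key to True (YES), False (NO), or None (missing or unclear)."""
    results = {key: None for key in EXPECTED_KEYS}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        # Tolerate an echoed question number such as "1. NUDE_CHEST".
        key = key.strip().lstrip("0123456789. ").upper()
        value = value.strip().rstrip(".").upper()
        if sep and key in results and value in ("YES", "NO"):
            results[key] = value == "YES"
    return results

def decide(results: dict) -> str:
    """Assumed policy: any YES flags the image; any unanswered key needs a human."""
    if any(v is True for v in results.values()):
        return "flag"
    if any(v is None for v in results.values()):
        return "review"
    return "allow"

if __name__ == "__main__":
    reply = "NUDE_CHEST: NO\nNUDE_LOWER: NO\nWEAPON: YES"
    print(decide(parse_moderation(reply)))
```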
Why Concrete Questions Work
Vision models, especially smaller ones, tend to do better when asked to confirm that one specific thing is present than when asked to rule out a broad category.
Good: "Is bare chest visible?" (one concrete thing to look for)
Bad: "Is there any nudity?" (vague category; answering NO means ruling out everything it could cover)
It's easier for the model to confirm "yes, I see X" than to exhaustively verify "no, there is nothing inappropriate anywhere."
Let's Test It
# Base64-encode the image and strip all whitespace so it fits inside a JSON string
IMAGE_B64=$(base64 < yourimage.png | tr -d '[:space:]')
# Send the prompt and the image to the local model server; print only the reply text
curl -s http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ministral",
    "prompt": "Answer YES or NO only for each question.\n1. NUDE_CHEST: Is bare chest, breasts, or nipples visible?\n2. NUDE_LOWER: Is bare buttocks or genitals visible?\n3. WEAPON: Is the person holding a weapon (gun, knife, etc)?\nNUDE_CHEST: YES or NO\nNUDE_LOWER: YES or NO\nWEAPON: YES or NO",
    "images": ["'"$IMAGE_B64"'"],
    "stream": false
  }' | jq -r '.response'
Try this on several images. Check whether each response follows the structured format we requested, and note any that don't.
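For testing many images it may be more convenient to make the same request from a script. The sketch below mirrors the curl call in Python using only the standard library; the endpoint, model name, and JSON fields are copied from the command above, while `build_payload` and `ask_model` are hypothetical helper names. It assumes the server is running locally and returns a JSON object with a `response` field, as the `jq` filter above implies.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl call

def build_payload(image_bytes: bytes, prompt: str, model: str = "ministral") -> dict:
    """Build the request body: prompt text plus one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ask_model(image_path: str, prompt: str) -> str:
    """POST the image and prompt to the server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),  # a request with data is sent as POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["response"]
```

A call such as `ask_model("yourimage.png", prompt)` would then return the same text that `jq -r '.response'` prints.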
Tuning Your Prompts
The prompt is the programmable part. By changing the prompt, you change what the model looks for without changing any code.
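One way to make that concrete is to keep the questions in a table and generate the prompt from it, so adding or rewording a check is a data change rather than a logic change. This is a sketch under that assumption; `CHECKS` and `build_prompt` are illustrative names, and the wording is taken from the moderation prompt shown earlier.

```python
# Each entry: output key -> the concrete question the model should answer.
CHECKS = {
    "NUDE_CHEST": "Is bare chest, breasts, or nipples visible?",
    "NUDE_LOWER": "Is bare buttocks or genitals visible?",
    "WEAPON": "Is the person holding a weapon (gun, knife, etc)?",
}

def build_prompt(checks: dict) -> str:
    """Assemble numbered questions followed by the labeled answer template."""
    lines = ["Answer YES or NO only for each question."]
    for number, (key, question) in enumerate(checks.items(), start=1):
        lines.append(f"{number}. {key}: {question}")
    for key in checks:
        lines.append(f"{key}: YES or NO")
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_prompt(CHECKS))
```

Because the same keys drive both the questions and the answer template, the two cannot drift apart when a check is added.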
If you're getting inconsistent results:
- Make questions more specific
- Use simpler language
- Reduce the number of questions
- Lower the temperature setting (more on this later)
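As a brief preview of the last item, the sketch below adds a temperature setting to a request body. It assumes the server behind the endpoint used above is Ollama, whose generate API accepts an `options` object containing `temperature`; `with_temperature` is a hypothetical helper name, and 0.0 is just an example value at the deterministic end of the range.

```python
def with_temperature(payload: dict, temperature: float = 0.0) -> dict:
    """Return a copy of the request body with options.temperature set."""
    options = {**payload.get("options", {}), "temperature": temperature}
    return {**payload, "options": options}  # the original payload is left unchanged

if __name__ == "__main__":
    base = {"model": "ministral", "prompt": "WEAPON: YES or NO", "stream": False}
    print(with_temperature(base))
```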