STRUCTURED PROMPTING

Now here's the challenge. You can ask a vision model to describe an image, but how do you automate decisions based on the response?

The Problem with Vague Prompts

If you ask a vision model "Is this appropriate?" you'll get something like:

"The image appears to be generally appropriate, though it depends on
 the context. There are no explicit elements visible, however some
 viewers might consider the pose to be somewhat suggestive. In a
 professional setting, this might not be ideal, but for artistic
 purposes it could be acceptable..."

That's useless for automation. We can't easily parse "it depends" into a yes/no decision. We need structured output.

The Solution: Force Structure

Instead of vague questions, ask specific, concrete questions that require YES or NO answers. Then ask the model to format its response in a predictable way.

Bad prompt:

"Is this image appropriate?"

Good prompt:

"Answer YES or NO only.
 NUDE_CHEST: Is bare chest visible?
 NUDE_CHEST: YES or NO"

Now the response looks like:

"NUDE_CHEST: NO"

That's something we can parse programmatically.

A Structured Description Prompt

First, let's see how to get organized descriptions. Instead of free-form text, we ask for specific categories:

Describe this image. Use plain text only. DO NOT use asterisks, bold,
or any markdown formatting.
List these aspects with a dash prefix:
- SUBJECT:
- SETTING:
- COLORS:
- CLOTHING:
- OBJECTS:
- MOOD:
Keep each line brief. No sub-bullets.

The response looks like:

SUBJECT: A young woman with long dark hair
SETTING: Urban rooftop at sunset
COLORS: Orange, purple, and deep blue tones
CLOTHING: Casual hoodie and jeans
OBJECTS: Coffee mug in hand
MOOD: Contemplative and peaceful

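That's structured data we can parse. Each line starts with a known key. Note that the model dropped the dash prefix the prompt asked for, so a parser should accept lines with or without it.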
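To make "we can parse" concrete, here is a minimal sketch of a parser for the description format. The names parse_description, EXPECTED_KEYS, and LINE_RE are our own illustrations, not part of any library. It assumes one KEY: value pair per line and tolerates an optional leading dash.

```python
import re

# Keys requested in the description prompt above.
EXPECTED_KEYS = ("SUBJECT", "SETTING", "COLORS", "CLOTHING", "OBJECTS", "MOOD")

# Optional dash or bullet, then an uppercase key, a colon, and the value.
LINE_RE = re.compile(r"^\s*[-*]?\s*([A-Z_]+)\s*:\s*(.*)$")


def parse_description(text):
    """Return {key: value} for every expected key found in the response."""
    fields = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        # Ignore chatter and any key we did not ask for.
        if match and match.group(1) in EXPECTED_KEYS:
            fields[match.group(1)] = match.group(2).strip()
    return fields
```

Keys the model leaves out are simply absent from the result, so the caller can decide whether a partial description is good enough.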

A Moderation Prompt

For content moderation, we need YES/NO answers we can act on:

Answer YES or NO only for each question.
1. NUDE_CHEST: Is bare chest, breasts, or nipples visible?
2. NUDE_LOWER: Is bare buttocks or genitals visible?
3. WEAPON: Is the person holding a weapon (gun, knife, etc)?
NUDE_CHEST: YES or NO
NUDE_LOWER: YES or NO
WEAPON: YES or NO

Notice the structure:

- Clear instructions ("Answer YES or NO only")
- Numbered questions for clarity
- Labeled output format (KEY: YES or NO)
- Concrete things to look for

The response will look like:

NUDE_CHEST: NO
NUDE_LOWER: NO
WEAPON: YES
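Here is one way the moderation response might be turned into a decision. This is a sketch under our own assumptions: parse_moderation and decide are hypothetical helpers, and the block/review/allow policy is only an example, not a recommendation. A line counts only if it ends right after YES or NO, so an echoed template line such as "WEAPON: YES or NO" is treated as unanswered.

```python
import re

MODERATION_KEYS = ("NUDE_CHEST", "NUDE_LOWER", "WEAPON")

# Optional "1. " numbering, a key, a colon, then exactly YES or NO.
ANSWER_RE = re.compile(
    r"^\s*(?:\d+\.\s*)?([A-Z_]+)\s*:\s*(YES|NO)\s*\.?\s*$",
    re.IGNORECASE,
)


def parse_moderation(text):
    """Map each key to True (YES), False (NO), or None (no usable answer)."""
    flags = dict.fromkeys(MODERATION_KEYS)  # every key starts as None
    for line in text.splitlines():
        match = ANSWER_RE.match(line)
        if not match:
            continue
        key = match.group(1).upper()
        if key in flags:
            flags[key] = match.group(2).upper() == "YES"
    return flags


def decide(flags):
    """Example policy (an assumption): any YES blocks, any gap needs review."""
    if any(value is True for value in flags.values()):
        return "block"
    if any(value is None for value in flags.values()):
        return "review"
    return "allow"
```

Keeping "no answer" separate from "NO" matters: a response the parser cannot read should not silently pass moderation.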

Why Concrete Questions Work

Vision models (especially smaller ones) do better when you ask them to confirm one specific, visible thing than when you ask an open-ended question that covers a whole category.

Good:  "Is bare chest visible?"        (specific, confirms presence)
Bad:   "Is there any nudity?"          (vague, must rule out everything)

It's easier for the model to confirm "yes, I see X" than to exhaustively verify "no, there is nothing inappropriate anywhere."

Let's Test It

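# Base64-encode the image and strip the line breaks some base64 tools insert
IMAGE_B64=$(base64 -i yourimage.png | tr -d '\n')

# Send the prompt and image to the local server; jq prints only the reply text
curl -s http://localhost:11434/api/generate \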
  -H "Content-Type: application/json" \
  -d '{
    "model": "ministral",
    "prompt": "Answer YES or NO only for each question.\n1. NUDE_CHEST: Is bare chest, breasts, or nipples visible?\n2. NUDE_LOWER: Is bare buttocks or genitals visible?\n3. WEAPON: Is the person holding a weapon (gun, knife, etc)?\nNUDE_CHEST: YES or NO\nNUDE_LOWER: YES or NO\nWEAPON: YES or NO",
    "images": ["'"$IMAGE_B64"'"],
    "stream": false
  }' | jq -r '.response'

Try this on several images. Notice how the responses follow the structured format we requested.
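To run the same request over many images, the curl call can be mirrored in Python. This is a sketch that assumes the same local endpoint, model name, and request fields as the command above. build_payload and ask_model are names we made up, and only the standard library is used.

```python
import base64
import json
import urllib.request

# Same endpoint as the curl command above.
API_URL = "http://localhost:11434/api/generate"

MODERATION_PROMPT = (
    "Answer YES or NO only for each question.\n"
    "1. NUDE_CHEST: Is bare chest, breasts, or nipples visible?\n"
    "2. NUDE_LOWER: Is bare buttocks or genitals visible?\n"
    "3. WEAPON: Is the person holding a weapon (gun, knife, etc)?\n"
    "NUDE_CHEST: YES or NO\n"
    "NUDE_LOWER: YES or NO\n"
    "WEAPON: YES or NO"
)


def build_payload(image_bytes, prompt=MODERATION_PROMPT, model="ministral"):
    """Build the same JSON body the curl example sends."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_model(image_path):
    """POST one image to the server and return the model's text reply."""
    with open(image_path, "rb") as handle:
        payload = build_payload(handle.read())
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["response"]
```

Calling ask_model("yourimage.png") in a loop over a folder of test images makes it easy to compare responses side by side.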

Tuning Your Prompts

The prompt is the programmable part. By changing the prompt, you change what the model looks for without changing any code.
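One way to keep the prompt "programmable" is to generate it from a list of questions, so adding a check means adding one entry. A minimal sketch, where QUESTIONS and build_prompt are illustrative names of our own:

```python
# Each entry is (output key, concrete question). Edit this list to change
# what the model looks for.
QUESTIONS = [
    ("NUDE_CHEST", "Is bare chest, breasts, or nipples visible?"),
    ("NUDE_LOWER", "Is bare buttocks or genitals visible?"),
    ("WEAPON", "Is the person holding a weapon (gun, knife, etc)?"),
]


def build_prompt(questions):
    """Assemble the instruction, numbered questions, and labeled answer slots."""
    lines = ["Answer YES or NO only for each question."]
    for number, (key, question) in enumerate(questions, start=1):
        lines.append(f"{number}. {key}: {question}")
    for key, _question in questions:
        lines.append(f"{key}: YES or NO")
    return "\n".join(lines)
```

With the list above, build_prompt(QUESTIONS) reproduces the moderation prompt used earlier in this section.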

If you're getting inconsistent results:

- Make questions more specific
- Use simpler language
- Reduce the number of questions
- Lower the temperature setting (more on this later)
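As a preview of the last tip: if the server at localhost:11434 is Ollama (an assumption based on the endpoint used above), its generate API accepts an "options" object with a "temperature" field. The helper below is a hypothetical sketch that adds that field to a request payload without modifying the original.

```python
import copy


def with_temperature(payload, temperature=0.0):
    """Return a copy of a request payload with the sampling temperature set."""
    updated = copy.deepcopy(payload)
    # Keep any other options already present and only set temperature.
    updated.setdefault("options", {})["temperature"] = temperature
    return updated
```

A temperature of 0 usually makes answers more repeatable, which suits YES/NO classification better than creative description.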