BUILDING YOUR OWN AUTOMATION
What we've shown is the foundation. You now have all the pieces:
- How to encode an image for the API
- How to send a request to Ollama
- How to craft prompts that give structured output
- How to parse the response with regex
- How to make a pass/fail decision
From here, building a full automation pipeline is up to you.
Batch Processing
Loop through a folder of images and check each one:
for img in *.png; do
  echo "Checking $img..."

  # Base64-encode the image and strip whitespace so the JSON stays valid.
  # (-i names the input file on macOS; on Linux use: base64 -w 0 "$img")
  IMAGE_B64=$(base64 -i "$img" | perl -pe 's~\s~~g')

  RESPONSE=$(curl -s http://localhost:11434/api/generate \
    -H "Content-Type: application/json" \
    -d '{
      "model": "ministral",
      "prompt": "Answer YES or NO only.\n\nNUDE_CHEST: YES or NO",
      "images": ["'"$IMAGE_B64"'"],
      "stream": false
    }' | jq -r '.response')

  # Pass/fail decision: look for the label in the model's response.
  if echo "$RESPONSE" | grep -qi "NUDE_CHEST: YES"; then
    echo "  REJECT"
  else
    echo "  PASS"
  fi
done
Moving Rejected Files
Don't delete rejects. Move them to a separate folder for review:
mkdir -p ./rejects
for img in *.png; do
  # ... run your check ...
  if echo "$RESPONSE" | grep -qi "NUDE_CHEST: YES"; then
    mv "$img" ./rejects/
    echo "Moved to rejects: $img"
  fi
done
Logging
Keep a record of what was checked and why:
LOGFILE="moderation.log"
for img in images/*.png; do
  # ... run your check ...
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

  if echo "$RESPONSE" | grep -qi "NUDE_CHEST: YES"; then
    RESULT="REJECT"
    REASON="Nudity detected"
  else
    RESULT="PASS"
    REASON=""
  fi

  echo "$TIMESTAMP,$img,$RESULT,$REASON" >> "$LOGFILE"
done
This creates a CSV-style log you can review later.
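Because the log is plain CSV, standard tools can summarize it. A small sketch, assuming the `timestamp,file,result,reason` format above (the sample entries are made up so the commands are runnable on their own):

```shell
LOGFILE="moderation.log"

# Sample log entries in the format the loop above produces.
cat > "$LOGFILE" <<'EOF'
2025-01-01 10:00:00,a.png,PASS,
2025-01-01 10:00:05,b.png,REJECT,Nudity detected
2025-01-01 10:00:11,c.png,PASS,
EOF

# Count total checks and rejections.
TOTAL=$(wc -l < "$LOGFILE")
REJECTS=$(grep -c ',REJECT,' "$LOGFILE")
echo "Checked: $TOTAL  Rejected: $REJECTS"

# List only the rejected files (second CSV column).
grep ',REJECT,' "$LOGFILE" | cut -d',' -f2
```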
Multiple Checks
You could run different prompts for different concerns:
- Content moderation (what we showed)
- Quality assessment (check for artifacts, extra fingers)
- Style matching (does it match the requested style?)
- Text detection (is there readable text in the image?)
Each check is just a different prompt with different parsing logic.
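One way to structure that is a helper that looks up the prompt for a named check, plus a shared parser for the YES/NO label. A sketch with hypothetical prompts and label names (only the moderation prompt comes from earlier; the others are illustrative), shown parsing a canned response rather than calling the API:

```shell
# Return the prompt for a named check. The quality/text prompts are made up.
prompt_for_check() {
  case "$1" in
    moderation) printf '%s' "Answer YES or NO only.\n\nNUDE_CHEST: YES or NO" ;;
    quality)    printf '%s' "Answer YES or NO only.\n\nARTIFACTS: YES or NO" ;;
    text)       printf '%s' "Answer YES or NO only.\n\nREADABLE_TEXT: YES or NO" ;;
    *)          return 1 ;;
  esac
}

# Shared parsing logic: did the given label come back YES?
label_is_yes() {
  # $1 = label (e.g. ARTIFACTS), $2 = raw model response
  echo "$2" | grep -qi "$1: YES"
}

# Example with a canned response instead of a live API call.
RESPONSE="ARTIFACTS: YES"
if label_is_yes "ARTIFACTS" "$RESPONSE"; then
  echo "quality check failed"
fi
```

Each new check then only needs a new case branch and a label, while the encode/send/decide plumbing stays shared.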
Model Settings
You can add an "options" block to your JSON request to control behavior. The temperature setting controls randomness:
"options": {
  "temperature": 0.1
}
Temperature values:
- 0.1: more predictable, consistent
- 0.7: more creative, varied
- 1.0: most random
For moderation, lower is better. We want consistent yes/no answers, not creative interpretation.
Other useful options:
- "num_ctx": 4096 (context window size)
- "num_predict": 100 (maximum tokens to generate)
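Putting it together, the options block nests alongside the other fields in the request body. A sketch that only builds and prints the payload, with a placeholder standing in for the real base64 data (no API call is made):

```shell
IMAGE_B64="PLACEHOLDER"  # would come from the base64 step shown earlier

# Request body with an options block added for deterministic answers.
PAYLOAD='{
  "model": "ministral",
  "prompt": "Answer YES or NO only.\n\nNUDE_CHEST: YES or NO",
  "images": ["'"$IMAGE_B64"'"],
  "stream": false,
  "options": {
    "temperature": 0.1,
    "num_predict": 100
  }
}'

echo "$PAYLOAD"
```

This is the same payload as before; only the "options" object is new, and it can be passed to curl with `-d "$PAYLOAD"` exactly as in the batch loop.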
Performance Considerations
Vision models are slower than text-only models. A single image analysis might take 5-30 seconds depending on your hardware and model size.
For batch processing hundreds of images:
- Process during off-hours
- Consider a smaller/faster model for initial screening
- Use a more thorough model for borderline cases
- Cache results (don't re-check unchanged images)
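The caching idea can be as simple as recording a checksum per file and skipping any file whose checksum is already on record. A minimal sketch (the cache format and file names are my own, and it assumes `sha256sum` is available, as on Linux; macOS would need `shasum -a 256`):

```shell
CACHE="checked.cache"
touch "$CACHE"

check_image() {
  # Stand-in for the real moderation call shown earlier.
  echo "checking $1"
}

maybe_check() {
  local img="$1"
  local hash
  hash=$(sha256sum "$img" | cut -d' ' -f1)
  if grep -q "^$hash$" "$CACHE"; then
    echo "cached, skipping $img"
  else
    check_image "$img"
    echo "$hash" >> "$CACHE"
  fi
}

# Demo: the second call on an unchanged file hits the cache.
echo "demo" > sample.png
maybe_check sample.png
maybe_check sample.png
```

Hashing the content rather than the file name means a regenerated image with the same name is still re-checked.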
Other Use Cases
The pattern (encode, prompt, parse, act) works for many tasks:
Accessibility: Generate alt-text descriptions for images
Organization: Auto-tag photos by content
Quality Control: Detect blurry or corrupted images
Document Processing: Extract text from screenshots
Security: Detect sensitive information in images
The Key Insight
Once you understand the pattern, you can adapt it to almost any image analysis task. The prompt is what you change. The plumbing stays the same.
Build something. See what works. Iterate.