Techalicious Academy / 2026-02-11-openwebui


VISION MODELS

One of the coolest things about running AI locally is using vision models. You upload an image and the model can see it. Describe it. Read text from it. Answer questions about it. All on your own hardware.

What Vision Models Can Do

A regular text model only understands words. A vision model understands words AND images. You can show it a photo, a screenshot, a chart, a handwritten note, or a diagram and have a conversation about what it sees.

Some examples of what works well:

Upload a screenshot of an error message. Ask it to explain the
error and suggest fixes.

Upload a photo of a whiteboard from a meeting. Ask it to
transcribe and organize the notes.

Upload a chart or graph. Ask it to describe the trends.

Upload a photo of handwritten text. Ask it to transcribe it.

Upload a product photo. Ask it to write a description for a
listing.

Upload a restaurant menu in a foreign language. Ask it to
translate.

Our Vision Models

We pulled two vision-capable models earlier:

Qwen3-VL (8B or 4B):

The dedicated vision specialist. Excellent OCR in 32 languages,
strong spatial reasoning, good at reading charts and diagrams.
Can even process video frames if you extract them. The 8B
version is the sweet spot. The 4B version is lighter but still
capable.

Mistral Small 3.2:

A 24B model that does both text and vision. Not as specialized
as Qwen3-VL for pure vision tasks, but very capable and you
don't have to switch models when you go from text chat to image
analysis.

Using Vision in OpenWebUI

Make sure you have a vision model selected in the model dropdown at the top.

  1. Click the attachment icon next to the message input (it looks like a paperclip or a + sign)
  2. Select an image file from your computer. Supported formats: PNG, JPG, JPEG, GIF, WebP
  3. The image appears as a thumbnail in the message area
  4. Type your question about the image
  5. Hit Enter

The model receives both your text and the image, then responds based on what it sees.
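Under the hood, the image is sent to the backend as base64-encoded data alongside your prompt. If you ever want to script the same flow against Ollama directly instead of clicking through the UI, here's a minimal sketch (assuming Ollama's default /api/generate endpoint on port 11434, and a model tag like qwen3-vl matching whatever you pulled; the helper names are ours):

```python
import base64
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_vision_payload(model, prompt, image_path):
    """Build the JSON body Ollama expects for a vision request:
    images go in as a list of base64-encoded strings."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [b64], "stream": False}

def ask_about_image(model, prompt, image_path):
    """Send one prompt + image and return the model's text reply."""
    body = json.dumps(build_vision_payload(model, prompt, image_path)).encode()
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling ask_about_image("qwen3-vl", "What do you see in this image?", "shot.png") does the same thing as the paperclip upload, which is handy for batch jobs.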

Demo: Reading a Screenshot

Try this right now. Take a screenshot of something on your screen. A web page, a piece of code, an error dialog, whatever.

On Mac: Cmd+Shift+4, then select an area

On Windows: Win+Shift+S, then select an area

Upload that screenshot to a chat with Qwen3-VL selected and ask:

What do you see in this image? Describe everything.

The model should identify text, UI elements, colors, layout, and give you a detailed description. If it's a screenshot of code, it'll likely identify the programming language and describe what the code does.

Demo: OCR (Reading Text from Images)

Qwen3-VL is particularly strong at OCR, which stands for Optical Character Recognition. That's a fancy way of saying "reading text from images."

Try uploading:

A photo of a handwritten note
A screenshot of a PDF you can't copy text from
A photo of a sign, label, or receipt

Ask:

Transcribe all the text in this image exactly as written.

Qwen3-VL handles 32 languages for OCR, so it works with text in English, French, Chinese, Arabic, Japanese, and many more.
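If you have a whole folder of receipts or scans, the same transcription prompt can be scripted in a loop against Ollama's API (a sketch, assuming the default endpoint and a qwen3-vl model tag; the function names are ours):

```python
import base64
import json
import pathlib
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint
PROMPT = "Transcribe all the text in this image exactly as written."

def ocr_payload(image_path, model="qwen3-vl"):
    """JSON body for one OCR request; Ollama takes images as base64 strings."""
    b64 = base64.b64encode(pathlib.Path(image_path).read_bytes()).decode("ascii")
    return {"model": model, "prompt": PROMPT, "images": [b64], "stream": False}

def transcribe_folder(folder):
    """Yield (filename, transcription) for each image in a folder."""
    for img in sorted(pathlib.Path(folder).iterdir()):
        if img.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue  # skip non-image files
        body = json.dumps(ocr_payload(img)).encode()
        req = request.Request(OLLAMA_URL, data=body,
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            yield img.name, json.loads(resp.read())["response"]
```

As with any OCR result from a language model, spot-check the transcriptions before trusting them.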

Demo: Analyzing Charts

Upload a chart, graph, or data visualization and try:

What trends do you see in this chart?
What are the key data points?
Summarize the information in this graph.

Vision models can identify bar charts, line graphs, pie charts, and scatter plots. They'll describe axes, labels, trends, and outliers. Not perfect, but often surprisingly good.

Setting Up Vision on Custom Models

When creating a custom model in the Workspace, there's a Vision toggle in the Capabilities section. Make sure this is enabled if your base model supports it.

You could create a custom model like:

Name:         Screenshot Analyst
Base Model:   qwen3-vl
System Prompt:

You analyze screenshots and images. When the user uploads an
image, provide a thorough description of what you see, including
all text (transcribed exactly), UI elements, layout, and any
notable details. If the image contains code, identify the
language and explain what the code does.

Vision:       Enabled

Now you have a dedicated tool for screen analysis that you can switch to whenever needed.

Limitations

Vision models are good but not perfect. A few things to be aware of:

Small text in large images can be missed or misread. If you need
precise OCR, crop the image to focus on the text area.

Complex diagrams with many overlapping elements can confuse the
model. Simpler, cleaner images work better.

The model sometimes "hallucinates" details that aren't in the
image. If accuracy matters, verify its claims against the actual
image.

Video is not directly supported in OpenWebUI. You'd need to
extract individual frames and upload them as images.

Image resolution matters. Very low-resolution images produce
worse results. If possible, upload the highest quality version
you have.
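On the video point above: frames are usually pulled out with ffmpeg before uploading. A small sketch that assembles and runs the command (assuming ffmpeg is installed; fps=1 samples one frame per second, and the naming pattern is our choice):

```python
import subprocess

def ffmpeg_frame_cmd(video_path, out_dir, fps=1):
    """Build an ffmpeg command that writes one PNG per sampled frame,
    e.g. out_dir/frame_0001.png, frame_0002.png, ..."""
    return ["ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",            # sample this many frames per second
            f"{out_dir}/frame_%04d.png"]    # numbered output files

def extract_frames(video_path, out_dir, fps=1):
    """Run ffmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(ffmpeg_frame_cmd(video_path, out_dir, fps), check=True)
```

The resulting PNGs can then be uploaded one at a time like any other image.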

Vision vs Multimodal

You'll sometimes hear the term "multimodal" in AI discussions. It just means the model can handle multiple types of input. A vision model is multimodal because it handles both text and images. Some models can also handle audio, but that's less common in local setups right now.

In OpenWebUI, multimodal practically means vision. If a model supports vision, you can send it images. If it doesn't, the image upload won't do anything useful.

Next Up

Let's talk about what to do when things go wrong. Troubleshooting common issues and keeping your setup running smoothly.