Techalicious Academy / 2026-02-11-openwebui

(Visit our meetup for more great tutorials)

THE MODELS

An AI interface without models is like a music player without songs. Ollama is our model library. Let's pull some good ones and talk about when you'd use each.

How Models Work on Your Mac

When you pull a model, Ollama downloads the weights (the "brain") to your disk. When you start chatting, it loads those weights into your Mac's unified memory. The model stays loaded until you switch to a different one or Ollama unloads it to free memory (by default, after a few minutes of inactivity).

The key constraint is RAM. The model has to fit in memory to run. Bigger models are smarter but need more RAM. Smaller models are faster but less capable. This is the fundamental trade-off of local AI.
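You can ballpark the trade-off yourself. A rough rule of thumb (my sketch, not an official formula): weights take parameter count times bytes per parameter, and Ollama typically serves models quantized to around 4 bits. Real usage is higher, because the KV cache grows with context length and the runtime adds overhead.

```python
def estimated_weights_gb(params_billions: float, bits_per_param: int = 4) -> float:
    """Rough size of a model's weights alone, assuming uniform quantization.

    Actual memory use is higher: the KV cache grows with context length,
    and the runtime adds its own overhead.
    """
    bytes_per_param = bits_per_param / 8
    # billions of params * bytes per param = gigabytes (10^9 bytes)
    return params_billions * bytes_per_param

# Qwen3 30B-A3B at 4-bit: roughly 15 GB of weights alone
print(round(estimated_weights_gb(30.5), 2))     # ~15.25
# The same model at 8-bit would need about double
print(round(estimated_weights_gb(30.5, 8), 2))  # ~30.5
```

That's why the 30B model fits comfortably on a 32GB Mac, while an 80B model pushes you into 64GB territory.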

Our Lineup Tonight

We're going to work with four models that cover different use cases. Think of them as specialists on your team.

1. Qwen3 30B-A3B: The Efficient All-Rounder

This is the star of the show. Qwen3 30B uses a trick called Mixture of Experts (MoE). It has 30.5 billion parameters total, but only activates 3.3 billion for each token it generates.

What does that mean practically? You get the intelligence of a large model with the speed and memory usage of a small one. It punches way above its weight.

ollama pull qwen3:30b-a3b

Download size:    ~19GB
RAM needed:       32GB recommended
Context window:   32K tokens
Good at:          Reasoning, code, math, creative writing, general
                  conversation. Speaks 100+ languages.

This is my go-to recommendation for anyone with a 32GB Mac. It's fast, it's capable, and it handles almost everything you throw at it.

2. Qwen3-Next: The Heavyweight

Qwen3-Next is the big sibling. 80 billion parameters total, with about 3.9 billion active per token. It supports a massive 256K context window, which means it can process entire books in a single conversation.

ollama pull qwen3-next

Download size:    ~50GB
RAM needed:       64GB minimum
Context window:   256K tokens
Good at:          Long documents, complex reasoning, detailed
                  analysis. The most capable model in our lineup.

Fair warning: this is a big model. It needs a 64GB Mac and eats about 50GB of disk space. If you've got the hardware, it's impressive. If you don't, skip this one and stick with the 30B.

3. Mistral Small 3.2: The Versatile Workhorse

Mistral Small 3.2 is a 24 billion parameter dense model from Mistral AI. "Dense" means every parameter is active every time, unlike the MoE models above.

ollama pull mistral-small3.2

Download size:    ~15GB
RAM needed:       32GB comfortable
Context window:   128K tokens
Good at:          Instruction following, multilingual tasks,
                  and here's the kicker: it can see images too.

Mistral Small 3.2 has built-in vision capabilities: you can send it a photo and ask questions about it. We'll demo this later. It's a solid choice if you want one model that does text AND vision without switching.
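Under the hood, OpenWebUI talks to Ollama's REST API, which accepts base64-encoded images alongside the prompt. Here's a sketch of what that request body looks like; the field names follow Ollama's documented `/api/generate` endpoint, but double-check against your installed version.

```python
import base64
import json

def vision_payload(model: str, prompt: str, image_path: str) -> str:
    """Build the JSON body Ollama's /api/generate endpoint expects
    for a prompt that includes an image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [image_b64],  # base64-encoded image bytes
        "stream": False,
    })

# POST this to http://localhost:11434/api/generate, e.g.:
#   requests.post("http://localhost:11434/api/generate",
#                 data=vision_payload("mistral-small3.2",
#                                     "What's in this photo?", "photo.jpg"))
```

OpenWebUI does all of this for you when you drag an image into the chat; this is just what's happening behind the scenes.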

4. Qwen3-VL: The Vision Specialist

When you need serious image understanding, Qwen3-VL is the dedicated tool. It's available in several sizes; the 8B version hits the sweet spot between capability and resource usage.

ollama pull qwen3-vl

Download size:    ~6GB (8B version)
RAM needed:       16GB comfortable
Context window:   256K tokens
Good at:          OCR in 32 languages, spatial reasoning, reading
                  charts and diagrams, analyzing screenshots. It
                  can even understand video frames.

If you want a lighter option:

ollama pull qwen3-vl:4b

That's the 4B version at just 3.3GB. Less capable but runs on practically anything.

Pulling the Models

Let's grab them. Open Terminal and run whichever models your hardware can handle:

For 16GB Macs (pick one or two):
  ollama pull qwen3-vl:4b
  ollama pull qwen3-vl

For 32GB Macs (the sweet spot):
  ollama pull qwen3:30b-a3b
  ollama pull mistral-small3.2
  ollama pull qwen3-vl

For 64GB+ Macs (go wild):
  ollama pull qwen3:30b-a3b
  ollama pull qwen3-next
  ollama pull mistral-small3.2
  ollama pull qwen3-vl

Downloads take a while. Start them now and we'll talk while they download.
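If you like, the tiers above boil down to a simple lookup. Here's a sketch (model names as used in this tutorial; the RAM thresholds are just the rough recommendations above):

```python
def models_for_ram(ram_gb: int) -> list[str]:
    """Pick a pull list matching the RAM tiers above."""
    if ram_gb >= 64:
        return ["qwen3:30b-a3b", "qwen3-next", "mistral-small3.2", "qwen3-vl"]
    if ram_gb >= 32:
        return ["qwen3:30b-a3b", "mistral-small3.2", "qwen3-vl"]
    return ["qwen3-vl:4b", "qwen3-vl"]

# Then pull each one from the command line, e.g.:
#   import subprocess
#   for model in models_for_ram(32):
#       subprocess.run(["ollama", "pull", model], check=True)
```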

Understanding Mixture of Experts

Two of our models (Qwen3 30B-A3B and Qwen3-Next) use Mixture of Experts architecture. This is worth understanding because it explains why they're so efficient.

A traditional "dense" model uses every parameter for every token. If the model has 24 billion parameters, all 24 billion activate every time it generates a token.

An MoE model has many "expert" sub-networks inside it. For each token, a router network picks which experts are relevant, and only those experts activate. The rest stay dormant.

Qwen3 30B-A3B has 128 experts but only activates 8 per token. So you're getting the knowledge of 30 billion parameters but only paying the computational cost of 3.3 billion. That's why it can run on a 32GB Mac while competing with models that need 64GB.

Think of it like a hospital. You don't need every specialist in every room for every patient. The ER triages you to the right doctors. MoE does the same thing with neural network experts.
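The routing idea can be sketched in a few lines. This is a toy illustration of top-k gating, not Qwen's actual implementation: a router scores all 128 experts for a token, keeps the 8 best, and mixes their outputs by softmaxed scores.

```python
import numpy as np

def moe_route(token_vec, router_weights, k=8):
    """Toy top-k expert routing: score every expert, keep only the k best.

    A real MoE layer then runs just those k experts' feed-forward networks
    and mixes their outputs using these gate weights.
    """
    scores = router_weights @ token_vec            # one score per expert
    top_k = np.argsort(scores)[-k:]                # indices of the k best experts
    gate = np.exp(scores[top_k] - scores[top_k].max())
    gate /= gate.sum()                             # normalized mixing weights
    return top_k, gate

rng = np.random.default_rng(0)
num_experts, dim = 128, 16
router = rng.standard_normal((num_experts, dim))
chosen, weights = moe_route(rng.standard_normal(dim), router)
print(len(chosen), round(float(weights.sum()), 6))  # 8 experts active, weights sum to 1.0
```

Only the chosen experts do any work for that token; the other 120 sit idle, which is where the compute savings come from.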

Verify Your Models

Once the downloads finish:

ollama list

You should see all your pulled models listed with their sizes. These are now available in OpenWebUI. Refresh the page and they'll appear in the model dropdown at the top of the chat interface.

Next Up

Now that we have models, let's explore the OpenWebUI interface and start actually chatting.