04-choosing-a-model.txt
From: Running OpenClaw Locally with Ollama on Apple Silicon
CHOOSING AND PULLING A MODEL
==============================
Not all 14-billion-parameter models are created equal. The model choice
matters more than you'd expect because OpenClaw relies heavily on tool
calling, also known as function calling. The agent needs to invoke shell
commands, read files, interact with APIs, and chain multiple actions
together. A model that can write nice paragraphs but can't reliably call
tools is useless for this purpose.
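Concretely, tool calling means the request describes available tools as JSON schemas and a capable model answers with a structured call rather than prose. A minimal sketch of such a request body in the OpenAI-compatible format Ollama exposes (the "run_shell" tool name and schema here are made up for illustration; OpenClaw defines its own tools):

```python
import json

# Hypothetical tool definition in the OpenAI-compatible format.
# A model with working tool support responds to this with a
# structured tool_calls entry, not a paragraph of prose.
request_body = {
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "List the files in /tmp"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "run_shell",  # illustrative name, not OpenClaw's
                "description": "Execute a shell command and return its output",
                "parameters": {
                    "type": "object",
                    "properties": {"command": {"type": "string"}},
                    "required": ["command"],
                },
            },
        }
    ],
}

payload = json.dumps(request_body)
```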
The 14B Tier, Ranked for OpenClaw
-----------------------------------
Here's what's available and how well each works with OpenClaw's agent
workflow:
qwen2.5-coder:14b
This is the top recommendation. 9.0GB download. 32K context window.
Apache 2.0 license. It has the best tool-calling support at the 14B
tier and is the most recommended model in the OpenClaw community for
local setups. It was built for code, which means it understands
structured output and function signatures natively.
qwen3:14b
Latest generation Qwen. Supports thinking mode. 128K native context
window (though you'll rarely use that much locally). Excellent
general-purpose model. Good tool calling but the coder variant is
still better for OpenClaw's structured agent workflows.
qwen2.5:14b
Strong general-purpose model with 32K context. Slightly worse at
tool calling than the coder variant, but perfectly usable. Good
choice if you want a more conversational assistant that can also
do tools.
phi4:14b
Microsoft's model. Good quality output but less tested with
OpenClaw's specific tool-calling format. You might run into
edge cases where it doesn't format tool calls correctly. Fine for
experimentation, less reliable for a demo.
deepseek-r1:14b
This is the trap. DeepSeek-R1 benchmarks incredibly well on
reasoning tasks. It looks great on paper. But it does NOT support
tool calling. At all. OpenClaw's agent workflow requires the model
to emit structured function calls, and DeepSeek-R1 simply cannot
do it. The agent will fail to execute commands, read files, or use
any tools. You'll see it describe what it WOULD do instead of
actually doing it.
Do not use DeepSeek-R1 for OpenClaw.
+----------------------------------------------------------+
| DeepSeek-R1 14B is a trap. |
| Benchmarks great. No tool calling. Useless for agents. |
| Use qwen2.5-coder:14b instead. |
+----------------------------------------------------------+
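You can spot this failure mode programmatically: a tool-capable model returns a tool_calls list in its response message, while a model without tool support only narrates in the content field. A sketch, using illustrative response shapes rather than captured output:

```python
def supports_tools(message: dict) -> bool:
    """Return True if an assistant message contains structured tool calls.

    `message` is the "message" object from a /v1/chat/completions
    response choice. A model that can't call tools leaves tool_calls
    absent or empty and just describes what it would do in `content`.
    """
    return bool(message.get("tool_calls"))

# Illustrative shapes of a working vs. non-working response:
good = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "type": "function",
        "function": {"name": "run_shell",
                     "arguments": '{"command": "ls /tmp"}'},
    }],
}
bad = {
    "role": "assistant",
    "content": "I would run `ls /tmp` to list the files in that directory.",
}
```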
Pulling Your Model
-------------------
Once you've picked a model (we're going with qwen2.5-coder:14b for
this tutorial), pull it:
ollama pull qwen2.5-coder:14b
This downloads about 9GB. How long that takes depends on your internet
connection:
100 Mbps connection: roughly 12-15 minutes
250 Mbps connection: roughly 5-6 minutes
500 Mbps connection: roughly 2-3 minutes
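Those estimates are plain bandwidth arithmetic: gigabytes times eight thousand megabits per gigabyte, divided by link speed. Real transfers run a bit slower because of protocol overhead and server throttling:

```python
def download_minutes(size_gb: float, mbps: float) -> float:
    """Idealized download time in minutes: size in gigabytes over
    link speed in megabits per second (1 GB = 8,000 megabits)."""
    return size_gb * 8000 / mbps / 60

# The 9 GB model over common home connections:
times = {mbps: round(download_minutes(9, mbps), 1) for mbps in (100, 250, 500)}
# → {100: 12.0, 250: 4.8, 500: 2.4}
```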
If you're following this for a group session, download at home the
night before. Do not attempt a 9GB download over conference Wi-Fi with
thirty other people. It will take forever and might not finish. If your
venue allows it, bring USB drives with the model files pre-loaded.
Testing The Model Interactively
---------------------------------
Once the download finishes, let's make sure it works:
ollama run qwen2.5-coder:14b --verbose "Write a Python function to reverse a string"
You should see the model generate a response. The --verbose flag shows
performance stats after the response completes, including tokens per
second. Expected performance by chip:
M1 base: 8-12 tokens per second
M2/M3 Pro: 15-22 tokens per second
M3/M4 Max: 25-40 tokens per second
Those are generation speeds. The first response takes a few extra
seconds while the model loads into GPU memory. Subsequent responses
are faster.
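Tokens per second translates directly into how long you wait for an answer. A rough estimate for a typical 400-token response, ignoring prompt processing and the one-time load delay (the speeds plugged in are the midpoints from the table above):

```python
def response_seconds(tokens: int, tok_per_sec: float) -> float:
    """Time to generate `tokens` tokens at a given generation speed.
    Ignores prompt-processing time and the one-off model-load delay."""
    return tokens / tok_per_sec

# A 400-token answer on an M1 base (~10 tok/s) vs. an M4 Max (~30 tok/s):
m1_base = response_seconds(400, 10)   # 40.0 seconds
m4_max = response_seconds(400, 30)    # ~13.3 seconds
```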
Verifying GPU Usage
--------------------
While the model is loaded, check that Metal acceleration is working:
ollama ps
This shows running models. Look at the "Processor" column. It should
say "100% GPU". If it says "100% CPU" or shows a split like "50% GPU /
50% CPU", something is wrong. The model might be too large for your
available memory and is partially offloading to CPU.
If you see CPU offloading on a 16GB machine, close browsers and
other apps to free memory, then try again.
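If you'd rather check this from a script than eyeball the table, you can pull the GPU percentage out of an "ollama ps" row. The example rows below mimic the layout described above; the exact column format can change between Ollama versions:

```python
import re

def gpu_fraction(ps_line: str) -> int:
    """Extract the GPU percentage from one `ollama ps` row.
    Returns 0 if the model is running entirely on CPU."""
    m = re.search(r"(\d+)%\s*GPU", ps_line)
    return int(m.group(1)) if m else 0

# Illustrative rows (column layout may differ in your Ollama version):
healthy = "qwen2.5-coder:14b  abc123  10 GB  100% GPU  4 minutes from now"
split   = "qwen2.5-coder:14b  abc123  10 GB  50% GPU / 50% CPU  4 minutes from now"
cpu_only = "qwen2.5-coder:14b  abc123  10 GB  100% CPU  4 minutes from now"
```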
Testing The API
-----------------
OpenClaw doesn't talk to Ollama through the chat interface. It uses
the HTTP API. Let's verify that's working too:
curl http://localhost:11434/v1/models
This should return a JSON list that includes your model. Then test a
chat completion:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-coder:14b","messages":[{"role":"user","content":"Hello!"}]}'
You should get a JSON response with the model's reply. If either of
these fails, Ollama isn't running. Start it with "ollama serve" or
launch the app.
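The same check can be scripted from Python with only the standard library. This sketch builds the exact request the curl command above sends; the sending step is left commented out so you can inspect the shape before pointing it at a running Ollama instance:

```python
import json
import urllib.request

# The chat-completion request body, identical to the curl example.
body = {
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment to actually send (requires Ollama serving on localhost:11434):
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```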
What Quantization Means
-------------------------
You might have noticed "Q4_K_M" in the model name or description.
This refers to quantization, which is the process of compressing model
weights to use fewer bits per parameter.
The original model uses 16-bit floating-point numbers: 2 bytes per
parameter. A 14B model at full precision would need 28GB just for
weights. That doesn't fit on most machines.
Q4_K_M compresses to roughly 4.5 bits per parameter. The quality loss
is surprisingly small for most tasks. You'd need a side-by-side
comparison to notice the difference.
Higher quantization (Q5_K_M, Q6_K, Q8_0) preserves more quality but
uses more memory. Lower quantization (Q3_K_M, Q2_K) saves memory but
starts to noticeably degrade output quality.
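The memory math is easy to sanity-check: weights-only size is parameter count times bits per parameter. Real downloads run somewhat larger than this estimate because embeddings, metadata, and some layers stay at higher precision:

```python
def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weights-only size in gigabytes (decimal units)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

fp16 = weights_gb(14, 16)     # 28.0 GB, matching the figure above
q4_k_m = weights_gb(14, 4.5)  # ~7.9 GB weights-only; the actual ~9 GB
                              # download includes higher-precision layers
```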
If you have 32GB+ of RAM, you might want to try Q5_K_M for slightly
better quality. On 16GB, Q4_K_M is the sweet spot between quality and
memory footprint. Q3_K_M works if you're really squeezed.
ollama pull qwen2.5-coder:14b-q5_K_M # Better quality, ~11GB
ollama pull qwen2.5-coder:14b-q3_K_M # Smaller, ~7.3GB
But for following along with this tutorial, the default
qwen2.5-coder:14b (which is Q4_K_M) is what you want.
Model working? API responding? GPU at 100%? Good. Let's install
OpenClaw.