CHOOSING AND PULLING A MODEL
Not all 14-billion-parameter models are created equal. Model choice matters more than you'd expect because OpenClaw relies heavily on tool calling, also known as function calling. The agent needs to invoke shell commands, read files, interact with APIs, and chain multiple actions together. A model that can write nice paragraphs but can't reliably call tools is useless for this purpose.
The 14B Tier, Ranked for OpenClaw
Here's what's available and how well each works with OpenClaw's agent workflow:
qwen2.5-coder:14b
This is the top recommendation. 9.0GB download. 32K context window.
Apache 2.0 license. It has the best tool-calling support at the 14B
tier and is the most recommended model in the OpenClaw community for
local setups. It was built for code, which means it understands
structured output and function signatures natively.
qwen3:14b
Latest generation Qwen. Supports thinking mode. 128K native context
window (though you'll rarely use that much locally). Excellent
general-purpose model. Good tool calling but the coder variant is
still better for OpenClaw's structured agent workflows.
qwen2.5:14b
Strong general-purpose model with 32K context. Slightly worse at
tool calling than the coder variant, but perfectly usable. Good
choice if you want a more conversational assistant that can still
use tools.
phi4:14b
Microsoft's model. Good quality output but less tested with
OpenClaw's specific tool-calling format. You might run into
edge cases where it doesn't format tool calls correctly. Fine for
experimentation, less reliable for a demo.
deepseek-r1:14b
This is the trap. DeepSeek-R1 benchmarks incredibly well on
reasoning tasks. It looks great on paper. But it does NOT support
tool calling. At all. OpenClaw's agent workflow requires the model
to emit structured function calls, and DeepSeek-R1 simply cannot
do it. The agent will fail to execute commands, read files, or use
any tools. You'll see it describe what it WOULD do instead of
actually doing it.
Do not use DeepSeek-R1 for OpenClaw.
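You can check a model's advertised capabilities before committing to a multi-gigabyte download. Here's a minimal sketch, assuming a recent Ollama whose "ollama show" output includes a Capabilities section listing "tools" (the exact output format varies by Ollama version, so the grep is an assumption):

```shell
# has_tools: scan `ollama show` output for a "tools" capability entry.
# Assumption: recent Ollama prints a Capabilities section that lists "tools".
has_tools() {
  grep -qiw "tools"
}

if ollama show qwen2.5-coder:14b | has_tools; then
  echo "tool calling: supported"
else
  echo "tool calling: not advertised (avoid for OpenClaw)"
fi
```

Running the same check against a model you're unsure about is a quick way to catch a DeepSeek-R1-style surprise before you've spent 20 minutes downloading it.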
Pulling Your Model
Once you've picked a model (we're going with qwen2.5-coder:14b for this tutorial), pull it:
ollama pull qwen2.5-coder:14b
This downloads about 9GB. How long that takes depends on your internet connection:
100 Mbps connection: roughly 12-15 minutes
250 Mbps connection: roughly 5-6 minutes
500 Mbps connection: roughly 2-3 minutes
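Those estimates are just the 9GB payload divided by your link speed. A quick sketch of the arithmetic (the estimate_minutes helper is invented for illustration):

```shell
# estimate_minutes GB MBPS: size in GB * 8000 megabits, divided by
# link speed, converted to minutes. Real downloads run below line
# rate, so treat the result as a lower bound.
estimate_minutes() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.0f\n", gb * 8000 / mbps / 60 }'
}

estimate_minutes 9 100   # ~12 minutes at 100 Mbps
estimate_minutes 9 250   # ~5 minutes at 250 Mbps
```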
If you're following this for a group session, download at home the night before. Do not attempt a 9GB download over conference Wi-Fi with thirty other people. It will take forever and might not finish. If your venue allows it, bring USB drives with the model files pre-loaded.
Testing The Model Interactively
Once the download finishes, let's make sure it works:
ollama run qwen2.5-coder:14b --verbose "Write a Python function to reverse a string"
You should see the model generate a response. The --verbose flag shows performance stats after the response completes, including tokens per second. Expected performance by chip:
M1 base: 8-12 tokens per second
M2/M3 Pro: 15-22 tokens per second
M3/M4 Max: 25-40 tokens per second
Those are generation speeds. The first response takes a few extra seconds while the model loads into GPU memory. Subsequent responses are faster.
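Tokens per second translates directly into wait time: divide expected response length by generation speed. A rough sketch (the gen_seconds helper is hypothetical, and it ignores prompt processing and the one-time model load):

```shell
# gen_seconds TOKENS TPS: approximate wall time to generate a response.
gen_seconds() {
  awk -v t="$1" -v tps="$2" 'BEGIN { printf "%.0f\n", t / tps }'
}

gen_seconds 300 10   # ~30s for a 300-token answer on an M1 base
gen_seconds 300 30   # ~10s on an M3/M4 Max
```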
Verifying GPU Usage
While the model is loaded, check that Metal acceleration is working:
ollama ps
This shows running models. Look at the "Processor" column. It should say "100% GPU". If it says "100% CPU" or shows a split like "50% GPU / 50% CPU", something is wrong. The model might be too large for your available memory and is partially offloading to CPU.
If you see CPU offloading on a 16GB machine, close browsers and other apps to free memory, then try again.
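If you want to script this check, a minimal sketch, assuming "ollama ps" prints "100% GPU" in its Processor column when fully accelerated (column layout can vary by Ollama version):

```shell
# gpu_ok: succeed only if a loaded model is reported fully on the GPU.
gpu_ok() {
  grep -q "100% GPU"
}

if ollama ps | gpu_ok; then
  echo "Metal acceleration: OK"
else
  echo "Warning: possible CPU offload; free memory and retry"
fi
```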
Testing The API
OpenClaw doesn't talk to Ollama through the chat interface. It uses the HTTP API. Let's verify that's working too:
curl http://localhost:11434/v1/models
This should return a JSON list that includes your model. Then test a chat completion:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-coder:14b","messages":[{"role":"user","content":"Hello!"}]}'
You should get a JSON response with the model's reply. If either request fails, the most likely cause is that Ollama isn't running. Start it with "ollama serve" or launch the app.
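Since OpenClaw depends on tool calling specifically, it's worth exercising that path too. A hedged sketch using the OpenAI-compatible "tools" field of the chat completions endpoint; the get_time function below is invented for this test. A tool-capable model should answer with a tool_calls entry in the response rather than prose:

```shell
# Build the request payload separately so it's easy to inspect.
# get_time is a made-up tool definition used only to provoke a tool call.
PAYLOAD='{
  "model": "qwen2.5-coder:14b",
  "messages": [{"role": "user", "content": "What time is it?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_time",
      "description": "Return the current time",
      "parameters": {"type": "object", "properties": {}}
    }
  }]
}'

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "request failed; is ollama serve running?"
```

If the reply contains only prose describing what the model would do, you're seeing exactly the failure mode the DeepSeek-R1 warning is about.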
What Quantization Means
You might have noticed "Q4_K_M" in the model name or description. This refers to quantization, which is the process of compressing model weights to use fewer bits per parameter.
The original model uses 16-bit floating point numbers: 2 bytes per parameter. A 14B model at full precision would need 28GB just for weights. That doesn't fit on most machines.
Q4_K_M compresses to roughly 4.5 bits per parameter. The quality loss is surprisingly small for most tasks. You'd need a side-by-side comparison to notice the difference.
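You can sanity-check those numbers with parameters × bits ÷ 8. A sketch treating "14B" as exactly 14 billion parameters (the true count is slightly higher, and the download adds metadata on top of the raw weights):

```shell
# weights_gb PARAMS BITS: raw weight footprint in GB (1e9 bytes).
weights_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 / 1e9 }'
}

weights_gb 14e9 16    # FP16: 28.0 GB, too big for most laptops
weights_gb 14e9 4.5   # Q4_K_M: ~7.9 GB of weights in a ~9 GB download
```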
Higher quantization (Q5_K_M, Q6_K, Q8_0) preserves more quality but uses more memory. Lower quantization (Q3_K_M, Q2_K) saves memory but starts to noticeably degrade output quality.
If you have 32GB+ of RAM, you might want to try Q5_K_M for slightly better quality. On 16GB, Q4_K_M is the sweet spot between quality and memory footprint. Q3_K_M works if you're really squeezed.
ollama pull qwen2.5-coder:14b-q5_K_M # Better quality, ~11GB
ollama pull qwen2.5-coder:14b-q3_K_M # Smaller, ~7.3GB
But for following along with this tutorial, the default qwen2.5-coder:14b (which is Q4_K_M) is what you want.
Model working? API responding? GPU at 100%? Good. Let's install OpenClaw.