Techalicious Academy / 2026-02-24-openclaw-ollama


CHOOSING AND PULLING A MODEL

Not all 14-billion-parameter models are created equal. The model choice matters more than you'd expect because OpenClaw relies heavily on tool calling, also known as function calling. The agent needs to invoke shell commands, read files, interact with APIs, and chain multiple actions together. A model that can write nice paragraphs but can't reliably call tools is useless for this purpose.
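For concreteness, here's roughly what a tool call looks like on the wire in the OpenAI-compatible format that Ollama exposes. The read_file function and its argument are made up for illustration:

```json
{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "call_0",
      "type": "function",
      "function": {
        "name": "read_file",
        "arguments": "{\"path\": \"notes.txt\"}"
      }
    }
  ]
}
```

A model with weak tool support either never emits this structure or mangles the JSON, and the agent loop breaks.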

The 14B Tier, Ranked for OpenClaw

Here's what's available and how well each works with OpenClaw's agent workflow:

qwen2.5-coder:14b

  This is the top recommendation. 9.0GB download. 32K context window.
  Apache 2.0 license. It has the best tool-calling support at the 14B
  tier and is the most recommended model in the OpenClaw community for
  local setups. It was built for code, which means it understands
  structured output and function signatures natively.

qwen3:14b

  Latest generation Qwen. Supports thinking mode. 128K native context
  window (though you'll rarely use that much locally). Excellent
  general-purpose model. Good tool calling but the coder variant is
  still better for OpenClaw's structured agent workflows.

qwen2.5:14b

  Strong general-purpose model with 32K context. Slightly worse at
  tool calling than the coder variant, but perfectly usable. Good
  choice if you want a more conversational assistant that can also
  do tools.

phi4:14b

  Microsoft's model. Good quality output but less tested with
  OpenClaw's specific tool-calling format. You might run into
  edge cases where it doesn't format tool calls correctly. Fine for
  experimentation, less reliable for a demo.

deepseek-r1:14b

  This is the trap. DeepSeek-R1 benchmarks incredibly well on
  reasoning tasks. It looks great on paper. But it does NOT support
  tool calling. At all. OpenClaw's agent workflow requires the model
  to emit structured function calls, and DeepSeek-R1 simply cannot
  do it. The agent will fail to execute commands, read files, or use
  any tools. You'll see it describe what it WOULD do instead of
  actually doing it.

  Do not use DeepSeek-R1 for OpenClaw.

Pulling Your Model

Once you've picked a model (we're going with qwen2.5-coder:14b for this tutorial), pull it:

ollama pull qwen2.5-coder:14b

This downloads about 9GB. How long that takes depends on your internet connection:

100 Mbps connection:  roughly 12-15 minutes
250 Mbps connection:  roughly 5-6 minutes
500 Mbps connection:  roughly 2-3 minutes
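Those estimates are just size over bandwidth, and you can reproduce them with a quick calculation. Real-world throughput usually runs below the advertised line rate, so expect the higher end:

```shell
# download time in minutes = (bytes * 8 bits/byte) / (Mbps * 1e6) / 60
for mbps in 100 250 500; do
  awk -v m="$mbps" \
    'BEGIN { printf "%3d Mbps: ~%.0f minutes\n", m, 9e9 * 8 / (m * 1e6) / 60 }'
done
# 100 Mbps: ~12 minutes
# 250 Mbps: ~5 minutes
# 500 Mbps: ~2 minutes
```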

If you're following this for a group session, download at home the night before. Do not attempt a 9GB download over conference Wi-Fi with thirty other people. It will take forever and might not finish. If your venue allows it, bring USB drives with the model files pre-loaded.
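If you do pre-load USB drives, the relevant files live in Ollama's model store: ~/.ollama/models by default on macOS and Linux, split into small manifests and large weight blobs. A minimal sketch, assuming that layout (the USB mount point in the example is hypothetical; adjust for your drive):

```shell
# copy_models SRC DEST: copy Ollama's model store (manifests + blobs)
# from one machine's store to a destination such as a USB drive.
copy_models() {
  src="$1"
  dest="$2"
  mkdir -p "$dest"
  # manifests are small index files; blobs hold the actual weights
  cp -R "$src/manifests" "$src/blobs" "$dest/"
}

# Example (hypothetical mount point):
# copy_models "$HOME/.ollama/models" /Volumes/USB/ollama-models
```

On the target machine, copy the files back into ~/.ollama/models; "ollama list" should then show the model without any download.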

Testing The Model Interactively

Once the download finishes, let's make sure it works:

ollama run qwen2.5-coder:14b --verbose "Write a Python function to reverse a string"

You should see the model generate a response. The --verbose flag shows performance stats after the response completes, including tokens per second. Expected performance by chip:

M1 base:           8-12 tokens per second
M2/M3 Pro:         15-22 tokens per second
M3/M4 Max:         25-40 tokens per second

Those are generation speeds. The first response takes a few extra seconds while the model loads into GPU memory. Subsequent responses are faster.

Verifying GPU Usage

While the model is loaded, check that Metal acceleration is working:

ollama ps

This shows running models. Look at the "Processor" column. It should say "100% GPU". If it says "100% CPU" or shows a split like "50% GPU / 50% CPU", something is wrong. The model might be too large for your available memory and is partially offloading to CPU.

If you see CPU offloading on a 16GB machine, close browsers and other apps to free memory, then try again.
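If you'd rather script the check than eyeball the table, something like this works. It assumes the current "ollama ps" output format, where a fully GPU-resident model shows "100% GPU" in the Processor column:

```shell
# check_gpu: read `ollama ps` output on stdin and flag CPU offloading
check_gpu() {
  if grep -q "100% GPU"; then
    echo "OK: model fully on GPU"
  else
    echo "WARNING: CPU offloading detected - free some memory and retry"
  fi
}

# Usage: ollama ps | check_gpu
```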

Testing The API

OpenClaw doesn't talk to Ollama through the chat interface. It uses the HTTP API. Let's verify that's working too:

curl http://localhost:11434/v1/models

This should return a JSON list that includes your model. Then test a chat completion:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-coder:14b","messages":[{"role":"user","content":"Hello!"}]}'

You should get a JSON response with the model's reply. If either of these fails, the most likely cause is that Ollama isn't running. Start it with "ollama serve" or launch the app.
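Since OpenClaw depends entirely on tool calling, it's worth verifying that the model emits structured tool calls through the same endpoint. A hedged sketch: the get_weather function below is a dummy defined only for this request, and recent Ollama versions accept a standard OpenAI-style tools array here:

```shell
# Write the request payload to a file (get_weather is a dummy tool)
cat > /tmp/tool_test.json <<'EOF'
{
  "model": "qwen2.5-coder:14b",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
EOF

# Send it; a tool-capable model answers with a "tool_calls" field
# instead of plain text (requires Ollama running locally)
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/tool_test.json || echo "request failed - is Ollama running?"
```

If the response contains "tool_calls" with a get_weather invocation, the model can drive OpenClaw's agent loop. DeepSeek-R1 fails this test.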

What Quantization Means

You might have noticed "Q4_K_M" in the model name or description. This refers to quantization, which is the process of compressing model weights to use fewer bits per parameter.

The original model stores its weights as 16-bit floating-point numbers: 2 bytes per parameter. A 14B model at that precision would need 28GB just for weights. That doesn't fit on most machines.

Q4_K_M compresses to roughly 4.5 bits per parameter. The quality loss is surprisingly small for most tasks. You'd need a side-by-side comparison to notice the difference.
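The memory math is straightforward to sketch from the bits-per-parameter figure. Real downloads run somewhat larger because some tensors stay at higher precision and the file carries metadata:

```shell
# weight size in GB = parameters * bits-per-parameter / 8 bits-per-byte / 1e9
for entry in "fp16:16" "q4_K_M:4.5"; do
  name=${entry%%:*}
  bits=${entry##*:}
  awk -v n="$name" -v b="$bits" \
    'BEGIN { printf "%-7s ~%.1f GB\n", n, 14e9 * b / 8 / 1e9 }'
done
# fp16    ~28.0 GB
# q4_K_M  ~7.9 GB
```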

Less aggressive quantization (Q5_K_M, Q6_K, Q8_0) keeps more bits per parameter and preserves more quality, but uses more memory. More aggressive quantization (Q3_K_M, Q2_K) saves memory but starts to noticeably degrade output quality.

If you have 32GB+ of RAM, you might want to try Q5_K_M for slightly better quality. On 16GB, Q4_K_M is the sweet spot between quality and memory footprint. Q3_K_M works if you're really squeezed.

ollama pull qwen2.5-coder:14b-q5_K_M   # Better quality, ~11GB
ollama pull qwen2.5-coder:14b-q3_K_M   # Smaller, ~7.3GB

But for following along with this tutorial, the default qwen2.5-coder:14b (which is Q4_K_M) is what you want.

Model working? API responding? GPU at 100%? Good. Let's install OpenClaw.