04-choosing-a-model.txt
From: Running OpenClaw Locally with Ollama on Apple Silicon
CHOOSING AND PULLING A MODEL
==============================
Not all 14-billion-parameter models are created equal. The model choice
matters more than you'd expect because OpenClaw relies heavily on tool
calling, also known as function calling. The agent needs to invoke shell
commands, read files, interact with APIs, and chain multiple actions
together. A model that can write nice paragraphs but can't reliably call
tools is useless for this purpose.
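Concretely, tool calling means the request describes available tools as JSON schemas and a capable model answers with a structured call rather than prose. A minimal sketch of such a request body in the OpenAI-compatible format Ollama exposes (the "run_shell" tool name and schema here are made up for illustration; OpenClaw defines its own tools):

```python
import json

# Hypothetical tool definition in the OpenAI-compatible format.
# A model with working tool support responds to this with a
# structured tool_calls entry, not a paragraph of prose.
request_body = {
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "List the files in /tmp"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "run_shell",  # illustrative name, not OpenClaw's
                "description": "Execute a shell command and return its output",
                "parameters": {
                    "type": "object",
                    "properties": {"command": {"type": "string"}},
                    "required": ["command"],
                },
            },
        }
    ],
}

payload = json.dumps(request_body)
```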
The 14B Tier, Ranked for OpenClaw
-----------------------------------
Here's what's available and how well each works with OpenClaw's agent
workflow:
qwen2.5-coder:14b
This is the top recommendation. 9.0GB download. 32K context window.
Apache 2.0 license. It has the best tool-calling support at the 14B
tier and is the most recommended model in the OpenClaw community for
local setups. It was built for code, which means it understands
structured output and function signatures natively.
qwen3:14b
Latest generation Qwen. Supports thinking mode. 128K native context
window (though you'll rarely use that much locally). Excellent
general-purpose model. Good tool calling but the coder variant is
still better for OpenClaw's structured agent workflows.
qwen2.5:14b
Strong general-purpose model with 32K context. Slightly worse at
tool calling than the coder variant, but perfectly usable. Good
choice if you want a more conversational assistant that can also
do tools.
phi4:14b
Microsoft's model. Good quality output but less tested with
OpenClaw's specific tool-calling format. You might run into
edge cases where it doesn't format tool calls correctly. Fine for
experimentation, less reliable for a demo.
deepseek-r1:14b
This is the trap. DeepSeek-R1 benchmarks incredibly well on
reasoning tasks. It looks great on paper. But it does NOT support
tool calling. At all. OpenClaw's agent workflow requires the model
to emit structured function calls, and DeepSeek-R1 simply cannot
do it. The agent will fail to execute commands, read files, or use
any tools. You'll see it describe what it WOULD do instead of
actually doing it.
Do not use DeepSeek-R1 for OpenClaw.
+----------------------------------------------------------+
| DeepSeek-R1 14B is a trap. |
| Benchmarks great. No tool calling. Useless for agents. |
| Use qwen2.5-coder:14b instead. |
+----------------------------------------------------------+
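You can spot this failure mode programmatically: a tool-capable model returns a tool_calls list in its response message, while a model without tool support only narrates in the content field. A sketch, using illustrative response shapes rather than captured output:

```python
def supports_tools(message: dict) -> bool:
    """Return True if an assistant message contains structured tool calls.

    `message` is the "message" object from a /v1/chat/completions
    response choice. A model that can't call tools leaves tool_calls
    absent or empty and just describes what it would do in `content`.
    """
    return bool(message.get("tool_calls"))

# Illustrative shapes of a working vs. non-working response:
good = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "type": "function",
        "function": {"name": "run_shell",
                     "arguments": '{"command": "ls /tmp"}'},
    }],
}
bad = {
    "role": "assistant",
    "content": "I would run `ls /tmp` to list the files in that directory.",
}
```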
Pulling Your Model
-------------------
Once you've picked a model (we're going with qwen2.5-coder:14b for
this tutorial), pull it:
ollama pull qwen2.5-coder:14b
This downloads about 9GB. How long that takes depends on your internet
connection:
100 Mbps connection: roughly 12-15 minutes
250 Mbps connection: roughly 5-6 minutes
500 Mbps connection: roughly 2-3 minutes
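Those estimates are plain bandwidth arithmetic: gigabytes times eight thousand megabits per gigabyte, divided by link speed. Real transfers run a bit slower because of protocol overhead and server throttling:

```python
def download_minutes(size_gb: float, mbps: float) -> float:
    """Idealized download time in minutes: size in gigabytes over
    link speed in megabits per second (1 GB = 8,000 megabits)."""
    return size_gb * 8000 / mbps / 60

# The 9 GB model over common home connections:
times = {mbps: round(download_minutes(9, mbps), 1) for mbps in (100, 250, 500)}
# → {100: 12.0, 250: 4.8, 500: 2.4}
```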
If you're following this for a group session, download at home the
night before. Do not attempt a 9GB download over conference Wi-Fi with
thirty other people. It will take forever and might not finish. If your
venue allows it, bring USB drives with the model files pre-loaded.
Testing The Model Interactively
---------------------------------
Once the download finishes, let's make sure it works:
ollama run qwen2.5-coder:14b --verbose "Write a Python function to reverse a string"
You should see the model generate a response. The --verbose flag shows
performance stats after the response completes, including tokens per
second. Expected performance by chip:
M1 base: 8-12 tokens per second
M2/M3 Pro: 15-22 tokens per second
M3/M4 Max: 25-40 tokens per second
Those are generation speeds. The first response takes a few extra
seconds while the model loads into GPU memory. Subsequent responses
are faster.
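Tokens per second translates directly into how long you wait for an answer. A rough estimate for a typical 400-token response, ignoring prompt processing and the one-time load delay (the speeds plugged in are the midpoints from the table above):

```python
def response_seconds(tokens: int, tok_per_sec: float) -> float:
    """Time to generate `tokens` tokens at a given generation speed.
    Ignores prompt-processing time and the one-off model-load delay."""
    return tokens / tok_per_sec

# A 400-token answer on an M1 base (~10 tok/s) vs. an M4 Max (~30 tok/s):
m1_base = response_seconds(400, 10)   # 40.0 seconds
m4_max = response_seconds(400, 30)    # ~13.3 seconds
```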
Verifying GPU Usage
--------------------
While the model is loaded, check that Metal acceleration is working:
ollama ps
This shows running models. Look at the "Processor" column. It should
say "100% GPU". If it says "100% CPU" or shows a split like "50% GPU /
50% CPU", something is wrong. The model might be too large for your
available memory and is partially offloading to CPU.
If you see CPU offloading on a 16GB machine, close browsers and
other apps to free memory, then try again.
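If you'd rather check this from a script than eyeball the table, you can pull the GPU percentage out of an "ollama ps" row. The example rows below mimic the layout described above; the exact column format can change between Ollama versions:

```python
import re

def gpu_fraction(ps_line: str) -> int:
    """Extract the GPU percentage from one `ollama ps` row.
    Returns 0 if the model is running entirely on CPU."""
    m = re.search(r"(\d+)%\s*GPU", ps_line)
    return int(m.group(1)) if m else 0

# Illustrative rows (column layout may differ in your Ollama version):
healthy = "qwen2.5-coder:14b  abc123  10 GB  100% GPU  4 minutes from now"
split   = "qwen2.5-coder:14b  abc123  10 GB  50% GPU / 50% CPU  4 minutes from now"
cpu_only = "qwen2.5-coder:14b  abc123  10 GB  100% CPU  4 minutes from now"
```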
Testing The API
-----------------
OpenClaw doesn't talk to Ollama through the chat interface. It uses
the HTTP API. Let's verify that's working too:
curl http://localhost:11434/v1/models
This should return a JSON list that includes your model. Then test a
chat completion:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-coder:14b","messages":[{"role":"user","content":"Hello!"}]}'
You should get a JSON response with the model's reply. If either of
these fails, Ollama isn't running. Start it with "ollama serve" or
launch the app.
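The same check can be scripted from Python with only the standard library. This sketch builds the exact request the curl command above sends; the sending step is left commented out so you can inspect the shape before pointing it at a running Ollama instance:

```python
import json
import urllib.request

# The chat-completion request body, identical to the curl example.
body = {
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment to actually send (requires Ollama serving on localhost:11434):
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```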
What Quantization Means
-------------------------
You might have noticed "Q4_K_M" in the model name or description.
This refers to quantization, which is the process of compressing model
weights to use fewer bits per parameter.
The original model uses 16-bit floating-point numbers: 2 bytes per
parameter. A 14B model at full precision would need 28GB just for
weights. That doesn't fit on most machines.
Q4_K_M compresses to roughly 4.5 bits per parameter. The quality loss
is surprisingly small for most tasks. You'd need a side-by-side
comparison to notice the difference.
Higher quantization (Q5_K_M, Q6_K, Q8_0) preserves more quality but
uses more memory. Lower quantization (Q3_K_M, Q2_K) saves memory but
starts to noticeably degrade output quality.
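The memory math is easy to sanity-check: weights-only size is parameter count times bits per parameter. Real downloads run somewhat larger than this estimate because embeddings, metadata, and some layers stay at higher precision:

```python
def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weights-only size in gigabytes (decimal units)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

fp16 = weights_gb(14, 16)     # 28.0 GB, matching the figure above
q4_k_m = weights_gb(14, 4.5)  # ~7.9 GB weights-only; the actual ~9 GB
                              # download includes higher-precision layers
```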
If you have 32GB+ of RAM, you might want to try Q5_K_M for slightly
better quality. On 16GB, Q4_K_M is the sweet spot between quality and
memory footprint. Q3_K_M works if you're really squeezed.
ollama pull qwen2.5-coder:14b-q5_K_M # Better quality, ~11GB
ollama pull qwen2.5-coder:14b-q3_K_M # Smaller, ~7.3GB
But for following along with this tutorial, the default
qwen2.5-coder:14b (which is Q4_K_M) is what you want.
Model working? API responding? GPU at 100%? Good. Let's install
OpenClaw.