10-performance-and-testing.txt
From: Running OpenClaw Locally with Ollama on Apple Silicon
PERFORMANCE TUNING AND END-TO-END TESTING
===========================================
You've got everything installed, configured, and hardened. This last
section covers getting the most out of Apple Silicon's hardware and
verifying that the complete stack works before you rely on it.
Keeping The Model Warm
------------------------
By default, Ollama unloads models from memory after 5 minutes of
inactivity. The next request triggers a cold start that takes 10 to
30 seconds while the model reloads into GPU memory. For an always-ready
assistant, keep the model in memory:
export OLLAMA_KEEP_ALIVE="24h"
This tells Ollama to keep models loaded for 24 hours after the last
request. The variable has to be visible to the Ollama server process:
if you start ollama serve from a terminal, putting the export in your
shell profile (~/.zshrc) makes it permanent; if you run the menu-bar
app instead, set it with launchctl setenv OLLAMA_KEEP_ALIVE 24h and
restart Ollama.
You can also pre-warm the model after a reboot so it's ready before
your first question:
curl http://localhost:11434/api/generate \
-d '{"model":"qwen2.5-coder:14b","prompt":"warmup","keep_alive":"24h"}'
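The warm-up call can be wrapped in a small helper that is safe to run
at login. A sketch, assuming the default port and the model name used
throughout this guide; it skips quietly if the server isn't up yet:

```shell
#!/bin/sh
# Warm-up sketch: pre-load the model so the first real question is fast.
# URL and model name are this guide's defaults; adjust for your setup.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
MODEL="${MODEL:-qwen2.5-coder:14b}"
if curl -sf --max-time 2 "$OLLAMA_URL/" >/dev/null 2>&1; then
  curl -s "$OLLAMA_URL/api/generate" \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"warmup\",\"keep_alive\":\"24h\"}" \
    >/dev/null
  msg="warmed $MODEL"
else
  msg="ollama not reachable at $OLLAMA_URL (skipping warm-up)"
fi
echo "$msg"
```

Hook it into a login item or a launchd agent and the model is loaded
before you open your first terminal.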
Flash Attention and KV Cache Optimization
-------------------------------------------
Two environment variables that improve memory efficiency:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE="q8_0"
Flash attention reduces the memory overhead of the attention mechanism.
The quantized KV cache (q8_0) roughly halves the memory used for
conversation context compared to the default f16 cache, and note that
Ollama only applies it when flash attention is enabled. This matters
most on 16GB machines where every megabyte counts.
Add both to your shell profile, alongside OLLAMA_KEEP_ALIVE, so the
Ollama server sees them.
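To see why the 50% figure matters, here is back-of-envelope KV cache
arithmetic. The model-shape numbers (layers, KV heads, head dimension)
are illustrative assumptions, not measured values for qwen2.5-coder:14b;
the ratio between the two results is the point:

```shell
# Rough KV cache sizing: 2 tensors (K and V) per layer, per token.
# layers/kv_heads/head_dim are assumed, illustrative values.
layers=48
kv_heads=8
head_dim=128
ctx=16384
elems=$(( 2 * layers * kv_heads * head_dim * ctx ))
f16_mb=$(( elems / 1024 / 1024 * 2 ))   # f16: 2 bytes per element
q8_mb=$((  elems / 1024 / 1024 ))       # q8_0: ~1 byte per element
echo "f16 cache at ${ctx} ctx:  ${f16_mb} MiB"
echo "q8_0 cache at ${ctx} ctx: ~${q8_mb} MiB"
```

A couple of gigabytes either way is the difference between green and
yellow memory pressure on a 16GB machine.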
Custom Modelfile for Locked Settings
--------------------------------------
Instead of relying on defaults, create a Modelfile that locks in your
preferred context window:
Create a file called Modelfile with these contents:
FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384
PARAMETER num_gpu 99
Then build a custom model from it:
ollama create openclaw-coder -f Modelfile
Now reference "openclaw-coder" in your OpenClaw config instead of
"qwen2.5-coder:14b". This ensures the context window is always 16K
regardless of what Ollama's defaults happen to be.
If you're on 16GB RAM, use 8192 instead of 16384 for num_ctx. On
32GB+, you can push to 32768.
The num_gpu 99 parameter means "use all GPU layers." This is the
default on Apple Silicon but making it explicit doesn't hurt.
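When provisioning several machines, the RAM-based choice of num_ctx can
be scripted. A sketch using this section's thresholds; RAM_GB is
supplied by you (on macOS, sysctl -n hw.memsize reports bytes):

```shell
# Pick num_ctx from installed RAM, using this guide's thresholds.
ram_gb="${RAM_GB:-16}"   # e.g. RAM_GB=$(( $(sysctl -n hw.memsize) / 1073741824 ))
if   [ "$ram_gb" -ge 32 ]; then num_ctx=32768
elif [ "$ram_gb" -ge 24 ]; then num_ctx=16384
else                            num_ctx=8192
fi
echo "PARAMETER num_ctx $num_ctx"
```

Pipe the output into the Modelfile you build with ollama create and
every machine gets a context window matched to its memory.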
Context Window and OpenClaw's Overhead
-----------------------------------------
Here's something that isn't obvious: OpenClaw's system prompt alone
consumes approximately 17,000 tokens. The agent also uses compaction
and memory files to manage long conversations, but tool-heavy
workflows fill context fast.
On a 16GB machine with num_ctx set to 8192, you have about 8192
minus 17000... wait. That's a problem. The system prompt alone
exceeds 8K context.
In practice, OpenClaw compacts the system prompt and manages context
more intelligently than raw token counting suggests (even 16384 is
below the raw prompt size). But the point stands: keep your context
window as large as your RAM allows. On 16GB, use 8192 and accept that
complex multi-step tasks may hit the ceiling. On 24GB+, use 16384 to
32768.
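Raw arithmetic makes the squeeze concrete (ignoring compaction, which
is exactly what saves you in practice):

```shell
# Tokens left after a ~17,000-token system prompt, before compaction.
system_prompt=17000
for num_ctx in 8192 16384 32768; do
  left=$(( num_ctx - system_prompt ))
  echo "num_ctx=${num_ctx}: ${left} tokens of headroom"
done
```

Only the 32768 window leaves positive headroom before compaction does
any work, which is why the advice above is "as large as RAM allows."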
Monitoring During Operation
-----------------------------
Keep an eye on these while using OpenClaw:
GPU utilization:
ollama ps
Should show "100% GPU" in the processor column. If it shows any
CPU split, the model is partially offloaded and will be slower.
Memory pressure:
Open Activity Monitor, click Memory tab. The "Memory Pressure"
graph at the bottom should stay green. Yellow means you're starting
to swap. Red means pain.
Token speed:
ollama run qwen2.5-coder:14b --verbose "test"
Look for "eval rate: XX.XX tokens/s" in the output. This is your
generation speed. Compare against the expected numbers for your chip
(8-12 tokens/s for a base M1, 15-22 for the Pro chips, 25-40 for Max).
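If you want the number on its own, say for comparing machines before a
workshop, a little awk pulls it out of the timing summary. The sample
line here is hypothetical but follows the "eval rate" format; pipe real
--verbose output through the same filter:

```shell
# Extract generation speed from the timing summary. The sample line is
# a hypothetical stand-in; substitute real --verbose output.
sample="eval rate:            21.47 tokens/s"
rate=$(printf '%s\n' "$sample" | awk '/eval rate/ { print $3 }')
echo "generation speed: ${rate} tokens/s"
```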
Thermal Throttling on Laptops
-------------------------------
The M-series chips sustain peak GPU clocks for about 90 seconds before
thermal throttling kicks in. On a MacBook, sustained inference gets
maybe 10-15% slower after the first couple of minutes.
For a demo or extended use:
Elevate the laptop for better airflow under the chassis
Use a desk fan pointed at the base
Avoid clamshell mode (closed lid with external display) unless
you have good ventilation around the hinge area
This is minor for most uses. It only matters during long conversations
where the model runs continuously for minutes at a time.
End-to-End Verification Checklist
------------------------------------
Run these commands in sequence to confirm everything works:
Step 1: Ollama is running and model is loaded
curl http://localhost:11434/
ollama list | grep qwen2.5-coder
ollama ps
Step 2: OpenClaw gateway is up
openclaw gateway status
curl -s http://localhost:18789/health
Step 3: OpenClaw can see the Ollama model
openclaw models list
openclaw models status
Step 4: Full diagnostic
openclaw doctor --fix
Step 5: Security audit (do not skip this)
openclaw security audit --deep
Step 6: Verify no cloud fallback
openclaw logs --follow
# In another terminal, send a test message through TUI or Dashboard
# Watch logs for any "anthropic" or "openai" references
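The log check in Step 6 can be scripted: save the log output to a file
and grep it for provider names. The log line written below is a
hypothetical stand-in just to make the snippet self-contained; point
the grep at wherever you save the real output of openclaw logs:

```shell
# Scan a saved log for cloud-provider strings. The log content written
# here is a hypothetical stand-in for real captured output.
log="${LOG_FILE:-/tmp/openclaw-check.log}"
printf 'routed request to local model qwen2.5-coder:14b\n' > "$log"
if grep -qiE 'anthropic|openai' "$log"; then
  verdict="cloud reference found: investigate before relying on this setup"
else
  verdict="no cloud references in $log"
fi
echo "$verdict"
```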
Test Conversations
--------------------
Open the Dashboard or TUI and try these in order:
Test 1: "What day is it today?"
Confirms the model responds at all. If nothing comes back, check
the troubleshooting section.
Test 2: "List the files in my current directory"
Confirms tool calling works. The agent should actually execute a
shell command and return real filesystem contents. Not describe
what it would do. Actually do it.
Test 3: "Create a file called test.txt with 'Hello from OpenClaw'
inside it, then read it back to me"
Confirms multi-step tool use and file I/O.
Test 4: "What's in my ~/.openclaw/openclaw.json file?"
Confirms file reading works and the agent has appropriate
filesystem access.
If all four pass, your stack is working correctly.
Group Session Tips
-------------------
If you're setting this up for multiple people at once (meetup, workshop,
team session), here's what to prepare in advance:
Pre-download models on every machine. Thirty people pulling 9GB
simultaneously over the same Wi-Fi will saturate any connection. If
your venue allows it, distribute the model file via USB drives.
Set OPENCLAW_DISABLE_BONJOUR=1 on every machine. Without this,
every OpenClaw instance broadcasts its presence on the network. With
thirty instances on the same Wi-Fi, you get mDNS collisions, device
discovery confusion, and potentially crashes.
Use the Web Dashboard, not messaging platforms. For a group demo, the
Dashboard is simpler and more reliable. No external service
dependencies, no pairing codes, no account issues.
Budget timing realistically:
Ollama install: 2 minutes
Model pull (pre-downloaded): 0 minutes
Model pull (live, 100 Mbps): 12-15 minutes
OpenClaw install: 1-3 minutes
Configuration + onboarding: 5 minutes
Security audit + hardening: 3 minutes
First successful conversation: under 1 minute
Pre-prepared attendees: about 15 minutes total.
Downloading models live: 30+ minutes. Plan accordingly.
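The budget above adds up as follows (minutes; the worst case uses the
live 100 Mbps pull, and neither total includes troubleshooting):

```shell
# Sum the session budget from this section, in minutes.
install=2; pull_pre=0; pull_live=15; openclaw=3; config=5; audit=3; chat=1
prepared=$(( install + pull_pre  + openclaw + config + audit + chat ))
live=$((     install + pull_live + openclaw + config + audit + chat ))
echo "pre-downloaded: ~${prepared} min"
echo "live download:  ~${live} min (before any troubleshooting)"
```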
+----------------------------------------------------------+
| The two things that matter most: |
| |
| 1. Pre-download the model. Nothing kills momentum like |
| thirty people waiting for a 9GB download. |
| |
| 2. Run the security audit. Not later. Not eventually. |
| Before you use it. Before you demo it. |
+----------------------------------------------------------+
That's the complete guide. You have a fully local AI assistant running
on your Mac's GPU, configured to never contact any cloud service,
hardened against the known attack vectors, and optimized for your
specific hardware.
Whether you decide to use it daily or just wanted to understand how
local AI works, you now have all the pieces. Use it wisely. And if
the lobsters start asking you to join Crustafarianism, maybe take a
break from the terminal for a bit.
Stay curious. Be careful. Have fun.