PERFORMANCE TUNING AND END-TO-END TESTING
You've got everything installed, configured, and hardened. This last section covers getting the most out of Apple Silicon's hardware and verifying that the complete stack works before you rely on it.
Keeping The Model Warm
By default, Ollama unloads models from memory after 5 minutes of inactivity. The next request triggers a cold start that takes 10 to 30 seconds while the model reloads into GPU memory. For an always-ready assistant, keep the model in memory:
export OLLAMA_KEEP_ALIVE="24h"
This tells Ollama to keep models loaded for 24 hours after the last request. Set this in your shell profile (~/.zshrc) to make it permanent.
You can also pre-warm the model after a reboot so it's ready before your first question:
curl http://localhost:11434/api/generate \
-d '{"model":"qwen2.5-coder:14b","prompt":"warmup","keep_alive":"24h"}'
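If you want the warm-up to happen automatically at login, it can be scripted. A sketch, assuming the default port 11434 and the model name used throughout this guide; call it from a login item or LaunchAgent:

```shell
#!/bin/sh
# Pre-warm the model once Ollama's API starts answering.
MODEL="qwen2.5-coder:14b"

wait_for_ollama() {
  # Poll the API root until it responds, for up to ~30 seconds.
  i=0
  while [ "$i" -lt 30 ]; do
    curl -sf --max-time 2 http://localhost:11434/ >/dev/null && return 0
    sleep 1
    i=$((i + 1))
  done
  return 1
}

prewarm() {
  # Same request as above: a throwaway prompt with a long keep_alive.
  curl -sf http://localhost:11434/api/generate \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"warmup\",\"keep_alive\":\"24h\"}" \
    >/dev/null
}

# Usage: wait_for_ollama && prewarm
```

The wait loop matters because at login Ollama may not have finished starting; firing the warm-up request immediately would just fail.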
Flash Attention and KV Cache Optimization
Two environment variables that improve memory efficiency:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE="q8_0"
Flash attention reduces the memory overhead of the attention mechanism. The quantized KV cache (q8_0) saves approximately 50% of the memory used for conversation context. This matters most on 16GB machines where every megabyte counts.
Add both to your shell profile.
Custom Modelfile for Locked Settings
Instead of relying on defaults, create a Modelfile that locks in your preferred context window:
Create a file called Modelfile with these contents:
FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384
PARAMETER num_gpu 99
Then build a custom model from it:
ollama create openclaw-coder -f Modelfile
Now reference "openclaw-coder" in your OpenClaw config instead of "qwen2.5-coder:14b". This ensures the context window is always 16K regardless of what Ollama's defaults happen to be.
If you're on 16GB RAM, use 8192 instead of 16384 for num_ctx. On 32GB+, you can push to 32768.
The num_gpu 99 parameter means "use all GPU layers." This is the default on Apple Silicon but making it explicit doesn't hurt.
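The RAM-based sizing above can be scripted rather than decided by hand. A sketch, assuming macOS's sysctl hw.memsize for reading installed RAM (it falls back to the conservative 8192 anywhere that key doesn't exist); the model name matches this guide:

```shell
#!/bin/sh
# Pick num_ctx from installed RAM, then write the Modelfile.
mem_gb=$(( $(sysctl -n hw.memsize 2>/dev/null || echo 0) / 1024 / 1024 / 1024 ))
if [ "$mem_gb" -ge 32 ]; then
  ctx=32768
elif [ "$mem_gb" -ge 24 ]; then
  ctx=16384
else
  ctx=8192      # 16GB machines, or anywhere hw.memsize is unavailable
fi

cat > Modelfile <<EOF
FROM qwen2.5-coder:14b
PARAMETER num_ctx $ctx
PARAMETER num_gpu 99
EOF
echo "wrote Modelfile with num_ctx=$ctx"
```

Follow it with the same ollama create command as above to build the custom model.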
Context Window and OpenClaw's Overhead
Here's something that isn't obvious: OpenClaw's system prompt alone consumes approximately 17,000 tokens. The agent also uses compaction and memory files to manage long conversations, but tool-heavy workflows fill context fast.
On a 16GB machine with num_ctx set to 8192, the arithmetic doesn't work: 8192 minus roughly 17,000 is a negative number. The system prompt alone exceeds an 8K context.
In practice, OpenClaw compacts the system prompt and manages context more intelligently than raw token counting suggests. But the point stands: keep your context window as large as your RAM allows. On 16GB, use 8192 and accept that complex multi-step tasks may hit the ceiling. On 24GB+, use 16384 to 32768.
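To make the budget concrete, here's the raw arithmetic for each num_ctx value, using the approximate 17,000-token figure quoted above (before any compaction):

```shell
#!/bin/sh
# Tokens left for conversation after the system prompt, per context size.
SYSTEM_PROMPT_TOKENS=17000
for ctx in 8192 16384 32768; do
  echo "num_ctx=$ctx -> $((ctx - SYSTEM_PROMPT_TOKENS)) tokens left"
done
# num_ctx=8192 -> -8808 tokens left
# num_ctx=16384 -> -616 tokens left
# num_ctx=32768 -> 15768 tokens left
```

Only the 32K setting leaves meaningful headroom against the raw prompt size, which is why compaction does the heavy lifting at smaller context windows.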
Monitoring During Operation
Keep an eye on these while using OpenClaw:
GPU utilization:
ollama ps
It should show "100% GPU" in the processor column. If it shows any
CPU split, the model is partially offloaded and will be slower.
Memory pressure:
Open Activity Monitor, click Memory tab. The "Memory Pressure"
graph at the bottom should stay green. Yellow means you're starting
to swap. Red means pain.
Token speed:
ollama run qwen2.5-coder:14b --verbose "test"
Look for "eval rate: XX.XX tokens/s" in the output. This is your
generation speed. Compare against the expected numbers for your
chip (8-12 for M1, 15-22 for Pro, 25-40 for Max).
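If you want a rolling view instead of one-off commands, a small loop does it. A sketch; the iteration count is capped so the loop always terminates (bump it, or wrap it in "while true", for a real session):

```shell
#!/bin/sh
# Poll ollama ps a few times instead of re-typing it.
count=${1:-3}    # number of snapshots; pass a larger value as the first argument
i=0
while [ "$i" -lt "$count" ]; do
  ollama ps 2>/dev/null || echo "ollama not running or not on PATH"
  sleep 1
  i=$((i + 1))
done
```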
Thermal Throttling on Laptops
The M-series chips sustain peak GPU clocks for about 90 seconds before thermal throttling kicks in. On a MacBook, sustained inference gets maybe 10-15% slower after the first couple of minutes.
For a demo or extended use:
Elevate the laptop for better airflow under the chassis
Use a desk fan pointed at the base
Avoid clamshell mode (closed lid with external display) unless
you have good ventilation around the hinge area
This is minor for most uses. It only matters during long conversations where the model runs continuously for minutes at a time.
End-to-End Verification Checklist
Run these commands in sequence to confirm everything works:
Step 1: Ollama is running and model is loaded
curl http://localhost:11434/
ollama list | grep qwen2.5-coder
ollama ps
Step 2: OpenClaw gateway is up
openclaw gateway status
curl -s http://localhost:18789/health
Step 3: OpenClaw can see the Ollama model
openclaw models list
openclaw models status
Step 4: Full diagnostic
openclaw doctor --fix
Step 5: Security audit (do not skip this)
openclaw security audit --deep
Step 6: Verify no cloud fallback
openclaw logs --follow
# In another terminal, send a test message through TUI or Dashboard
# Watch logs for any "anthropic" or "openai" references
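The scriptable parts of the checklist can also run as a single pass/fail sweep. A sketch, assuming the default ports from this guide; a missing tool or down service shows up as a FAIL rather than aborting the run (step 6, watching the logs, stays manual):

```shell
#!/bin/sh
# Run the checklist's scriptable checks and tally the results.
pass=0; fail=0
check() {
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"; pass=$((pass + 1))
  else
    echo "FAIL: $desc"; fail=$((fail + 1))
  fi
}

check "Ollama API answers"       curl -sf --max-time 3 http://localhost:11434/
check "model is pulled"          sh -c "ollama list | grep -q qwen2.5-coder"
check "gateway health endpoint"  curl -sf --max-time 3 http://localhost:18789/health
check "OpenClaw sees models"     openclaw models list
check "doctor passes"            openclaw doctor --fix
check "security audit passes"    openclaw security audit --deep

echo "$pass passed, $fail failed"
```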
Test Conversations
Open the Dashboard or TUI and try these in order:
Test 1: "What day is it today?"
Confirms the model responds at all. If nothing comes back, check
the troubleshooting section.
Test 2: "List the files in my current directory"
Confirms tool calling works. The agent should actually execute a
shell command and return real filesystem contents. Not describe
what it would do. Actually do it.
Test 3: "Create a file called test.txt with 'Hello from OpenClaw'
inside it, then read it back to me"
Confirms multi-step tool use and file I/O.
Test 4: "What's in my ~/.openclaw/openclaw.json file?"
Confirms file reading works and the agent has appropriate
filesystem access.
If all four pass, your stack is working correctly.
Group Session Tips
If you're setting this up for multiple people at once (meetup, workshop, team session), here's what to prepare in advance:
Pre-download models on every machine. Thirty people pulling 9GB
simultaneously over the same Wi-Fi will saturate any connection. If
your venue allows it, distribute the model file via USB drives.
Set OPENCLAW_DISABLE_BONJOUR=1 on every machine. Without this,
every OpenClaw instance broadcasts its presence on the network. With
thirty instances on the same Wi-Fi, you get mDNS collisions, device
discovery confusion, and potentially crashes.
Use the Web Dashboard, not messaging platforms. For a group demo, the
Dashboard is simpler and more reliable. No external service
dependencies, no pairing codes, no account issues.
Budget timing realistically:
Ollama install: 2 minutes
Model pull (pre-downloaded): 0 minutes
Model pull (live, 100 Mbps): 12-15 minutes
OpenClaw install: 1-3 minutes
Configuration + onboarding: 5 minutes
Security audit + hardening: 3 minutes
First successful conversation: under 1 minute
Pre-prepared attendees: about 15 minutes total.
Downloading models live: 30+ minutes. Plan accordingly.
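The two machine-prep items above (model cached locally, Bonjour broadcast disabled) can be verified with a quick pre-flight script on each machine. A sketch, using the model name and environment variable from this guide:

```shell
#!/bin/sh
# Workshop pre-flight: report anything still left to prepare.
warnings=0

if ollama list 2>/dev/null | grep -q "qwen2.5-coder:14b"; then
  echo "ok: model already pulled"
else
  echo "warn: model missing -- pull or copy it before attendees arrive"
  warnings=$((warnings + 1))
fi

if [ "${OPENCLAW_DISABLE_BONJOUR:-0}" = "1" ]; then
  echo "ok: Bonjour broadcast disabled"
else
  echo "warn: set OPENCLAW_DISABLE_BONJOUR=1 in the shell profile"
  warnings=$((warnings + 1))
fi

echo "$warnings item(s) still to fix"
```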
That's the complete guide. You have a fully local AI assistant running on your Mac's GPU, configured to never contact any cloud service, hardened against the known attack vectors, and optimized for your specific hardware.
Whether you decide to use it daily or just wanted to understand how local AI works, you now have all the pieces. Use it wisely. And if the lobsters start asking you to join Crustafarianism, maybe take a break from the terminal for a bit.
Stay curious. Be careful. Have fun.