10-performance-and-testing.txt

From: Running OpenClaw Locally with Ollama on Apple Silicon

PERFORMANCE TUNING AND END-TO-END TESTING
=========================================

You've got everything installed, configured, and hardened. This last section covers getting the most out of Apple Silicon's hardware and verifying that the complete stack works before you rely on it.

Keeping the Model Warm
----------------------

By default, Ollama unloads models from memory after 5 minutes of inactivity. The next request triggers a cold start that takes 10 to 30 seconds while the model reloads into GPU memory. For an always-ready assistant, keep the model in memory:

    export OLLAMA_KEEP_ALIVE="24h"

This tells Ollama to keep models loaded for 24 hours after the last request. Set it in your shell profile (~/.zshrc) to make it permanent.

You can also pre-warm the model after a reboot so it's ready before your first question:

    curl http://localhost:11434/api/generate \
      -d '{"model":"qwen2.5-coder:14b","prompt":"warmup","keep_alive":"24h"}'

Flash Attention and KV Cache Optimization
-----------------------------------------

Two environment variables improve memory efficiency:

    export OLLAMA_FLASH_ATTENTION=1
    export OLLAMA_KV_CACHE_TYPE="q8_0"

Flash attention reduces the memory overhead of the attention mechanism. The quantized KV cache (q8_0) saves approximately 50% of the memory used for conversation context. This matters most on 16GB machines, where every megabyte counts. Add both to your shell profile.

Custom Modelfile for Locked Settings
------------------------------------

Instead of relying on defaults, create a Modelfile that locks in your preferred context window. Create a file called Modelfile with these contents:

    FROM qwen2.5-coder:14b
    PARAMETER num_ctx 16384
    PARAMETER num_gpu 99

Then build a custom model from it:

    ollama create openclaw-coder -f Modelfile

Now reference "openclaw-coder" in your OpenClaw config instead of "qwen2.5-coder:14b". This ensures the context window is always 16K regardless of what Ollama's defaults happen to be.
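The three environment variables above can be made permanent with a small idempotent snippet. This is a sketch, not part of Ollama or OpenClaw: the `add_line` helper and the `ZSHRC` override are my own naming, with `ZSHRC` defaulting to ~/.zshrc so you can point it at a different profile for testing.

```shell
# Append each tuning variable to the shell profile only if it is not
# already there, so re-running the snippet never duplicates lines.
# ZSHRC defaults to ~/.zshrc; override it to target a different file.
ZSHRC="${ZSHRC:-$HOME/.zshrc}"

add_line() {
  # -x matches the whole line, -F disables regex interpretation
  grep -qxF "$1" "$ZSHRC" 2>/dev/null || echo "$1" >> "$ZSHRC"
}

add_line 'export OLLAMA_KEEP_ALIVE="24h"'
add_line 'export OLLAMA_FLASH_ATTENTION=1'
add_line 'export OLLAMA_KV_CACHE_TYPE="q8_0"'
```

Run it twice and the profile still contains each line exactly once. Restart Ollama (or open a new terminal) for the variables to take effect.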
If you're on 16GB RAM, use 8192 instead of 16384 for num_ctx. On 32GB+, you can push to 32768. The num_gpu 99 parameter means "use all GPU layers." This is the default on Apple Silicon, but making it explicit doesn't hurt.

Context Window and OpenClaw's Overhead
--------------------------------------

Here's something that isn't obvious: OpenClaw's system prompt alone consumes approximately 17,000 tokens. The agent also uses compaction and memory files to manage long conversations, but tool-heavy workflows fill context fast.

On a 16GB machine with num_ctx set to 8192, you have about 8192 minus 17,000... wait. That's a problem. The system prompt alone exceeds an 8K context. In practice, OpenClaw compacts the system prompt and manages context more intelligently than raw token counting suggests. But the point stands: keep your context window as large as your RAM allows. On 16GB, use 8192 and accept that complex multi-step tasks may hit the ceiling. On 24GB+, use 16384 to 32768.

Monitoring During Operation
---------------------------

Keep an eye on these while using OpenClaw:

GPU utilization:

    ollama ps

This should show "100% GPU" in the processor column. If it shows any CPU split, the model is partially offloaded and will be slower.

Memory pressure: open Activity Monitor and click the Memory tab. The "Memory Pressure" graph at the bottom should stay green. Yellow means you're starting to swap. Red means pain.

Token speed:

    ollama run qwen2.5-coder:14b --verbose "test"

Look for "eval rate: XX.XX tokens/s" in the output. This is your generation speed. Compare it against the expected numbers for your chip (8-12 tokens/s for M1, 15-22 for Pro, 25-40 for Max).

Thermal Throttling on Laptops
-----------------------------

The M-series chips sustain peak GPU clocks for about 90 seconds before thermal throttling kicks in. On a MacBook, sustained inference gets maybe 10-15% slower after the first couple of minutes.
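The token-speed check can be scripted so you don't eyeball the number every time. A sketch, assuming Ollama's --verbose stats include an "eval rate:" line as described above; the `check_eval_rate` function and the floor values are my own naming, not an Ollama feature.

```shell
# Read Ollama's --verbose stats from stdin, extract the "eval rate"
# figure, and exit non-zero when tokens/s is below the given floor.
# awk handles the comparison because shell arithmetic is integer-only.
check_eval_rate() {
  floor="$1"
  rate=$(grep '^eval rate:' | awk '{print $3}')
  awk -v r="$rate" -v f="$floor" 'BEGIN { exit (r + 0 >= f + 0) ? 0 : 1 }'
}
```

Usage (stats print to stderr, hence the redirect; a floor of 15 suits a Pro chip, try ~8 for base M1 and ~25 for Max):

    ollama run qwen2.5-coder:14b --verbose "test" 2>&1 | check_eval_rate 15 && echo "speed OK"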
For a demo or extended use:

- Elevate the laptop for better airflow under the chassis
- Use a desk fan pointed at the base
- Avoid clamshell mode (closed lid with external display) unless you have good ventilation around the hinge area

This is minor for most uses. It only matters during long conversations where the model runs continuously for minutes at a time.

End-to-End Verification Checklist
---------------------------------

Run these commands in sequence to confirm everything works:

Step 1: Ollama is running and the model is loaded

    curl http://localhost:11434/
    ollama list | grep qwen2.5-coder
    ollama ps

Step 2: The OpenClaw gateway is up

    openclaw gateway status
    curl -s http://localhost:18789/health

Step 3: OpenClaw can see the Ollama model

    openclaw models list
    openclaw models status

Step 4: Full diagnostic

    openclaw doctor --fix

Step 5: Security audit (do not skip this)

    openclaw security audit --deep

Step 6: Verify no cloud fallback

    openclaw logs --follow
    # In another terminal, send a test message through the TUI or Dashboard.
    # Watch the logs for any "anthropic" or "openai" references.

Test Conversations
------------------

Open the Dashboard or TUI and try these in order:

Test 1: "What day is it today?"
Confirms the model responds at all. If nothing comes back, check the troubleshooting section.

Test 2: "List the files in my current directory"
Confirms tool calling works. The agent should actually execute a shell command and return real filesystem contents. Not describe what it would do. Actually do it.

Test 3: "Create a file called test.txt with 'Hello from OpenClaw' inside it, then read it back to me"
Confirms multi-step tool use and file I/O.

Test 4: "What's in my ~/.openclaw/openclaw.json file?"
Confirms file reading works and the agent has appropriate filesystem access.

If all four pass, your stack is working correctly.
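The verification steps above can be wrapped in a small fail-fast runner so the first broken layer is obvious. A sketch: the `check` helper and the `RUN_CHECKS` guard are hypothetical glue of my own, not an OpenClaw feature, and the guarded block only touches the live stack when you set RUN_CHECKS=1.

```shell
# Run each verification command in order and stop at the first failure.
# "check" echoes the command, runs it quietly, and reports OK or FAILED.
check() {
  printf '>> %s\n' "$*"
  if "$@" > /dev/null 2>&1; then
    printf '   OK\n'
  else
    printf '   FAILED\n' >&2
    return 1
  fi
}

# Guarded so the file can be sourced without requiring a live stack.
if [ "${RUN_CHECKS:-0}" = "1" ]; then
  check curl -fsS http://localhost:11434/ &&
    check ollama ps &&
    check openclaw gateway status &&
    check curl -fsS http://localhost:18789/health &&
    check openclaw models status &&
    check openclaw doctor --fix &&
    check openclaw security audit --deep
fi
```

The && chain means a failed gateway check skips the model and audit checks entirely, which is what you want: fix layers in order, bottom up.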
Group Session Tips
------------------

If you're setting this up for multiple people at once (meetup, workshop, team session), here's what to prepare in advance:

Pre-download models on every machine. Thirty people pulling 9GB simultaneously over the same Wi-Fi will saturate any connection. If your venue allows it, distribute the model file via USB drives.

Set OPENCLAW_DISABLE_BONJOUR=1 on every machine. Without this, every OpenClaw instance broadcasts its presence on the network. With thirty instances on the same Wi-Fi, you get mDNS collisions, device discovery confusion, and potentially crashes.

Use the Web Dashboard, not messaging platforms. For a group demo, the Dashboard is simpler and more reliable: no external service dependencies, no pairing codes, no account issues.

Budget timing realistically:

    Ollama install:                  2 minutes
    Model pull (pre-downloaded):     0 minutes
    Model pull (live, 100 Mbps):     12-15 minutes
    OpenClaw install:                1-3 minutes
    Configuration + onboarding:      5 minutes
    Security audit + hardening:      3 minutes
    First successful conversation:   under 1 minute

Pre-prepared attendees: about 15 minutes total. Downloading models live: 30+ minutes. Plan accordingly.

+----------------------------------------------------------+
| The two things that matter most:                         |
|                                                          |
| 1. Pre-download the model. Nothing kills momentum like   |
|    thirty people waiting for a 9GB download.             |
|                                                          |
| 2. Run the security audit. Not later. Not eventually.    |
|    Before you use it. Before you demo it.                |
+----------------------------------------------------------+

That's the complete guide. You have a fully local AI assistant running on your Mac's GPU, configured to never contact any cloud service, hardened against the known attack vectors, and optimized for your specific hardware. Whether you decide to use it daily or just wanted to understand how local AI works, you now have all the pieces. Use it wisely.
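The two totals in the timing budget are simple addition. A sketch of the arithmetic, taking the upper bound of each range; the variable names are mine:

```shell
# Sum the per-step upper bounds from the timing table (minutes).
ollama_install=2
model_pull_predownloaded=0
openclaw_install=3       # upper end of the 1-3 minute range
configuration=5
security_hardening=3
first_conversation=1     # "under 1 minute", rounded up

total=$((ollama_install + model_pull_predownloaded + openclaw_install + configuration + security_hardening + first_conversation))
echo "pre-prepared: ${total} minutes"        # ~15 with a little buffer

# A live model pull over 100 Mbps adds 12-15 minutes on top.
echo "live download: $((total + 15))+ minutes"
```

The upper bounds sum to 14 minutes, which is where the "about 15 minutes total" comes from; adding a live 15-minute pull lands near the 30-minute mark.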
And if the lobsters start asking you to join Crustafarianism, maybe take a break from the terminal for a bit. Stay curious. Be careful. Have fun.
