PERFORMANCE TUNING AND END-TO-END TESTING
You've got everything installed, configured, and hardened. This last section covers getting the most out of Apple Silicon's hardware and verifying that the complete stack works before you rely on it.
Keeping The Model Warm
By default, Ollama unloads models from memory after 5 minutes of inactivity. The next request triggers a cold start that takes 10 to 30 seconds while the model reloads into GPU memory. For an always-ready assistant, keep the model in memory:
export OLLAMA_KEEP_ALIVE="24h"
This tells Ollama to keep models loaded for 24 hours after the last request. Set this in your shell profile (~/.zshrc) to make it permanent.
You can also pre-warm the model after a reboot so it's ready before your first question:
curl http://localhost:11434/api/generate \
-d '{"model":"qwen2.5-coder:14b","prompt":"warmup","keep_alive":"24h"}'
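If you want the warm-up to happen automatically at login, it can be scripted. A sketch, assuming the default port 11434 and the model name used throughout this guide; call it from a login item or LaunchAgent:

```shell
#!/bin/sh
# Pre-warm the model once Ollama's API starts answering.
MODEL="qwen2.5-coder:14b"

wait_for_ollama() {
  # Poll the API root until it responds, for up to ~30 seconds.
  i=0
  while [ "$i" -lt 30 ]; do
    curl -sf --max-time 2 http://localhost:11434/ >/dev/null && return 0
    sleep 1
    i=$((i + 1))
  done
  return 1
}

prewarm() {
  # Same request as above: a throwaway prompt with a long keep_alive.
  curl -sf http://localhost:11434/api/generate \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"warmup\",\"keep_alive\":\"24h\"}" \
    >/dev/null
}

# Usage: wait_for_ollama && prewarm
```

The wait loop matters because at login Ollama may not have finished starting; firing the warm-up request immediately would just fail.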
Flash Attention and KV Cache Optimization
Two environment variables that improve memory efficiency:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE="q8_0"
Flash attention reduces the memory overhead of the attention mechanism. The quantized KV cache (q8_0) saves approximately 50% of the memory used for conversation context. This matters most on 16GB machines where every megabyte counts.
Add both to your shell profile.
Custom Modelfile for Locked Settings
Instead of relying on defaults, create a Modelfile that locks in your preferred context window:
Create a file called Modelfile with these contents:
FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384
PARAMETER num_gpu 99
Then build a custom model from it:
ollama create openclaw-coder -f Modelfile
Now reference "openclaw-coder" in your OpenClaw config instead of "qwen2.5-coder:14b". This ensures the context window is always 16K regardless of what Ollama's defaults happen to be.
If you're on 16GB RAM, use 8192 instead of 16384 for num_ctx. On 32GB+, you can push to 32768.
The num_gpu 99 parameter means "use all GPU layers." This is the default on Apple Silicon but making it explicit doesn't hurt.
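The RAM-based sizing above can be scripted rather than decided by hand. A sketch, assuming macOS's sysctl hw.memsize for reading installed RAM (it falls back to the conservative 8192 anywhere that key doesn't exist); the model name matches this guide:

```shell
#!/bin/sh
# Pick num_ctx from installed RAM, then write the Modelfile.
mem_gb=$(( $(sysctl -n hw.memsize 2>/dev/null || echo 0) / 1024 / 1024 / 1024 ))
if [ "$mem_gb" -ge 32 ]; then
  ctx=32768
elif [ "$mem_gb" -ge 24 ]; then
  ctx=16384
else
  ctx=8192      # 16GB machines, or anywhere hw.memsize is unavailable
fi

cat > Modelfile <<EOF
FROM qwen2.5-coder:14b
PARAMETER num_ctx $ctx
PARAMETER num_gpu 99
EOF
echo "wrote Modelfile with num_ctx=$ctx"
```

Follow it with the same ollama create command as above to build the custom model.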
Context Window and OpenClaw's Overhead
Here's something that isn't obvious: OpenClaw's system prompt alone consumes approximately 17,000 tokens. The agent also uses compaction and memory files to manage long conversations, but tool-heavy workflows fill context fast.
On a 16GB machine with num_ctx set to 8192, the arithmetic doesn't work: 8192 minus roughly 17,000 is a negative number. The system prompt alone exceeds an 8K context.
In practice, OpenClaw compacts the system prompt and manages context more intelligently than raw token counting suggests. But the point stands: keep your context window as large as your RAM allows. On 16GB, use 8192 and accept that complex multi-step tasks may hit the ceiling. On 24GB+, use 16384 to 32768.
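To make the budget concrete, here's the raw arithmetic for each num_ctx value, using the approximate 17,000-token figure quoted above (before any compaction):

```shell
#!/bin/sh
# Tokens left for conversation after the system prompt, per context size.
SYSTEM_PROMPT_TOKENS=17000
for ctx in 8192 16384 32768; do
  echo "num_ctx=$ctx -> $((ctx - SYSTEM_PROMPT_TOKENS)) tokens left"
done
# num_ctx=8192 -> -8808 tokens left
# num_ctx=16384 -> -616 tokens left
# num_ctx=32768 -> 15768 tokens left
```

Only the 32K setting leaves meaningful headroom against the raw prompt size, which is why compaction does the heavy lifting at smaller context windows.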
Monitoring During Operation
Keep an eye on these while using OpenClaw:
GPU utilization:
ollama ps
It should show "100% GPU" in the processor column. If it shows any
CPU split, the model is partially offloaded and will be slower.
Memory pressure:
Open Activity Monitor, click Memory tab. The "Memory Pressure"
graph at the bottom should stay green. Yellow means you're starting
to swap. Red means pain.
Token speed:
ollama run qwen2.5-coder:14b --verbose "test"
Look for "eval rate: XX.XX tokens/s" in the output. This is your
generation speed. Compare against the expected numbers for your
chip (8-12 for M1, 15-22 for Pro, 25-40 for Max).
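If you want a rolling view instead of one-off commands, a small loop does it. A sketch; the iteration count is capped so the loop always terminates (bump it, or wrap it in "while true", for a real session):

```shell
#!/bin/sh
# Poll ollama ps a few times instead of re-typing it.
count=${1:-3}    # number of snapshots; pass a larger value as the first argument
i=0
while [ "$i" -lt "$count" ]; do
  ollama ps 2>/dev/null || echo "ollama not running or not on PATH"
  sleep 1
  i=$((i + 1))
done
```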
Thermal Throttling on Laptops
The M-series chips sustain peak GPU clocks for about 90 seconds before thermal throttling kicks in. On a MacBook, sustained inference gets maybe 10-15% slower after the first couple of minutes.
For a demo or extended use:
Elevate the laptop for better airflow under the chassis
Use a desk fan pointed at the base
Avoid clamshell mode (closed lid with external display) unless
you have good ventilation around the hinge area
This is minor for most uses. It only matters during long conversations where the model runs continuously for minutes at a time.
End-to-End Verification Checklist
Run these commands in sequence to confirm everything works:
Step 1: Ollama is running and model is loaded
curl http://localhost:11434/
ollama list | grep qwen2.5-coder
ollama ps
Step 2: OpenClaw gateway is up
openclaw gateway status
curl -s http://localhost:18789/health
Step 3: OpenClaw can see the Ollama model
openclaw models list
openclaw models status
Step 4: Full diagnostic
openclaw doctor --fix
Step 5: Security audit (do not skip this)
openclaw security audit --deep
Step 6: Verify no cloud fallback
openclaw logs --follow
# In another terminal, send a test message through TUI or Dashboard
# Watch logs for any "anthropic" or "openai" references
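The scriptable parts of the checklist can also run as a single pass/fail sweep. A sketch, assuming the default ports from this guide; a missing tool or down service shows up as a FAIL rather than aborting the run (step 6, watching the logs, stays manual):

```shell
#!/bin/sh
# Run the checklist's scriptable checks and tally the results.
pass=0; fail=0
check() {
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"; pass=$((pass + 1))
  else
    echo "FAIL: $desc"; fail=$((fail + 1))
  fi
}

check "Ollama API answers"       curl -sf --max-time 3 http://localhost:11434/
check "model is pulled"          sh -c "ollama list | grep -q qwen2.5-coder"
check "gateway health endpoint"  curl -sf --max-time 3 http://localhost:18789/health
check "OpenClaw sees models"     openclaw models list
check "doctor passes"            openclaw doctor --fix
check "security audit passes"    openclaw security audit --deep

echo "$pass passed, $fail failed"
```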
Test Conversations
Open the Dashboard or TUI and try these in order:
Test 1: "What day is it today?"
Confirms the model responds at all. If nothing comes back, check
the troubleshooting section.
Test 2: "List the files in my current directory"
Confirms tool calling works. The agent should actually execute a
shell command and return real filesystem contents. Not describe
what it would do. Actually do it.
Test 3: "Create a file called test.txt with 'Hello from OpenClaw'
inside it, then read it back to me"
Confirms multi-step tool use and file I/O.
Test 4: "What's in my ~/.openclaw/openclaw.json file?"
Confirms file reading works and the agent has appropriate
filesystem access.
If all four pass, your stack is working correctly.
Group Session Tips
If you're setting this up for multiple people at once (meetup, workshop, team session), here's what to prepare in advance:
Pre-download models on every machine. Thirty people pulling 9GB
simultaneously over the same Wi-Fi will saturate any connection. If
your venue allows it, distribute the model file via USB drives.
Set OPENCLAW_DISABLE_BONJOUR=1 on every machine. Without this,
every OpenClaw instance broadcasts its presence on the network. With
thirty instances on the same Wi-Fi, you get mDNS collisions, device
discovery confusion, and potentially crashes.
Use the Web Dashboard, not messaging platforms. For a group demo, the
Dashboard is simpler and more reliable. No external service
dependencies, no pairing codes, no account issues.
Budget timing realistically:
Ollama install: 2 minutes
Model pull (pre-downloaded): 0 minutes
Model pull (live, 100 Mbps): 12-15 minutes
OpenClaw install: 1-3 minutes
Configuration + onboarding: 5 minutes
Security audit + hardening: 3 minutes
First successful conversation: under 1 minute
Pre-prepared attendees: about 15 minutes total.
Downloading models live: 30+ minutes. Plan accordingly.
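The two machine-prep items above (model cached locally, Bonjour broadcast disabled) can be verified with a quick pre-flight script on each machine. A sketch, using the model name and environment variable from this guide:

```shell
#!/bin/sh
# Workshop pre-flight: report anything still left to prepare.
warnings=0

if ollama list 2>/dev/null | grep -q "qwen2.5-coder:14b"; then
  echo "ok: model already pulled"
else
  echo "warn: model missing -- pull or copy it before attendees arrive"
  warnings=$((warnings + 1))
fi

if [ "${OPENCLAW_DISABLE_BONJOUR:-0}" = "1" ]; then
  echo "ok: Bonjour broadcast disabled"
else
  echo "warn: set OPENCLAW_DISABLE_BONJOUR=1 in the shell profile"
  warnings=$((warnings + 1))
fi

echo "$warnings item(s) still to fix"
```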
That's the complete guide. You have a fully local AI assistant running on your Mac's GPU, configured to never contact any cloud service, hardened against the known attack vectors, and optimized for your specific hardware.
Whether you decide to use it daily or just wanted to understand how local AI works, you now have all the pieces. Use it wisely. And if the lobsters start asking you to join Crustafarianism, maybe take a break from the terminal for a bit.
Stay curious. Be careful. Have fun.