SYSTEM REQUIREMENTS
Before we dive in, let's make sure your hardware can handle local AI. The good news: it's more accessible than you might think.
The Golden Rule: RAM is Everything
Running AI locally is about one thing: memory. The model has to fit in RAM (or VRAM) to run. No exceptions.
Here's the simple math:
Model parameters × bytes per parameter (set by quantization) = RAM needed
22B parameters × Q6 (~0.75 bytes/param) = ~16.5GB
22B parameters × Q4 (~0.5 bytes/param) = ~11GB
12B parameters × Q6 (~0.75 bytes/param) = ~9GB
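The arithmetic above can be sketched in a couple of lines of shell. The bytes-per-parameter figures are the rough approximations used in this guide, not exact values:

```shell
# Back-of-envelope RAM estimate: billions of params x bytes per param.
# Q4 is roughly 0.5 bytes/param, Q6 roughly 0.75 bytes/param.
ram_needed() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f GB\n", p * b }'
}

ram_needed 22 0.75   # 22B at Q6 -> 16.5 GB
ram_needed 22 0.5    # 22B at Q4 -> 11.0 GB
ram_needed 12 0.75   # 12B at Q6 -> 9.0 GB
```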
But you also need headroom for:
- The operating system (~4-8GB)
- Context window (conversation history)
- Other applications
So the PRACTICAL requirements are:
RAM Requirements by Model Size
Model Size   Min RAM   Comfortable   Ideal
-------------------------------------------------
7B  (Q4)     8GB       16GB          16GB+
12B (Q6)     16GB      24GB          32GB
22B (Q4)     16GB      32GB          48GB+
22B (Q6)     24GB      48GB          64GB+
70B (Q4)     48GB      64GB          96GB+
For tonight's RPMax 22B at Q6:
- Minimum: 24GB (tight, might swap)
- Comfortable: 32GB
- Ideal: 48GB+
If you have 16GB RAM:
- Use a 12B model (Mistral-Nemo-12B-ArliAI-RPMax)
- Or use a Q4 quantization of the 22B (accepting some quality loss)
- Or expect slower performance due to swapping
Supported Operating Systems
Ollama runs on:
macOS: 10.15+ (Catalina or newer)
- Native Apple Silicon support (M1/M2/M3/M4)
- Intel Macs work, but more slowly
Linux: most distributions
- NVIDIA GPU support via CUDA
- AMD GPU support via ROCm
- CPU-only works fine
Windows: Windows 10/11
- NVIDIA GPU support
- Native Windows binary available
- WSL2 also works well
Best experience: Apple Silicon Mac, or Linux with an NVIDIA GPU
Still great: Windows with an NVIDIA GPU
Works fine: any modern computer with enough RAM
Mac Unified Memory Advantage
Here's why Macs are secretly amazing for local AI:
Traditional computers split memory:
- System RAM (for CPU) - maybe 16GB
- VRAM (for GPU) - maybe 8GB on a decent card
- Model has to fit in ONE of these
Apple Silicon is different:
- "Unified Memory" shared between CPU and GPU
- Your 64GB Mac = 64GB available for the model
- No artificial split
What this means in practice:
Gaming PC with 32GB RAM + RTX 3080 (10GB VRAM):
→ Model limited to 10GB (VRAM) for fast inference
→ Or 32GB (RAM) but slower CPU inference
Mac Studio with 64GB unified memory:
→ Full 64GB available
→ GPU-accelerated inference on entire model
→ Runs 70B models that would require expensive multi-GPU setups on a PC
The M1/M2/M3/M4 chips also have excellent memory bandwidth. AI inference is memory-bound, so this matters a lot.
Practical Mac recommendations:
Mac Mini M4 (16GB): 7B-12B models
Mac Mini M4 (24GB): 12B-22B models (Q4)
Mac Mini M4 Pro (48GB): 22B models comfortably
Mac Studio M2 Max (64GB+): 22B-70B models
Mac Studio M2 Ultra (128GB+): Multiple large models
My setup: Mac Studio M2 Max with 96GB runs 22B Q6 with tons of headroom. Response times are fast, no swapping.
Windows Considerations
Windows works great, especially with NVIDIA GPUs:
NVIDIA GPU (recommended):
- Install CUDA toolkit
- Ollama uses GPU automatically
- RTX 3090/4090 (24GB VRAM) = sweet spot
- RTX 3080/4080 (10-16GB) = good for smaller models
AMD GPU:
- Less mature support
- ROCm works on Linux better than Windows
- CPU fallback is fine
CPU only:
- Works, just slower
- More RAM = better
- 32GB+ recommended for 22B models
For Windows users without big GPUs:
- Stick to 7B-12B models
- Q4 quantization stretches further
- Still great for companions
Linux Considerations
Linux is the power user's choice:
NVIDIA GPU:
- Best supported
- Install NVIDIA drivers + CUDA
- Ollama auto-detects
AMD GPU:
- ROCm support
- Some models work better than others
- Check Ollama docs for compatibility
Server/headless:
- Ollama runs great as a service
- Access via API from other machines
- Perfect for home lab setups
Linux tip: If running headless, start Ollama as a service:
sudo systemctl enable ollama
sudo systemctl start ollama
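For the home-lab case above, note that the Ollama service listens only on localhost by default, so other machines can't reach the API until you change its bind address. A sketch, assuming the standard systemd install (the IP address below is illustrative):

```shell
# Override the bind address so the API is reachable on the LAN:
sudo systemctl edit ollama
# ...and in the editor, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama

# Then, from another machine (replace with your server's IP;
# 11434 is Ollama's default port):
curl http://192.168.1.50:11434/api/tags
```

The /api/tags endpoint lists installed models, which makes it a handy connectivity check.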
Checking Your Hardware
Mac:
Apple menu → About This Mac → Memory
Or in terminal:
system_profiler SPHardwareDataType | grep Memory
Windows:
Settings → System → About → Installed RAM
Or in PowerShell:
(Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB
Linux:
free -h
Or for detailed info:
grep MemTotal /proc/meminfo
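To compare what's actually free right now against the table earlier, a quick Linux one-liner (MemAvailable is reported in kB):

```shell
# Print currently available memory in GB
awk '/MemAvailable/ { printf "Available: %.1f GB\n", $2 / 1048576 }' /proc/meminfo
```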
Context Window and Memory
The context window (conversation history) also uses memory.
16K context ≈ 500MB - 1GB additional
32K context ≈ 1GB - 2GB additional
128K context ≈ 4GB - 8GB additional
(These are ballpark figures; exact usage also scales with the model's size.)
This is why we recommend 16K context for companions:
- Enough for long conversations
- Doesn't eat too much RAM
- Keeps character card influential
If you're tight on RAM, reduce context window first. Better to have a working model with shorter memory than a model that swaps to disk.
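If you do trim the context window, it's set via the num_ctx parameter in an Ollama Modelfile. A minimal sketch, where the FROM tag is illustrative (substitute the model you actually pulled):

```
FROM mistral-nemo
PARAMETER num_ctx 16384
```

Save that as Modelfile, then build a variant with ollama create companion-16k -f Modelfile (the name is arbitrary).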
Performance Expectations
What to expect at different RAM levels (22B Q6 model):
24GB RAM:
- Works but tight
- First response: 5-10 seconds
- May swap occasionally
- Close other apps
32GB RAM:
- Comfortable
- First response: 2-5 seconds
- Stable performance
48GB+ RAM:
- Fast
- First response: 1-3 seconds
- Room for multiple models
64GB+ RAM:
- Excellent
- Sub-second first token
- Can run larger models
These are rough estimates. Actual performance depends on:
- CPU/GPU speed
- Memory bandwidth
- Quantization level
- Context window size
- What else is running
Quick Recommendations
Budget build (under $1000):
- Used Mac Mini M1 (16GB) - runs 12B models well
- Or Linux PC with 32GB RAM + used RTX 3080
Mid-range ($1000-2000):
- Mac Mini M4 Pro (24-48GB)
- Or Windows PC with 32GB RAM + RTX 4070
High-end ($2000+):
- Mac Studio M2 Max (64GB+)
- Or Windows/Linux with 64GB RAM + RTX 4090
Enthusiast ($5000+):
- Mac Studio M2 Ultra (128GB+)
- Or multi-GPU Linux workstation
For tonight's workshop:
- If you have 16GB+, you're fine with 12B models
- If you have 32GB+, you can run 22B
- If you have less, follow along and try smaller models
Troubleshooting Memory Issues
If Ollama is slow or crashing:
1. Check what's using RAM:
- Mac: Activity Monitor → Memory
- Windows: Task Manager → Memory
- Linux: htop or top
2. Close unnecessary apps
- Browsers are memory hogs
- Docker containers add up
3. Reduce context window:
- PARAMETER num_ctx 8192 (instead of 16384)
4. Try smaller quantization:
- Q4 instead of Q6
- Loses some quality, gains headroom
5. Try smaller model:
- 12B instead of 22B
- Still great for companions
6. Check for swap usage:
- Mac: Activity Monitor → Memory → Swap Used
- Heavy swap = need more RAM or smaller model
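For step 6 on Linux, the same free command from earlier shows swap directly; the macOS equivalent is a sysctl query:

```shell
# Linux: show the header plus the swap line
# (heavy swap use means the model doesn't fit comfortably in RAM)
free -h | awk 'NR == 1 || /Swap/'

# macOS equivalent (run in Terminal):
#   sysctl vm.swapusage
```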
Next: Let's look at the model we're using.