QUANTIZATION GUIDE

Before we download Magidonia, we need to talk about quantization. This is where hardware meets model, and where you make a choice that affects everything downstream.

WHAT IS QUANTIZATION?

In simple terms: quantization is compression. Your model has billions of numbers (weights) that the neural network uses to think. These weights start as 16-bit floating point numbers (FP16). That's 2 bytes per weight.

Quantization shrinks those numbers down. You can store them as 4-bit, 6-bit, 8-bit, whatever. Smaller representation = smaller file = less RAM needed = faster loading.
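If you want to sanity-check the file sizes you'll see below, the arithmetic is simple. Here's a minimal Python sketch, using approximate bits-per-weight figures for the GGUF formats. Each block of weights stores a small scale factor alongside the weights themselves, which is why Q8_0 works out to about 8.5 bits per weight rather than 8:

    # Back-of-the-envelope sizes for a 24B-parameter model.
    # Bits-per-weight values are approximate: K-quants mix formats
    # across tensors, so real files land slightly off these numbers.
    PARAMS = 24e9

    for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.56), ("Q4_K_M", 4.85)]:
        gb = PARAMS * bpw / 8 / 1e9
        print(f"{name:8s} ~{gb:.1f} GB")

Run it and the numbers land within a few percent of the real file sizes in the table further down.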

The tradeoff: you lose some precision. The question is how much. And for Magidonia, the answer is: surprisingly little if you quantize right.
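To see what "losing precision" actually means, here's a toy round-to-nearest quantizer. This is not the K-quant math llama.cpp uses (that's block-wise with cleverer scale selection), just a sketch of the core effect: fewer bits means a coarser grid, which means bigger error.

    import random

    def quantize_roundtrip(weights, bits):
        # Snap each weight to a signed integer grid, then scale back.
        levels = 2 ** (bits - 1) - 1           # 7 for 4-bit, 127 for 8-bit
        scale = max(abs(w) for w in weights) / levels
        return [round(w / scale) * scale for w in weights]

    random.seed(0)
    block = [random.gauss(0, 0.02) for _ in range(32)]   # one 32-weight block

    for bits in (4, 6, 8):
        restored = quantize_roundtrip(block, bits)
        worst = max(abs(a - b) for a, b in zip(block, restored))
        print(f"{bits}-bit: worst-case error {worst:.6f}")

Every extra bit roughly halves the worst-case error. That's the whole game.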

THE SPECTRUM

For Magidonia-24B, here are the main quantization options and what they cost you:

Q4_K_M    ~14.3 GB   ~90% quality vs FP16    - Unnecessary compromise at 96GB
Q5_K_M    ~16.8 GB   ~94% quality            - Still leaving quality on the table
Q6_K      ~19.3 GB   ~98% quality            - Excellent second choice
Q8_0      ~25.0 GB   ~99.5% quality          - YOUR PICK. Near-lossless.

Those percentages aren't pulled from thin air, but they are rough figures: they come from community testing that looks at output coherence, consistency, and creative quality across thousands of generations.

At 90% quality, users notice the difference. The model makes weird jumps, forgets details, invents details that contradict earlier statements. At 98%? You'd need a blind test to spot it. At 99.5%? You're splitting hairs.

WHY Q8_0 FOR YOUR 96GB MAC

You have a 96GB Mac. The Q8_0 version is 25GB. That leaves you 71GB of headroom.

Why do you need headroom? The model itself takes 25GB of unified memory. But when you're generating text, the system builds something called the KV cache: extra memory that holds the attention keys and values for every token in your context. For a 24B model with a long context window, that cache can be substantial.

With 25GB model + 71GB headroom, you can comfortably support 32,000+ token context windows without swapping to disk. Swap is death for latency. Everything slows down.
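How substantial? Here's a rough estimate, assuming illustrative hyperparameters for a 24B-class model (40 layers, 8 KV heads, head dimension 128, FP16 cache). These are stand-ins, not Magidonia's actual config:

    # Rough KV-cache size: K and V tensors per layer, each holding
    # [kv_heads x context x head_dim] values at 2 bytes (FP16).
    # Hyperparameters are illustrative, not Magidonia's real config.
    layers, kv_heads, head_dim = 40, 8, 128
    context = 32_768
    kv_bytes = 2 * layers * kv_heads * head_dim * 2 * context
    print(f"KV cache at {context:,} tokens: ~{kv_bytes / 1e9:.1f} GB")

Around 5GB at 32K tokens under those assumptions. Trivial against 71GB of headroom, painful against 7.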

If you went Q6_K (19.3GB), you'd have 76.7GB headroom instead. Slightly more. Not worth the quality loss.

If you went Q4_K_M (14.3GB), you'd have 81.7GB headroom. But your model would be noticeably worse at coherence. Swapping is bad, but incoherent output is worse.

At your hardware level, Q8_0 is the sweet spot. Full stop.

ABOUT BARTOWSKI'S GGUF QUANTIZATIONS

The bartowski name appears a lot in the quantization world. bartowski quantizes popular models using iMatrix, a technique recommended by both TheDrummer (Magidonia's creator) and the official Ollama documentation.

What's iMatrix? It's an importance matrix. When you quantize, you can't just shrink all weights equally. Some weights matter more than others. iMatrix figures out which ones matter most, preserves them with higher precision during quantization, and quantizes the less important weights more aggressively.

The result: you lose less quality per GB of size reduction than you would with standard quantization.

Think of it like JPEG compression. JPEG doesn't treat every pixel equally. It's smart about what to blur and what to keep sharp. That's iMatrix.
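Here's a toy version of the idea. This is not llama.cpp's actual iMatrix implementation, just a sketch of the principle: instead of picking the quantization scale that merely fits the largest weight, search for the scale that minimizes error weighted by how much each weight matters.

    import random

    def rtn_4bit(weights, scale):
        # Round-to-nearest onto a signed 4-bit grid [-8, 7].
        return [max(-8, min(7, round(w / scale))) * scale for w in weights]

    def weighted_error(orig, deq, importance):
        return sum(i * (a - b) ** 2 for a, b, i in zip(orig, deq, importance))

    random.seed(1)
    w = [random.gauss(0, 0.02) for _ in range(32)]
    imp = [random.random() ** 2 for _ in w]    # stand-in for activation statistics

    naive = max(abs(x) for x in w) / 7         # scale that just fits the biggest weight
    best = min((naive * k / 100 for k in range(50, 151)),
               key=lambda s: weighted_error(w, rtn_4bit(w, s), imp))

    print("naive scale error   :", weighted_error(w, rtn_4bit(w, naive), imp))
    print("searched scale error:", weighted_error(w, rtn_4bit(w, best), imp))

The searched scale always does at least as well as the naive one, since the search includes it. The real iMatrix builds its importance scores from activations over a calibration dataset, not random numbers.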

The bartowski versions are explicitly recommended on both TheDrummer's HuggingFace page and on the official Ollama documentation as the standard to use. If you're downloading Magidonia, you want the bartowski GGUF. Which you do. That's what we're pulling.

WHEN TO GO LOWER

If you were running multiple models simultaneously, Q6_K becomes more attractive. You could load one at Q8_0 and one at Q6_K (25GB + 19.3GB, about 44GB total) and split the 96GB across both with room to spare.

If you were on 16GB RAM (a more typical laptop), Q4_K_M is the floor for usable quality. You'd be right at the edge, managing context windows carefully. But for roleplay at that hardware level, the extra compression is necessary.

The community consensus, summed up by one user with memory to spare: "On Macs with ample memory, Q8_0 is for when quality matters most and memory isn't a constraint." You have ample memory. Use it.

QUICK REFERENCE TABLE

Quantization Level | File Size | Quality Loss | Use Case
-------------------+-----------+--------------+----------------------------
Q4_K_M             | 14.3 GB   | 10%          | 16GB RAM (minimum)
Q5_K_M             | 16.8 GB   | 6%           | 24GB RAM (portable)
Q6_K               | 19.3 GB   | 2%           | 32GB RAM (good backup)
Q8_0               | 25.0 GB   | 0.5%         | 64GB+ RAM (your choice)

The files differ primarily in weight precision. Q4_K_M uses 4 bits for main weights. Q8_0 uses 8 bits for main weights. More bits = more space = more quality.
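If you'd rather have that table as something you can query, here's a tiny helper that encodes the same recommendations. The thresholds come straight from the table above; nothing here is measured:

    # Pick the best quant for a given amount of RAM, per the table above.
    QUANTS = [("Q8_0", 25.0, 64), ("Q6_K", 19.3, 32),
              ("Q5_K_M", 16.8, 24), ("Q4_K_M", 14.3, 16)]

    def pick_quant(ram_gb):
        for name, size_gb, min_ram in QUANTS:      # best quality first
            if ram_gb >= min_ram:
                return f"{name} ({size_gb} GB file)"
        return "not enough RAM for usable quality"

    for ram in (16, 24, 32, 96):
        print(f"{ram}GB RAM -> {pick_quant(ram)}")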

MOVING FORWARD

For this tutorial, we're using Q8_0. It's the right choice for your hardware. When you download the model in the next section, you'll see the quantization level in the model name. Look for the one that says "Q8_0". That's the one you want.
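If you end up grabbing the file programmatically rather than through a UI, huggingface_hub can do it. The repo and file names below are placeholders, not the real ones; check bartowski's HuggingFace page for the exact spelling:

    from huggingface_hub import hf_hub_download

    # Placeholder names: substitute the actual bartowski repo and
    # Q8_0 filename from HuggingFace before running.
    path = hf_hub_download(
        repo_id="bartowski/EXAMPLE-Magidonia-24B-GGUF",   # hypothetical
        filename="EXAMPLE-Magidonia-24B-Q8_0.gguf",       # hypothetical
    )
    print("Downloaded to:", path)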

Let's download it.