CONTEXT WINDOWS - CRITICAL SETUP
This chapter is not optional reading. If you skip this, Claude Code will barely function. Ollama has a default that makes coding tools nearly useless, and you need to fix it.
The Problem Nobody Tells You
Here's what happens out of the box:
qwen3-coder-next supports: 256,000 tokens
Ollama default: 4,096 tokens
Read that again. The model can handle 256K tokens. Ollama gives it 4K.
That's 1.6% of the model's capability. It's like buying a sports car and driving it in first gear.
With 4K context, Claude Code can barely function:
- System prompt eats ~1,000 tokens
- Your CLAUDE.md eats ~500 tokens
- One medium file eats the rest
- No room for conversation history
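The arithmetic above can be sketched as a quick budget check. The token counts are the chapter's rough estimates, and the file size is an assumed figure for illustration:

```python
# Rough token budget under Ollama's 4K default context.
# All numbers are ballpark estimates from this chapter, not measurements.
CONTEXT = 4096

budget = {
    "system prompt": 1000,
    "CLAUDE.md": 500,
    "one medium file": 2000,   # assumed size, for illustration only
}

used = sum(budget.values())
remaining = CONTEXT - used
print(f"used {used} of {CONTEXT} tokens, {remaining} left for conversation")
```

A few hundred tokens is one or two exchanges; after that, truncation begins.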
You'll see symptoms like:
- Model "forgets" what you just said
- Responses that ignore your files
- Confused, incomplete answers
- "I don't have access to that file" (it was truncated)
Why This Happens
Ollama uses a conservative default (4K) because:
- Works on any hardware
- Uses minimal RAM
- Prevents out-of-memory crashes
But for coding tools, 4K is worthless. You MUST increase it.
The Fix: Create a Custom Model
This is not optional. Do this before using Claude Code.
Step 1: Create a Modelfile
nano ~/Modelfile-qwen-claude
Paste this content:
FROM qwen3-coder-next:latest
PARAMETER num_ctx 32768
Save and exit.
Step 2: Create the custom model
ollama create qwen3-coder-32k -f ~/Modelfile-qwen-claude
This creates a new model called "qwen3-coder-32k" with 32K context.
Step 3: Update your settings.json
Change your model name:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://10.0.0.79:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_MODEL": "qwen3-coder-32k",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen3-coder-32k",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "API_TIMEOUT_MS": "600000"
  }
}
Now you're using the model with proper context.
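A quick sanity check can confirm the settings point at the custom model rather than the 4K base model. This sketch parses the example config inline; in practice you would load your actual settings.json from wherever Claude Code stores it:

```python
import json

# Minimal sanity check: does the config point at the custom 32K model?
# The JSON here mirrors the example above; load your real settings.json
# in practice.
settings = json.loads("""
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://10.0.0.79:11434",
    "ANTHROPIC_MODEL": "qwen3-coder-32k",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen3-coder-32k"
  }
}
""")

env = settings["env"]
assert env["ANTHROPIC_MODEL"] == "qwen3-coder-32k", "still on the 4K base model?"
print("settings point at", env["ANTHROPIC_MODEL"])
```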
Why Claude Code Can't Fix This
You might wonder: can't Claude Code just ask for more context?
No. Here's why:
Claude Code talks to Ollama through the Anthropic-compatible API endpoint (/v1/messages). That endpoint uses Anthropic's API format, and Anthropic's API has no "num_ctx" parameter - context size is handled server-side.
Claude Code                        Ollama
     |                                |
     |  POST /v1/messages             |
     |  {                             |
     |    "model": "...",             |
     |    "messages": [...]           |
     |  }                             |
     |------------------------------->|
     |                                |
     |      (no num_ctx field!)       |
     |                                |
Ollama receives the request, but there's nowhere for Claude Code to specify context size. So Ollama uses whatever default the model has.
If you use the base model (qwen3-coder-next:latest), that default is 4K. If you use your custom model (qwen3-coder-32k), that default is 32K.
The Modelfile bakes the context size INTO the model definition. It's the only reliable way to control this.
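The difference is visible if you compare the two request bodies side by side. Ollama's native /api/chat endpoint accepts a per-request options.num_ctx override, but Claude Code only speaks the Anthropic format, which has no such field:

```python
# What Claude Code sends to /v1/messages (Anthropic format) -
# there is simply no field where a context size could go:
anthropic_body = {
    "model": "qwen3-coder-32k",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "hello"}],
}

# Ollama's NATIVE /api/chat endpoint does accept an override,
# but Claude Code never talks to that endpoint:
native_body = {
    "model": "qwen3-coder-next:latest",
    "messages": [{"role": "user", "content": "hello"}],
    "options": {"num_ctx": 32768},
}

assert "num_ctx" not in str(anthropic_body)  # nowhere to put it
print("Anthropic-format request has no context field")
```

Since the request can't carry the setting, the model definition has to.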
How Much Context Do You Need?
More context = more RAM. Here's the trade-off:
Context Size    Additional RAM    Good For
-------------------------------------------------------
8K tokens       ~2GB              Light use, small files
16K tokens      ~4GB              Normal coding sessions
32K tokens      ~8GB              Multiple files, longer chats
64K tokens      ~16GB             Large codebases
128K tokens     ~32GB             Massive context needs
This is ON TOP of the model weights (~20GB for qwen3-coder-next).
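The table is roughly linear (about 2GB per 8K tokens), so other sizes can be estimated with a one-liner. This is an extrapolation of the chapter's figures, not a precise KV-cache formula - actual usage depends on the model's architecture and cache precision:

```python
def extra_ram_gb(ctx_tokens: int) -> float:
    """Rough KV-cache estimate extrapolated from the table above:
    ~2 GB per 8,192 tokens. Real usage varies by model and
    cache quantization."""
    return ctx_tokens / 8192 * 2.0

for ctx in (8192, 16384, 32768, 65536, 131072):
    print(f"{ctx:>6} tokens -> ~{extra_ram_gb(ctx):.0f} GB on top of weights")
```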
Practical recommendations:
16GB unified memory: Use 8K context
32GB unified memory: Use 16K-32K context
64GB unified memory: Use 32K-64K context
96GB unified memory: Use 64K+ context
Creating Models for Different Scenarios
You might want multiple models for different situations:
Light/fast model (8K):
FROM qwen3-coder-next:latest
PARAMETER num_ctx 8192
ollama create qwen3-coder-8k -f Modelfile-8k
Standard model (32K):
FROM qwen3-coder-next:latest
PARAMETER num_ctx 32768
ollama create qwen3-coder-32k -f Modelfile-32k
Heavy model (64K):
FROM qwen3-coder-next:latest
PARAMETER num_ctx 65536
ollama create qwen3-coder-64k -f Modelfile-64k
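The three Modelfiles above differ only in one number, so generating them with a short helper avoids copy-paste errors. This sketch writes the files and prints the `ollama create` commands for you to review and run, rather than executing them itself:

```python
def modelfile(base: str, num_ctx: int) -> str:
    """Build the two-line Modelfile used throughout this chapter."""
    return f"FROM {base}\nPARAMETER num_ctx {num_ctx}\n"

BASE = "qwen3-coder-next:latest"

for name, ctx in [("qwen3-coder-8k", 8192),
                  ("qwen3-coder-32k", 32768),
                  ("qwen3-coder-64k", 65536)]:
    path = f"Modelfile-{name.rsplit('-', 1)[-1]}"   # Modelfile-8k, etc.
    with open(path, "w") as f:
        f.write(modelfile(BASE, ctx))
    print(f"ollama create {name} -f {path}")
```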
Switch between them in settings.json as needed.
Verifying Your Context Size
After creating your model, verify it worked:
ollama run qwen3-coder-32k
Then in another terminal:
ollama ps
You should see:
NAME               SIZE     PROCESSOR    CONTEXT
qwen3-coder-32k    20 GB    100% GPU     32768
That CONTEXT column confirms your setting took effect.
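If you want to check this in a script, the CONTEXT column is the last field of each row. This sketch parses the sample output above; a real script would capture the text with subprocess.run(["ollama", "ps"], ...) instead:

```python
# Sample `ollama ps` output from above; in a real script, capture it
# with subprocess instead of hard-coding it.
sample = """NAME               SIZE     PROCESSOR    CONTEXT
qwen3-coder-32k    20 GB    100% GPU     32768"""

for line in sample.splitlines()[1:]:        # skip the header row
    fields = line.split()
    name, context = fields[0], int(fields[-1])
    print(name, "context =", context)
    assert context >= 32768, f"{name} is still on a small context"
```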
What Happens When Context Fills Up?
Even with 32K context, long sessions fill up. When that happens:
1. TRUNCATION - Old messages get dropped
The model stops seeing early conversation. It "forgets" what you
discussed 30 minutes ago. Important context disappears.
2. SYMPTOMS
- Repeating questions you already answered
- Forgetting file contents it read earlier
- Losing track of the task
- Contradicting earlier statements
3. CLAUDE CODE'S RESPONSE
Claude Code tries to manage this intelligently:
- It tracks token usage internally
- It may summarize old content
- It keeps recent/relevant info
But there's no magic. Eventually things get lost.
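The failure mode can be sketched with the simplest possible policy: drop the oldest messages until the rest fit. Claude Code's real strategy is smarter (summarization, relevance tracking), but the outcome is the same - early context disappears first. The ~4-characters-per-token estimate is a crude stand-in for a real tokenizer:

```python
def fit_to_context(messages, budget_tokens, count=lambda m: len(m) // 4):
    """Naive truncation: drop oldest messages until the rest fit.
    `count` is a crude ~4-chars-per-token estimate, not a tokenizer."""
    kept = list(messages)
    while kept and sum(count(m) for m in kept) > budget_tokens:
        kept.pop(0)                      # the oldest message is lost first
    return kept

history = ["long system prompt " * 50, "early discussion " * 50, "recent question"]
survivors = fit_to_context(history, budget_tokens=300)
print(len(survivors), "of", len(history), "messages survive")
```

Note which message got dropped: the one at the top - exactly where your standing instructions live.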
Checking Context Usage
Inside Claude Code:
/context
This shows:
- Current usage percentage
- What's loaded (CLAUDE.md, files, etc.)
- Warnings if getting full
Compacting Context
When context gets full, compress it:
/compact
This summarizes the conversation and clears old messages. You lose detail but free up space for new work.
Do this PROACTIVELY:
- After completing a major task
- When /context shows 70%+
- Before starting a new topic
Don't wait until things break.
Auto-Compaction
You can set Claude Code to auto-compact at a threshold.
In settings.json:
"env": {
"CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "75",
...other settings...
}
This triggers automatic compaction at 75% usage.
The Paradox: Bigger Isn't Always Better
Research shows that instruction-following actually DEGRADES with very large contexts. The model has more to pay attention to, so important information gets diluted.
With a 128K context:
- Your CLAUDE.md at the top can sit up to 128K tokens away from the current message
- The model's attention is spread thin
- Character drift and instruction-forgetting increase
Sweet spot for most coding: 16K-32K
Only go higher if you genuinely need to load massive files. For normal coding sessions, 32K is plenty.
Starting Fresh
Sometimes the best solution is a new session:
Ctrl+D to exit
claude to restart
This gives you:
- Full context available
- Clean slate
- No accumulated confusion
New topic? New session. Don't try to do everything in one conversation.
Summary
THE CRITICAL POINT:
Ollama defaults to 4K context. This is way too small.
You MUST create a custom Modelfile. This is not optional.
STEPS:
1. Create Modelfile with num_ctx parameter
2. Run: ollama create modelname -f Modelfile
3. Update settings.json with new model name
4. Verify with: ollama ps
RECOMMENDED SIZES:
16GB RAM -> 8K context
32GB RAM -> 16-32K context
64GB RAM -> 32-64K context
MANAGEMENT:
/context       Check usage
/compact       Compress when full
New session    When switching topics
WHY MODELFILE IS REQUIRED:
Claude Code uses Anthropic API format
Anthropic API has no num_ctx parameter
Only way to set context is in the model definition