CONTEXT WINDOWS - CRITICAL SETUP
This chapter is not optional reading. If you skip this, Claude Code will barely function. Ollama has a default that makes coding tools nearly useless, and you need to fix it.
The Problem Nobody Tells You
Here's what happens out of the box:
qwen3-coder-next supports: 256,000 tokens
Ollama default: 4,096 tokens
Read that again. The model can handle 256K tokens. Ollama gives it 4K.
That's 1.6% of the model's capability. It's like buying a sports car and driving it in first gear.
With 4K context, Claude Code can barely function:
- System prompt eats ~1,000 tokens
- Your CLAUDE.md eats ~500 tokens
- One medium file eats the rest
- No room for conversation history
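The arithmetic above can be sketched as a quick budget check. The token counts are the chapter's rough estimates, and the file size is an assumed figure for illustration:

```python
# Rough token budget under Ollama's 4K default context.
# All numbers are ballpark estimates from this chapter, not measurements.
CONTEXT = 4096

budget = {
    "system prompt": 1000,
    "CLAUDE.md": 500,
    "one medium file": 2000,   # assumed size, for illustration only
}

used = sum(budget.values())
remaining = CONTEXT - used
print(f"used {used} of {CONTEXT} tokens, {remaining} left for conversation")
```

A few hundred tokens is one or two exchanges; after that, truncation begins.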
You'll see symptoms like:
- Model "forgets" what you just said
- Responses that ignore your files
- Confused, incomplete answers
- "I don't have access to that file" (it was truncated)
Why This Happens
Ollama uses a conservative default (4K) because:
- Works on any hardware
- Uses minimal RAM
- Prevents out-of-memory crashes
But for coding tools, 4K is worthless. You MUST increase it.
The Fix: Create a Custom Model
This is not optional. Do this before using Claude Code.
Step 1: Create a Modelfile
nano ~/Modelfile-qwen-claude
Paste this content:
FROM qwen3-coder-next:latest
PARAMETER num_ctx 32768
Save and exit.
Step 2: Create the custom model
ollama create qwen3-coder-32k -f ~/Modelfile-qwen-claude
This creates a new model called "qwen3-coder-32k" with 32K context.
Step 3: Update your settings.json
Change your model name:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://10.0.0.79:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_MODEL": "qwen3-coder-32k",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen3-coder-32k",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "API_TIMEOUT_MS": "600000"
  }
}
Now you're using the model with proper context.
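A quick sanity check can confirm the settings point at the custom model rather than the 4K base model. This sketch parses the example config inline; in practice you would load your actual settings.json from wherever Claude Code stores it:

```python
import json

# Minimal sanity check: does the config point at the custom 32K model?
# The JSON here mirrors the example above; load your real settings.json
# in practice.
settings = json.loads("""
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://10.0.0.79:11434",
    "ANTHROPIC_MODEL": "qwen3-coder-32k",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen3-coder-32k"
  }
}
""")

env = settings["env"]
assert env["ANTHROPIC_MODEL"] == "qwen3-coder-32k", "still on the 4K base model?"
print("settings point at", env["ANTHROPIC_MODEL"])
```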
Why Claude Code Can't Fix This
You might wonder: can't Claude Code just ask for more context?
No. Here's why:
Claude Code talks to Ollama through the Anthropic-compatible API endpoint (/v1/messages). That endpoint uses Anthropic's API format, and Anthropic's API has no "num_ctx" parameter - context size is handled server-side.
Claude Code                        Ollama
     |                                |
     |  POST /v1/messages             |
     |  {                             |
     |    "model": "...",             |
     |    "messages": [...]           |
     |  }                             |
     |------------------------------->|
     |                                |
     |      (no num_ctx field!)       |
     |                                |
Ollama receives the request, but there's nowhere for Claude Code to specify context size. So Ollama uses whatever default the model has.
If you use the base model (qwen3-coder-next:latest), that default is 4K. If you use your custom model (qwen3-coder-32k), that default is 32K.
The Modelfile bakes the context size INTO the model definition. It's the only reliable way to control this.
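The difference is visible if you compare the two request bodies side by side. Ollama's native /api/chat endpoint accepts a per-request options.num_ctx override, but Claude Code only speaks the Anthropic format, which has no such field:

```python
# What Claude Code sends to /v1/messages (Anthropic format) -
# there is simply no field where a context size could go:
anthropic_body = {
    "model": "qwen3-coder-32k",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "hello"}],
}

# Ollama's NATIVE /api/chat endpoint does accept an override,
# but Claude Code never talks to that endpoint:
native_body = {
    "model": "qwen3-coder-next:latest",
    "messages": [{"role": "user", "content": "hello"}],
    "options": {"num_ctx": 32768},
}

assert "num_ctx" not in str(anthropic_body)  # nowhere to put it
print("Anthropic-format request has no context field")
```

Since the request can't carry the setting, the model definition has to.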
How Much Context Do You Need?
More context = more RAM. Here's the trade-off:
Context Size    Additional RAM    Good For
-------------------------------------------------------
8K tokens       ~2GB              Light use, small files
16K tokens      ~4GB              Normal coding sessions
32K tokens      ~8GB              Multiple files, longer chats
64K tokens      ~16GB             Large codebases
128K tokens     ~32GB             Massive context needs
This is ON TOP of the model weights (~20GB for qwen3-coder-next).
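The table is roughly linear (about 2GB per 8K tokens), so other sizes can be estimated with a one-liner. This is an extrapolation of the chapter's figures, not a precise KV-cache formula - actual usage depends on the model's architecture and cache precision:

```python
def extra_ram_gb(ctx_tokens: int) -> float:
    """Rough KV-cache estimate extrapolated from the table above:
    ~2 GB per 8,192 tokens. Real usage varies by model and
    cache quantization."""
    return ctx_tokens / 8192 * 2.0

for ctx in (8192, 16384, 32768, 65536, 131072):
    print(f"{ctx:>6} tokens -> ~{extra_ram_gb(ctx):.0f} GB on top of weights")
```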
Practical recommendations:
16GB unified memory: Use 8K context
32GB unified memory: Use 16K-32K context
64GB unified memory: Use 32K-64K context
96GB unified memory: Use 64K+ context
Creating Models for Different Scenarios
You might want multiple models for different situations:
Light/fast model (8K):
FROM qwen3-coder-next:latest
PARAMETER num_ctx 8192
ollama create qwen3-coder-8k -f Modelfile-8k
Standard model (32K):
FROM qwen3-coder-next:latest
PARAMETER num_ctx 32768
ollama create qwen3-coder-32k -f Modelfile-32k
Heavy model (64K):
FROM qwen3-coder-next:latest
PARAMETER num_ctx 65536
ollama create qwen3-coder-64k -f Modelfile-64k
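The three Modelfiles above differ only in one number, so generating them with a short helper avoids copy-paste errors. This sketch writes the files and prints the `ollama create` commands for you to review and run, rather than executing them itself:

```python
def modelfile(base: str, num_ctx: int) -> str:
    """Build the two-line Modelfile used throughout this chapter."""
    return f"FROM {base}\nPARAMETER num_ctx {num_ctx}\n"

BASE = "qwen3-coder-next:latest"

for name, ctx in [("qwen3-coder-8k", 8192),
                  ("qwen3-coder-32k", 32768),
                  ("qwen3-coder-64k", 65536)]:
    path = f"Modelfile-{name.rsplit('-', 1)[-1]}"   # Modelfile-8k, etc.
    with open(path, "w") as f:
        f.write(modelfile(BASE, ctx))
    print(f"ollama create {name} -f {path}")
```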
Switch between them in settings.json as needed.
Verifying Your Context Size
After creating your model, verify it worked:
ollama run qwen3-coder-32k
Then in another terminal:
ollama ps
You should see:
NAME               SIZE     PROCESSOR    CONTEXT
qwen3-coder-32k    20 GB    100% GPU     32768
That CONTEXT column confirms your setting took effect.
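If you want to check this in a script, the CONTEXT column is the last field of each row. This sketch parses the sample output above; a real script would capture the text with subprocess.run(["ollama", "ps"], ...) instead:

```python
# Sample `ollama ps` output from above; in a real script, capture it
# with subprocess instead of hard-coding it.
sample = """NAME               SIZE     PROCESSOR    CONTEXT
qwen3-coder-32k    20 GB    100% GPU     32768"""

for line in sample.splitlines()[1:]:        # skip the header row
    fields = line.split()
    name, context = fields[0], int(fields[-1])
    print(name, "context =", context)
    assert context >= 32768, f"{name} is still on a small context"
```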
What Happens When Context Fills Up?
Even with 32K context, long sessions fill up. When that happens:
1. TRUNCATION - Old messages get dropped
The model stops seeing early conversation. It "forgets" what you
discussed 30 minutes ago. Important context disappears.
2. SYMPTOMS
- Repeating questions you already answered
- Forgetting file contents it read earlier
- Losing track of the task
- Contradicting earlier statements
3. CLAUDE CODE'S RESPONSE
Claude Code tries to manage this intelligently:
- It tracks token usage internally
- It may summarize old content
- It keeps recent/relevant info
But there's no magic. Eventually things get lost.
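The failure mode can be sketched with the simplest possible policy: drop the oldest messages until the rest fit. Claude Code's real strategy is smarter (summarization, relevance tracking), but the outcome is the same - early context disappears first. The ~4-characters-per-token estimate is a crude stand-in for a real tokenizer:

```python
def fit_to_context(messages, budget_tokens, count=lambda m: len(m) // 4):
    """Naive truncation: drop oldest messages until the rest fit.
    `count` is a crude ~4-chars-per-token estimate, not a tokenizer."""
    kept = list(messages)
    while kept and sum(count(m) for m in kept) > budget_tokens:
        kept.pop(0)                      # the oldest message is lost first
    return kept

history = ["long system prompt " * 50, "early discussion " * 50, "recent question"]
survivors = fit_to_context(history, budget_tokens=300)
print(len(survivors), "of", len(history), "messages survive")
```

Note which message got dropped: the one at the top - exactly where your standing instructions live.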
Checking Context Usage
Inside Claude Code:
/context
This shows:
- Current usage percentage
- What's loaded (CLAUDE.md, files, etc.)
- Warnings if getting full
Compacting Context
When context gets full, compress it:
/compact
This summarizes the conversation and clears old messages. You lose detail but free up space for new work.
Do this PROACTIVELY:
- After completing a major task
- When /context shows 70%+
- Before starting a new topic
Don't wait until things break.
Auto-Compaction
You can set Claude Code to auto-compact at a threshold.
In settings.json:
"env": {
"CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "75",
...other settings...
}
This triggers automatic compaction at 75% usage.
The Paradox: Bigger Isn't Always Better
Research shows that instruction-following actually DEGRADES with very large contexts. The model has more to pay attention to, so important information gets diluted.
With a 128K context:
- Your CLAUDE.md at the top can sit up to 128K tokens away from the current message
- The model's attention is spread thin
- Character drift and instruction-forgetting increase
Sweet spot for most coding: 16K-32K
Only go higher if you genuinely need to load massive files. For normal coding sessions, 32K is plenty.
Starting Fresh
Sometimes the best solution is a new session:
Ctrl+D to exit
claude to restart
This gives you:
- Full context available
- Clean slate
- No accumulated confusion
New topic? New session. Don't try to do everything in one conversation.
Summary
THE CRITICAL POINT:
Ollama defaults to 4K context. This is way too small.
You MUST create a custom Modelfile. This is not optional.
STEPS:
1. Create Modelfile with num_ctx parameter
2. Run: ollama create modelname -f Modelfile
3. Update settings.json with new model name
4. Verify with: ollama ps
RECOMMENDED SIZES:
16GB RAM -> 8K context
32GB RAM -> 16-32K context
64GB RAM -> 32-64K context
MANAGEMENT:
/context       Check usage
/compact       Compress when full
New session    When switching topics
WHY MODELFILE IS REQUIRED:
Claude Code uses Anthropic API format
Anthropic API has no num_ctx parameter
Only way to set context is in the model definition