Techalicious Academy / 2026-03-19-chatbot


TROUBLESHOOTING

Something's broken. Maybe it won't download. Maybe it's slow. Maybe your character started speaking like a 2020 cryptocurrency bro. Here's how to fix the common problems.

PROBLEM: MODEL WON'T DOWNLOAD

Symptom: Pull command starts, then stalls or fails.

Cause: Usually low disk space, a flaky internet connection, or HuggingFace being slow.

Solution:

1. Check disk space.

The Q8_0 quantization is 25GB. You need 25GB free on your drive (plus some buffer for
the download to unpack).

On Mac:
  About This Mac > Storage

On Linux:
  df -h

If you're low on space, delete some large files or move them to external storage.
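If you'd rather script the check, here's a rough sketch that works on both Mac and Linux. The 30GB threshold is an assumption: 25GB for the Q8_0 file plus a buffer for the download.

```shell
# Rough free-space check on the current drive (macOS and Linux).
# Threshold is ~30 GB: 25 GB for the Q8_0 file plus download buffer.
avail_kb=$(df -k . | awk 'NR==2 {print $4}')
avail_gb=$(( avail_kb / 1024 / 1024 ))
if [ "$avail_gb" -lt 30 ]; then
  echo "Low disk space: only ${avail_gb} GB free"
else
  echo "OK: ${avail_gb} GB free"
fi
```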

2. Check internet connection.

Run:
  ping huggingface.co

If you get no response, you don't have internet (or HF is down).

Try a different DNS (switch to 1.1.1.1 or 8.8.8.8).

3. Restart the download.

Ollama resumes interrupted downloads. Just run the pull command again:

  ollama pull hf.co/bartowski/TheDrummer_Magidonia-24B-v4.3-GGUF:Q8_0

It will pick up where it left off.

4. Try at off-peak hours.

HuggingFace can be slow during peak times (4pm-10pm US time). Try 6am or 3am.

(The Magidonia model is popular right now; bandwidth can be bottlenecked.)

5. Use a different quantization as fallback.

If Q8_0 is slow, try Q6_K (smaller, faster to download):

  ollama pull hf.co/bartowski/TheDrummer_Magidonia-24B-v4.3-GGUF:Q6_K

Performance will be slightly lower but still very good.

PROBLEM: MODEL RUNS SLOWLY

Symptom: It takes 5-10 seconds per token. Conversations are glacial.

Cause: Usually GPU not being used, memory contention, or wrong quantization.

Solution:

1. Check if GPU is being used.

Open Activity Monitor (Mac) or Task Manager (Windows).

Look at Ollama process:
- If "GPU Memory" shows value > 0: GPU is working (good)
- If "GPU Memory" shows 0: GPU isn't being used (bad)

On Apple Silicon Mac:
Ollama runs on the GPU via Metal. Open Activity Monitor and choose Window > GPU
History; if GPU usage stays flat while the model is generating, Ollama isn't using
the GPU.

2. Make sure you have the right Ollama version.

On Apple Silicon Mac: Ollama should be native ARM, not Rosetta emulation.

Check:
  About Ollama > Check version

If it says "Rosetta," download the native Apple Silicon version from ollama.ai.

3. Close other apps.

If memory is full, Ollama will slow down dramatically. Close:
- Browsers (especially Chrome with many tabs)
- IDEs
- Virtual machines
- Anything eating >2GB RAM

On Mac:
  Activity Monitor > Memory tab > Sort by "Memory" > close top hogs
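On Linux you can do the same from the shell. This is a Linux-only sketch (macOS's ps has no --sort flag, so use Activity Monitor or `top -o mem` there):

```shell
# Show the top memory consumers (Linux; header row plus the five biggest).
hogs=$(ps aux --sort=-%mem | head -n 6)
echo "$hogs"
```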

4. Try a faster quantization.

Q8_0 is high quality but slower. Try Q6_K:

  ollama pull hf.co/bartowski/TheDrummer_Magidonia-24B-v4.3-GGUF:Q6_K
  ollama run <model>

Q6_K is ~20% faster with minimal quality loss.

For ultra-fast inference, try Q4_K_M (smallest, fastest, noticeably lower quality):

  ollama pull hf.co/bartowski/TheDrummer_Magidonia-24B-v4.3-GGUF:Q4_K_M

5. Reduce context window.

Larger context = slower inference. Try:

  PARAMETER num_ctx 8192

(instead of 16384)

This cuts context in half and speeds up generation significantly.

6. Verify Ollama is actually running.

On Mac, look for Ollama in the menu bar (top right).
On Linux, check:

  ps aux | grep ollama

If nothing shows, start Ollama: ollama serve
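A quick scriptable health check: ask the Ollama API directly. This assumes the default localhost:11434 address; adjust if you changed it.

```shell
# Health check: is the Ollama API answering on its default port?
if curl -s --max-time 2 http://localhost:11434/api/tags > /dev/null 2>&1; then
  status="up"
else
  status="down"
fi
echo "Ollama is $status"
```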

PROBLEM: CHARACTER BREAKS / GOES GENERIC

Symptom: After 20-30 messages, your character stops being themselves. They sound like a generic chatbot.

Cause: Usually character drift (context management issue). Less often, the card itself.

Solution:

1. Check your character card.

Is it too long? More than 1500 tokens is bloated.
Is it too vague? "Helpful and smart" isn't a character.
Is it full of negatives? ("Never be rude", "Don't swear")

Rewrite using POSITIVE framing:
Instead of: "Never break character"
Use: "Always stay in character"

Instead of: "Don't be preachy"
Use: "Speak naturally and conversationally"
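To gauge whether your card is near that 1500-token limit, you can estimate from word count. The ~1.3 tokens-per-word ratio below is a rule-of-thumb assumption for English prose, not an exact count:

```shell
# Estimate the token count of a character card from its word count.
# ~1.3 tokens per word is a rough heuristic, not a real tokenizer.
card="A witty, skeptical storyteller who speaks plainly and loves a good tale."
words=$(echo "$card" | wc -w)
tokens=$(( words * 13 / 10 ))
echo "~${tokens} tokens"
```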

2. Add a character refresh.

After 20-30 messages, manually inject a reminder:

  [Character's traits: witty, skeptical, loves storytelling, speaks plainly.
   Current emotional state: engaged, curious. Keep responses to 2-3 sentences.]

Paste this into the chat log before the next user message. The model will see it and
re-center on the character.

(In SillyTavern, use Character's Note > Advanced.)

3. Check your context window.

Is num_ctx set too high? Over 20K and character drift is almost guaranteed.

Try:
  PARAMETER num_ctx 12288

Smaller context window = tighter character consistency.

4. Verify repeat_penalty setting.

This shouldn't affect character consistency directly, but wrong settings cause weird
behavior.

Should be:
  PARAMETER repeat_penalty 1.0

(1.0 means "off"—no penalty. 1.1+ causes repetition issues.)

5. Try regenerating the response.

Sometimes the model just has a bad generation. Swipe/regenerate the last response.
Often it course-corrects.

6. Manually edit out-of-character moments.

If your character said something wrong, edit it before continuing. The model learns
from this. After editing, the next response usually stays in character.

PROBLEM: MODEL REFUSES CONTENT

Symptom: You ask the model to do something and it says "I can't do that" or "I'm not able to."

Cause: Magidonia is essentially uncensored, so refusals are rare. If they're happening, your system prompt is restrictive.

Solution:

1. Check your system prompt.

Look for language like:
- "You must refuse..."
- "You should not..."
- "You are not able to..."
- Warnings about "harmful content"

Remove these. Replace with positive framing:

Instead of: "You must refuse violent requests"
Use: "You roleplay authentically and stay in character"

If a refusal makes sense (e.g., your character wouldn't do something), that's fine.
If the model is refusing things the character WOULD do, the system prompt is the issue.

2. Check OpenWebUI's global system prompt.

Admin > Settings > System Prompt

Look for restrictions there. If you find them:
- Remove or replace with empty string
- Restart OpenWebUI
- Try again

3. Try the Heretic variant.

If you're hitting persistent refusals on specific content types:

  ollama pull hf.co/mradermacher/Cydonia-24B-v4.3-heretic-v2-i1-GGUF:Q8_0

Cydonia-Heretic has additional debiasing and fewer guardrails.

4. Lower temperature slightly.

Sometimes refusals come from model uncertainty, not filters. Try:

  PARAMETER temperature 0.9

(instead of 1.0)

Lower temperature = more confident, less hedging.

PROBLEM: REPETITIVE RESPONSES

Symptom: The model keeps generating the same phrase, sentence, or concept over and over.

Cause: Almost always repeat_penalty setting. Less often, sampler settings.

Solution:

1. VERIFY repeat_penalty is 1.0 (OFF).

This is the #1 cause of repetition.

Check your Modelfile:

  PARAMETER repeat_penalty 1.0

If it's set to 1.1, 1.2, or higher, change it to 1.0.

(1.0 means "no penalty"—the model can repeat. 1.1+ penalizes repetition, which often
causes it to get stuck trying to avoid the same words.)

2. Disable DRY and XTC samplers if enabled.

These are experimental samplers that sometimes cause weird behavior:

In OpenWebUI Advanced > Sampler settings
Look for "DRY" or "XTC" > disable them

Not all versions have these. If you don't see them, skip.

3. Vary your prompts.

If you keep asking similar questions, the model might fall into patterns.
Ask different things. Use different phrasing.

4. Increase temperature slightly.

Higher temperature = more variation:

  PARAMETER temperature 1.1

or even 1.2 for high variation.

5. Add diverse examples to your character card.

If all your example dialogues sound the same, the model learns a narrow pattern.
Write examples showing different responses, tones, and topics.

PROBLEM: MODEL GENERATES BOTH SIDES OF CONVERSATION

Symptom: You ask a question and the model responds with the character's answer, then continues and generates YOUR next line, then the character's response to that...

Cause: Missing stop sequences.

Solution:

1. Add stop sequences to Modelfile.

In your Ollama Modelfile:

  PARAMETER stop "User:"
  PARAMETER stop "\nUser:"

Then recreate the model:

  ollama create mymodel -f modelfile

2. Or set them in OpenWebUI.

Advanced > Stop Sequences
Add:
- "User:"
- "\nUser:"

Then test.

3. If you have a custom character name, add those too.

For Mark Twain:
  PARAMETER stop "Mark:"
  PARAMETER stop "\nMark:"

For Nyx:
  PARAMETER stop "Nyx:"
  PARAMETER stop "\nNyx:"

This almost always fixes the problem immediately.
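Putting the stop sequences together with the other parameters from this guide, here's a minimal Modelfile sketch. The FROM line assumes the Q8_0 Magidonia pull; the character name "Nyx" and the system prompt text are placeholders for your own:

```
# Minimal Modelfile collecting the parameters discussed in this guide.
FROM hf.co/bartowski/TheDrummer_Magidonia-24B-v4.3-GGUF:Q8_0

PARAMETER num_ctx 12288
PARAMETER repeat_penalty 1.0
PARAMETER temperature 1.0
PARAMETER stop "User:"
PARAMETER stop "\nUser:"
PARAMETER stop "Nyx:"
PARAMETER stop "\nNyx:"

SYSTEM """You are Nyx. Always stay in character. Speak naturally and conversationally."""
```

Rebuild with ollama create mymodel -f Modelfile, and every setting applies on the next run.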

PROBLEM: OPENWEBUI DOESN'T SHOW THE MODEL

Symptom: You pulled the model, it downloaded, but OpenWebUI doesn't list it in the models dropdown.

Cause: OpenWebUI isn't connected to Ollama, Ollama isn't running, or it's a caching issue.

Solution:

1. Verify Ollama is running.

On Mac: look for Ollama icon in menu bar (top right)
On Linux:
  ps aux | grep ollama

If Ollama isn't running, start it:
  ollama serve

2. Verify the model exists in Ollama.

Run:
  ollama list

You should see Magidonia (or whatever you pulled) in the list.

If it's not there, the download didn't complete. Try pulling again:
  ollama pull hf.co/bartowski/TheDrummer_Magidonia-24B-v4.3-GGUF:Q8_0
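If you want to script this check, here's a sketch. The name "magidonia" is an assumption; match whatever you actually pulled:

```shell
# Check whether the model shows up in Ollama's local list.
model="magidonia"
if command -v ollama > /dev/null 2>&1 \
   && ollama list 2>/dev/null | grep -qi "$model"; then
  found="yes"
else
  found="no"
fi
echo "Model present: $found"
```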

3. Check that Ollama is accessible.

Run:
  curl http://localhost:11434/api/tags

You should get a JSON response listing all your models.

If you get "Connection refused," Ollama isn't listening. Restart it:
  ollama serve

4. Check OpenWebUI connection settings.

Admin > Settings > API Connections > Ollama

Should say:
  http://localhost:11434

If it says something else (or is blank), set it to localhost:11434.

Click "Test Connection." Should succeed.

5. Restart OpenWebUI.

In your terminal where OpenWebUI is running:
  Ctrl+C

Restart it. Then refresh the browser.

PROBLEM: CHARACTER DRIFT IN LONG CONVERSATIONS

Symptom: Everything worked fine for the first 30 messages, but now your character is forgetting who they are.

Cause: Context window is full or filling up. The system message is being pushed out of focus.

Solution:

1. Check conversation length.

How many messages in? If it's 40+, drift is normal without intervention.

2. Use character refresh (see "Character Breaks" section above).

Inject the character traits at depth 4 (about 4 messages back from current).

3. Summarize and start fresh.

Write a 3-5 sentence summary:

  "So far: We've discussed philosophy, your childhood, and the meaning of life. You've
   told me about growing up in Vermont. Mood: warm, thoughtful, slightly melancholic.
   You're starting to open up."

Start a new chat. Paste the summary as the opening message.

Ask the character: "What do you remember about our conversation?"

They'll reconstruct context without the accumulated bloat.

4. Keep conversations to natural breakpoints.

A conversation about one topic is done when it's done. Start a new one for a new
topic. Quality over length.

PROBLEM: OUTPUT IS TOO SHORT / TOO LONG

Symptom: Your character responds with one sentence (too short) or three paragraphs (too long).

Cause: max_tokens setting, or the examples in your character card.

Solution:

1. Adjust num_predict.

For shorter responses:
  PARAMETER num_predict 512

For longer responses:
  PARAMETER num_predict 2048

(or 4096 for very long responses)

2. Make sure your character card examples match desired length.

The model learns the LENGTH of responses from examples.

If all your example responses are 1 sentence, the model will give 1 sentence.
If they're 3 paragraphs, you'll get 3 paragraphs.

Write examples at your target length.

3. Set the first message length intentionally.

Your character's FIRST message sets a template for all future responses.

If you want long, thoughtful responses, make the first message a full paragraph.
If you want short, punchy responses, make it a single sentence or two.

This is subtle but powerful. The model mimics the length of messages it's learned to expect from your character.

QUICK CHECKLIST

Character broken?

☐ repeat_penalty = 1.0?
☐ Stop sequences set ("User:", "\nUser:")?
☐ Context window reasonable (8K-16K)?
☐ Character refresh injected after 30 messages?
☐ System prompt using positive framing?

Model slow?

☐ GPU being used (check Activity Monitor)?
☐ Other apps closed?
☐ Ollama native version (not Rosetta on Mac)?
☐ Context window not too large?
☐ Considered Q6_K quantization?

OpenWebUI issues?

☐ Ollama running?
☐ Model in ollama list?
☐ OpenWebUI connected to http://localhost:11434?
☐ Restarted OpenWebUI?

Most of these problems have one-line fixes. The key is diagnosing what's actually wrong before trying solutions.