Techalicious Academy / 2026-02-11-openwebui

PARAMETERS EXPLAINED

When you create a custom model in the Workspace, there's an Advanced Parameters section with a bunch of numbers you can tweak. These control HOW the model generates text, not what it knows.

Think of the system prompt as the model's job description. Parameters are its work style. Same person, same job, but do they play it safe or take creative risks? Do they ramble or keep it tight?

Where to Find Parameters

Two places:

  1. Workspace > Models > Create/Edit > Advanced Parameters. These apply to every conversation using that custom model.
  2. In any chat, click the settings gear icon near the model dropdown. These apply to the current conversation only.

Option 2 is great for experimenting. Change a parameter, send a message, see the difference. Once you find settings you like, bake them into a custom model.

Temperature

This is the big one. Temperature controls randomness in the model's word choices.

When the model generates each word, it calculates probabilities for every possible next word. Temperature determines how much those probabilities matter.

Low temperature (0.1 to 0.3):
The model almost always picks the highest-probability word. Output
is focused, predictable, and consistent. Ask the same question
twice, get nearly the same answer.

Medium temperature (0.5 to 0.7):
A balance. The model usually picks likely words but occasionally
surprises you. Good default range for most tasks.

High temperature (0.8 to 1.5):
The model is more willing to pick less obvious words. Output
becomes more creative, varied, and sometimes weird. Ask the same
question twice, get noticeably different answers.

Very high (above 1.5):
Output starts to become incoherent. The model picks wildly
improbable words and the text falls apart. Don't go here unless
you're experimenting.

The default is usually 0.7 or 0.8.

When to adjust:

Writing code?            Drop to 0.2 or 0.3. You want precision.
Answering factual questions?  Drop to 0.3 to 0.5.
Creative writing?        Push to 0.8 to 1.0.
Brainstorming ideas?     Push to 1.0 to 1.2.
Roleplay or fiction?     Push to 0.8 to 1.0.

The easiest way to understand it: temperature 0 means "always pick the single most likely next word." Temperature 2 means "surprise me."
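If you want to see the math, here's a toy sketch in Python. The `apply_temperature` function is made up for illustration; real samplers work on token logits rather than whole words, but the idea is the same: divide the raw scores by the temperature before turning them into probabilities.

```python
import math

def apply_temperature(logits, temperature):
    """Scale raw scores by temperature, then softmax into probabilities.
    Low temperature sharpens the distribution toward the top candidate;
    high temperature flattens it. (Toy sketch, not a real sampler.)"""
    scaled = [score / temperature for score in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate next words with raw scores 2.0, 1.0, 0.5:
logits = [2.0, 1.0, 0.5]
cold = apply_temperature(logits, 0.2)   # near-deterministic: top word dominates
warm = apply_temperature(logits, 1.5)   # much flatter: real chance for all three
print(cold[0], warm[0])
```

At 0.2 the top word gets over 99% of the probability; at 1.5 it drops to roughly half, which is exactly the "play it safe vs. take risks" behavior described above.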

Top P (Nucleus Sampling)

Top P is another way to control randomness, but it works differently from temperature.

Instead of scaling probabilities like temperature does, Top P cuts off the candidate list. It takes the most probable words, adds up their probabilities, and stops when the total reaches P.

Top P = 0.1:
Only considers the very top candidates whose probabilities add
up to 10%. Very focused. Very repetitive.

Top P = 0.9:
Considers candidates adding up to 90% probability. A wide pool
of reasonable words. More varied output.

Top P = 1.0:
Considers everything. Effectively disabled.

The default is usually 0.9.

Here's the practical difference from temperature. Temperature makes unlikely words MORE likely. Top P removes unlikely words entirely. Temperature is a volume knob; Top P is a filter.

+-------------------------------------------------------+
|  RULE OF THUMB                                        |
|                                                       |
|  Adjust temperature OR Top P, not both at the same    |
|  time. They interact in unpredictable ways. Pick one  |
|  to experiment with and leave the other at default.   |
+-------------------------------------------------------+
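That cutoff behavior is easy to sketch. This is a toy illustration (the function name is ours, not a real API), but it's the core of nucleus sampling:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability words whose
    cumulative probability reaches p; drop everything else.
    (Toy sketch of the nucleus sampling cutoff.)"""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "banana": 0.15, "qua": 0.05}
print(top_p_filter(probs, 0.9))   # keeps "the", "a", "banana"
print(top_p_filter(probs, 0.5))   # keeps only "the"
```

Notice that "qua" never survives at 0.9, and at 0.5 the pool collapses to a single word. That's the filter at work: unlikely words aren't made less likely, they're removed.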

Top K

Top K is simpler than Top P. It limits the model to choosing from the K most probable next words, period.

Top K = 10:   Only the 10 most likely words are candidates
Top K = 40:   40 candidates (common default)
Top K = 100:  100 candidates

The difference from Top P is that Top K always uses a fixed number of candidates regardless of their probabilities. Top P uses a variable number based on probability mass.

In practice, Top P is more flexible because it adapts to the situation. Sometimes only 5 words are reasonable candidates; sometimes 200 are. Top P handles both cases naturally. Top K doesn't.

Most people leave Top K at 40 and adjust temperature or Top P instead.
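The fixed-cutoff behavior is just as easy to sketch (again a toy illustration, not a real API). Note how it keeps exactly K candidates whether the distribution is peaked or flat:

```python
def top_k_filter(probs, k):
    """Keep exactly the k highest-probability words, regardless of
    how the probability is spread among them. (Toy sketch.)"""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

peaked = {"yes": 0.95, "no": 0.03, "maybe": 0.02}           # one obvious answer
flat = {"red": 0.26, "blue": 0.25, "green": 0.25, "gold": 0.24}  # many good answers
print(top_k_filter(peaked, 2))   # keeps a bad candidate it shouldn't
print(top_k_filter(flat, 2))     # drops good candidates it shouldn't
```

With the peaked distribution, K=2 still admits "no" at 3%; with the flat one, it throws away "green" and "gold" even though they're nearly as good as the winners. That's the inflexibility the paragraph above describes.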

Repeat Penalty

Models sometimes get stuck in loops, repeating the same phrases or words over and over. Repeat penalty discourages this.

1.0:    No penalty. Repetition is allowed.
1.1:    Mild penalty. Reduces obvious loops.
1.2:    Moderate. Actively avoids recently used words.
1.5+:   Aggressive. Can hurt coherence because the model is
        trying too hard to use different words.

The default is usually 1.1, which is a good starting point.

There's a related parameter called Repeat Last N that controls how far back the model looks when checking for repetition. Default is typically 64 tokens. The model penalizes repeating any word that appeared in the last 64 tokens.
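A simplified sketch of the mechanism: real implementations (llama.cpp, which Ollama builds on) divide positive logits by the penalty and multiply negative ones, over the last Repeat Last N tokens. This toy version does the same on made-up word scores:

```python
def penalize_repeats(logits, recent_words, penalty):
    """Lower the score of any word that appeared recently.
    (Simplified sketch of llama.cpp-style repeat penalty.)"""
    adjusted = {}
    for word, score in logits.items():
        if word in recent_words:
            # divide positive scores, multiply negative ones,
            # so the word always becomes less likely
            adjusted[word] = score / penalty if score > 0 else score * penalty
        else:
            adjusted[word] = score
    return adjusted

logits = {"very": 3.0, "quite": 1.0}
recent = {"very"}   # "very" appeared within the last N tokens
print(penalize_repeats(logits, recent, 1.2))
```

At a penalty of 1.2, the score for "very" drops from 3.0 to 2.5 while "quite" is untouched. Crank the penalty to 1.5+ and heavily used common words (including function words like "the") get pushed down too, which is why coherence suffers.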

Context Length (num_ctx)

This is how many tokens of conversation the model can "see" at once. It's the model's short-term memory.

2048 tokens:   About 1,500 words of history
4096 tokens:   About 3,000 words
8192 tokens:   About 6,000 words
32768 tokens:  About 24,000 words

If a conversation exceeds the context length, the oldest messages get pushed out. The model literally cannot see them anymore. It's not forgetting. Those messages no longer exist in its working memory.

+-------------------------------------------------------+
|  WARNING                                              |
|                                                       |
|  Larger context = more RAM used. Doubling context     |
|  roughly doubles the memory used for the KV cache.    |
|  A model that needs 8GB at 4K context might need      |
|  16GB+ at 32K context. Don't crank this up without    |
|  checking your available memory.                      |
+-------------------------------------------------------+

In OpenWebUI, context length is set per-model in the Advanced Parameters section. If you need a larger context window than the Ollama default, you have two options:

  1. Create a Modelfile in Ollama with a higher num_ctx (we covered this in a previous class)
  2. Set it directly in the custom model parameters in OpenWebUI

Option 2 is easier for most people.
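As a refresher from that previous class, a minimal Modelfile for option 1 might look like this (the base model name here is just an example; swap in whatever you have pulled):

```
FROM llama3.1:8b
PARAMETER num_ctx 8192
```

Then build it with `ollama create my-bigctx-model -f Modelfile`, and the new model shows up in OpenWebUI's model list with the larger context baked in.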

Max Tokens (num_predict)

This limits how long the model's response can be. It's not about the conversation length (that's context). It's about a single response.

256:    Short responses only
1024:   Medium length
4096:   Long, detailed responses
-1:     No limit (model decides when to stop)

The default is usually -1 or model-dependent. Set this if you want to keep responses concise or if the model tends to ramble.

Seed

Setting a seed makes the model's output reproducible. Same seed, same prompt, same parameters, same output every time.

Useful for testing and comparison. If you're tweaking parameters and want to see the effect of a single change, set a seed so everything else stays constant.

Leave it at 0 or blank for normal use. You want variety in regular conversations.
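The idea is the same as seeding any random number generator. Here's an analogy using Python's random module. This is a stand-in to show the reproducibility property, not how the model works inside:

```python
import random

def fake_generate(seed):
    """Stand-in for a model: draw 'words' from a seeded RNG.
    Same seed means the same sequence of picks, exactly like
    a seeded LLM sampler. (Analogy only.)"""
    rng = random.Random(seed)
    vocab = ["alpha", "beta", "gamma", "delta"]
    return [rng.choice(vocab) for _ in range(5)]

print(fake_generate(42) == fake_generate(42))   # same seed: identical output
print(fake_generate(None))                      # unseeded: varies run to run
```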

Stop Sequences

These are strings that tell the model to stop generating. As soon as one of these strings appears in the model's output, generation stops immediately (the stop string itself isn't included in the reply).

Useful for roleplay (stop when the model tries to generate the user's response) or for structured output (stop at a specific delimiter).
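Conceptually, the server does something like this toy sketch (the function name is ours):

```python
def apply_stop(text, stop_sequences):
    """Cut the text at the first occurrence of any stop sequence.
    (Sketch of the concept; the stop string itself isn't returned.)"""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

reply = "Sure, here you go.\nUser: and another thing"
print(apply_stop(reply, ["User:", "###"]))   # everything after "User:" is cut
```

In the roleplay case, setting "User:" as a stop sequence means the model's reply ends the moment it starts writing your side of the conversation.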

Min P

A newer alternative to Top P and Top K. Instead of a fixed cutoff, Min P sets a minimum probability threshold relative to the most likely token.

If the most likely next word has probability 0.4, and Min P is 0.05, then any word with probability below 0.02 (0.4 times 0.05) gets filtered out. The threshold adapts to each situation.

This is gaining popularity because it's more intuitive than Top K and more adaptive than a fixed Top P. Try 0.05 to 0.1 as a starting point if you want to experiment.
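Here's that calculation as a toy sketch (the function name is ours), using the same numbers as the paragraph above:

```python
def min_p_filter(probs, min_p):
    """Drop words whose probability falls below min_p times the top
    word's probability. The cutoff scales with the model's confidence.
    (Toy sketch of Min P filtering.)"""
    threshold = max(probs.values()) * min_p
    return {word: p for word, p in probs.items() if p >= threshold}

# Top word at 0.4 with Min P = 0.05: the cutoff is 0.4 * 0.05 = 0.02.
probs = {"rain": 0.4, "snow": 0.1, "frogs": 0.019}
print(min_p_filter(probs, 0.05))   # "frogs" (0.019 < 0.02) is dropped
```

When the model is very confident (top word near 1.0), the threshold is high and the pool shrinks; when it's uncertain (top word at, say, 0.1), the threshold drops and more candidates survive. That's the adaptiveness Top K lacks.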

Mirostat

Mirostat is an algorithm that automatically adjusts randomness to maintain a target "perplexity" (creativity level) throughout the entire response. Instead of you setting a fixed temperature, Mirostat dynamically adapts as it generates.

Mirostat 0:   Disabled (use temperature/top-p instead)
Mirostat 1:   Original algorithm
Mirostat 2:   Improved version (generally recommended if using it)

Related settings:

Mirostat Tau:   Target creativity (lower = more focused)
Mirostat Eta:   How fast it adapts (default 0.1 is fine)

When Mirostat is enabled, Top P and Top K are typically ignored. Most people either use temperature/top-p OR Mirostat, not both.

Recommended Starting Points

For coding tasks:

Temperature: 0.2
Top P: 0.9
Repeat Penalty: 1.1

For general conversation:

Temperature: 0.7
Top P: 0.9
Repeat Penalty: 1.1

For creative writing:

Temperature: 0.9
Top P: 0.95
Repeat Penalty: 1.15

For factual Q&A:

Temperature: 0.3
Top P: 0.85
Repeat Penalty: 1.0

These are starting points, not rules. Experiment. The best settings depend on the specific model and your personal preferences.
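If you ever want these settings outside the UI, they map directly onto Ollama's API options, which is what OpenWebUI sends under the hood. Here's a sketch of a request body using the coding preset. The option names (temperature, top_p, repeat_penalty) are Ollama's real ones; the model name is just an example:

```python
import json
import urllib.request

# The coding preset above, expressed as Ollama /api/chat options.
payload = {
    "model": "llama3.1:8b",   # example model name; use one you have pulled
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
    "stream": False,
    "options": {
        "temperature": 0.2,
        "top_p": 0.9,
        "repeat_penalty": 1.1,
    },
}

# Uncomment to send to a local Ollama server:
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
print(json.dumps(payload["options"]))
```

Handy for scripting the same experiments you'd otherwise do with the chat settings gear.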

Next Up

Let's try something visual. We'll use a vision model to analyze images directly in the chat.