The Settings That Change Your Output
Temperature, top_p, max tokens, stop sequences, and the penalties, and when to touch each
Before you rewrite a prompt for the fifth time, check the dials around it. Every API call has a handful of sampling settings, and they change your output as much as the wording does. People leave them at default, get inconsistent results, and blame the prompt. This post is the second part of Prompt Engineering, Properly, and it covers each setting, what it actually does, and when it is worth touching.
The short version: most of the time you change one or two of these, not all of them. Knowing which one a given problem calls for is the whole skill.
Temperature: how much randomness you allow#
Temperature controls how the model picks the next token. At a low temperature the model almost always takes the highest-probability token, so the output is close to deterministic and repeatable. Raise the temperature and you give the other plausible tokens more weight, which means more variety and more surprise.
The rule I use: low temperature for anything where there is a right answer, higher temperature for anything where there is not. Fact-based question answering, extraction, classification, code: keep it low so the model stays focused and consistent. Brainstorming, marketing copy, poem generation, naming: raise it so you get range instead of the same safe answer every time.
# Factual extraction: stay tight and repeatable.
client.responses.create(model="...", input=prompt, temperature=0.1)
# Brainstorming taglines: let it roam.
client.responses.create(model="...", input=prompt, temperature=0.9)If you are evaluating a prompt and the output changes every run, drop the temperature before you conclude the prompt is bad. You want one variable moving at a time, and a high temperature adds noise that hides whether your wording change actually helped.
Top_p: the other way to control randomness#
Top_p, also called nucleus sampling, is a different knob for the same general thing. Instead of reshaping the whole distribution like temperature does, it limits which tokens are even eligible. With top_p = 0.1, the model only considers the most probable tokens that together make up the top 10% of probability mass. Everything in the long tail is off the table. A high top_p lets the model reach into less likely words, which gives more diverse output.
So low top_p means confident, focused answers; high top_p means more variety. That is the same direction temperature moves, which is exactly why the standard advice is to change temperature or top_p, not both. If you move both you cannot reason about what you did. Pick one as your randomness dial and leave the other at its default.
Max length: a budget, not a target#
Max length (often max_tokens) caps how many tokens the model generates. It does two jobs. It stops runaway responses, and it controls cost, since you pay per output token. I covered the cost side in Cutting LLM cost and latency.
One thing to watch: this is a hard cutoff, not a "be concise" hint. If you set it too low, the model gets chopped off mid-sentence rather than wrapping up early. If you want short output, say so in the prompt ("answer in two sentences") and use max length as a safety ceiling above that, not as the thing that does the shortening.
Stop sequences: end exactly where you want#
A stop sequence is a string that tells the model to stop generating the moment it produces it. It is the cleanest way to control the shape of an output without trimming text yourself afterward.
The classic use is bounding a list. Want at most ten items? Add "11." as a stop sequence and the model halts before it can write an eleventh.
client.responses.create(
model="...",
input="List the top items, numbered:\n1.",
stop=["11."], # stop before the 11th item
)Stop sequences are also how you keep a model from running past the part you care about, for example stopping at "\n\n" so it returns one paragraph, or at a delimiter you use to separate sections. When you are parsing the output downstream, a stop sequence is more reliable than hoping the model stops on its own.
The two penalties: frequency and presence#
These both fight repetition, and they differ in how.
Frequency penalty scales with count. The more times a token has already appeared (in the prompt and the response so far), the harder it gets penalized. This is good for stopping the model from leaning on the same word over and over in long output.
Presence penalty is flat. A token that appeared once and a token that appeared ten times get the same penalty just for having shown up at all. This nudges the model toward introducing new words and topics rather than circling the ones already on the page. Turn it up when you want range; keep it low when you want the model to stay on a narrow subject.
Same as temperature and top_p, the guidance is to adjust one of these, not both, so you can tell what changed. In practice I reach for the penalties rarely. They matter most for longer creative generation where repetition creeps in. For short, structured tasks they are usually not the problem.
A quick reference#
| Setting | What it does | Turn it up when | Turn it down when |
|---|---|---|---|
| Temperature | Reshapes randomness across all tokens | You want creativity, variety | You want factual, repeatable answers |
| Top_p | Limits which tokens are eligible (nucleus) | You want more diverse wording | You want focused, confident answers |
| Max length | Caps generated tokens | Output is getting cut off too early | You need to bound cost or runaway text |
| Stop sequence | Halts generation at a string | (not a dial; set it to bound structure) | |
| Frequency penalty | Penalizes tokens by how often they appear | A word keeps repeating in long output | Repetition is fine or wanted |
| Presence penalty | Flat penalty for any repeated token | You want new topics introduced | You want the model to stay on one subject |
Two pairings move in the same direction, so change only one of each: temperature or top_p, and frequency or presence. And remember results vary by model and version. A temperature that feels right on one model is a starting point on another, not a setting you can copy blindly.
Wrapping up#
The dials are not a substitute for a good prompt, but they are the first thing to check when output is inconsistent, too long, or too repetitive, because each problem maps to a specific setting. Inconsistent: lower temperature. Too long or expensive: max length and stop sequences. Repetitive: a penalty. Fix the dial before you rewrite the prose.
Next in the series: The anatomy of a good prompt, where we move from the settings around the prompt to the structure inside it. The previous part, on why prompting still matters, is here.
Source: the Prompt Engineering Guide, LLM Settings.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.