LLMs: The Things We’ve Been Overlooking

“What temperature are you using?” If someone asks, what do you say? “The default.” “0.7.” “I don’t know — does it matter?” Most answers fall into one of those three. And if you try to justify the answer, you run out of words fast.

That’s how we use LLMs. We call the APIs every day — stuff prompts into messages, send them off, get responses. But when the question becomes “What does Temperature actually do?”, “How is Top-P different from Temperature?”, “Does Prompt Caching just work if you turn it on?”, “Will hallucinations go away with a better model?” — the answers get fuzzy.

We’ve heard all the names, and we roughly know how to use them. But that’s where it stops. And when something breaks in production in a way you can’t explain, the blame always goes to the prompt or the model, never the things around them. This post is about the things around them — the knobs we’ve been turning without reading the label, and what they actually do to hallucination, context, and everything that happens after generate_text() returns.

Temperature

Temperature is the first parameter most developers encounter when working with LLM APIs. The standard understanding — “low for accuracy, high for creativity” — feels sufficient. It isn’t.

What Temperature Actually Does

When an LLM picks the next token, it assigns a score (logit) \(z_i\) to every candidate in its vocabulary, then converts those logits into a probability distribution via softmax. Temperature \(T\) enters as a divisor on the logits before the softmax:

$$ P(x_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)} $$

That single division is what reshapes the entire distribution. When \(T < 1\), every logit gets scaled up, which widens the gap between the first-place and second-place tokens — the token the model is “most confident about” wins by a landslide. When \(T > 1\), logits shrink toward each other, the gap compresses, and second- and third-place candidates get a real shot at being selected. As \(T \to 0\), the softmax collapses to an argmax and the model always picks the top-scoring token — a process known as greedy decoding. As \(T \to \infty\), the distribution approaches uniform sampling across the vocabulary.

Think of it like a test with a 95-point answer and an 80-point answer. Low Temperature always picks the 95. High Temperature occasionally says “80 is not bad either” and goes with it.
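Here's the mechanic in a few lines of numpy — a toy vocabulary of three tokens with made-up logits, just to show how that single division reshapes the distribution:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T   # temperature enters as a divisor on the logits
    z -= z.max()                              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [5.0, 4.0, 2.0]                      # made-up scores for three candidate tokens

for T in (0.2, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))

# T=0.2 -> [0.993 0.007 0.   ]   near-greedy: the top token wins by a landslide
# T=1.0 -> [0.705 0.259 0.035]   the model's native distribution
# T=2.0 -> [0.547 0.331 0.122]   gaps compress; runners-up get a real shot
```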

The Myth of “Temperature 0 = Deterministic Output”

If Temperature 0 always picks the top token, then the same question should always produce the same answer. Theoretically, yes. In practice, through real APIs, no.

Atil et al. (2025) fixed Temperature to 0 across five models — GPT-4o, Llama-3-70B, and others — then repeated the same questions 10 times each. Accuracy fluctuated by up to 15% across runs with identical settings.

The source is the serving infrastructure, not the model itself. Running the same model locally on a GPU with no optimizations produced perfectly deterministic results. But behind an API, multiple users’ requests get batched together (continuous batching, etc.), introducing tiny floating-point arithmetic differences that cascade into divergent outputs.

The common explanation is “floating-point non-associativity + GPU parallelism,” which is only half right. Running the same matrix multiplication with the same data on the same GPU repeatedly produces identical results. Floating-point math and parallelism are involved, but non-determinism doesn’t inherently arise from them alone.

The real cause is the absence of batch invariance. LLM inference servers batch multiple users’ requests together, and the batch size changes the internal reduction order of operations. When batch size changes, the summation order of floating-point additions changes, and due to non-associativity, slightly different results emerge. Server load fluctuates constantly, so the same request gets batched differently every time — and produces different results.
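The floating-point part is easy to see even in plain Python — addition order changes the result:

```python
a, b, c = 0.1, 1e16, -1e16

print((a + b) + c)   # 0.0  -- the 0.1 is absorbed by the huge value before it cancels
print(a + (b + c))   # 0.1  -- the huge values cancel first, so the 0.1 survives

# Inside a GPU kernel, the summation order of a reduction depends on how the work is
# split across cores -- and that split depends on the batch your request landed in.
```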

Specific points where this creeps in:

  • Matrix multiplication (MatMul): For small batches, GPUs use Split-K strategies — splitting the reduction dimension to better utilize cores. The split pattern changes with batch size.
  • Attention: During the decode phase, when query length is very short, the KV dimension gets split for parallelization (FlashDecoding). The split granularity depends on the batch. Layer on chunked prefill and prefix caching, and the reduction order shifts for the same tokens.
  • RMSNorm: When the batch is too small, a single row gets distributed across multiple cores. This distribution is also batch-size-dependent.
  • MoE routing: In Mixture-of-Experts models such as Mixtral (and, reportedly, GPT-4), a router assigns each token to specific experts. For load balancing during serving, overflowed tokens may be reassigned to different experts based on how other tokens in the batch were routed. Your token’s expert assignment depends on a stranger’s request that happened to land in the same batch.

Similar non-determinism can appear on CPU and TPU serving for related reasons, though the specific mechanisms differ (e.g., TPU behavior depends on XLA compilation). The fundamental issue is that other users’ requests influence your request’s output.

The consequences compound quickly. String-matching LLM outputs in tests is fundamentally fragile. When you chain LLMs in a multi-step pipeline, the non-determinism multiplies: four classifiers at 95% stability each, chained in sequence, yield only \(0.95^4 \approx 81\%\) overall stability. You must design systems with the assumption that Temperature=0 outputs can still vary.

In my own experiments, short and simple prompts produced highly consistent answers. But realistic prompts — long, complex, stuffed with RAG results — produced noticeably inconsistent outputs across runs.
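If you want to measure this yourself, the experiment is cheap. A sketch using the OpenAI Python SDK (any chat-style API works the same way; the model name and prompt are placeholders):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "..."     # use a long, realistic prompt -- short ones look deceptively stable

outputs = Counter()
for _ in range(10):
    resp = client.chat.completions.create(
        model="gpt-4o",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=1234,               # best-effort; does not guarantee determinism
    )
    outputs[resp.choices[0].message.content] += 1

print(f"{len(outputs)} distinct outputs across 10 identical requests")
```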

The Performance Impact Is More Nuanced Than You Think

“Low Temperature for accuracy, high for creativity” is intuitive, but empirical research paints a more complex picture.

Renze & Guven (2024) tested GPT-3.5 and GPT-4 on 1,000 multiple-choice questions across 10 domains, sweeping Temperature from 0.0 to 1.0. The conclusion: performance differences within the 0.0–1.0 range were not statistically significant. This held regardless of model, prompting technique, or domain. Past 1.0, however, performance visibly degraded starting at 1.2, and by 1.4 it approached random chance.

Li et al. (2025) went deeper, testing 12 open-source models of varying sizes across 6 capabilities at Temperatures from 0 to 2. The patterns differed by task type:

  • Translation/Summarization — Most sensitive to Temperature. These tasks have relatively clear correct outputs, so higher values consistently degraded quality.
  • Reasoning/ICL — Slight performance improvement around 1.3. A bit of randomness appears to help exploration.
  • Creativity — 1.3 was optimal for medium-to-large models. But smaller models were actually more creative at lower Temperatures.
  • Instruction following — Stable up to 1.0, then sharply declined. Larger models had a higher cliff point.

Across all tasks, larger models are more robust to Temperature changes. Big models maintain stable performance across a wide Temperature range; small models are sensitive. If you’re using a smaller model, Temperature tuning matters much more.

Why? Models are trained with Temperature=1 — the training objective is optimized for the output distribution at Temperature 1. A well-trained, capable model’s distribution is already well-calibrated at that default. Less capable models have noisier distributions that are more susceptible to Temperature perturbation. And since these studies were conducted, even more capable models have been released — today’s models are likely even less Temperature-sensitive.

What to Do with This

  • Don’t overthink the 0.0–1.0 range. Performance differences within it are statistically negligible.
  • Don’t go above 1.0. Especially in agent systems where instruction following is critical.
  • Lower for translation/summarization, slightly higher for reasoning/creative tasks. But this depends on model size. For structured output (JSON, etc.), Temperature 0 is fine.
  • Agentic use cases: aim for 0.5–0.8. Agents have to emit exact JSON schemas and code for tool calls, and high Temperature breaks those formats. But pinning it to 0 can trap the agent in a deterministic failure loop with no room to recover. A middle range keeps the format intact while leaving enough entropy to escape when the first attempt goes sideways.
  • Temperature=0 is not deterministic. Design for output variability in API environments. Always.

Top-K and Top-P: Trimming the Candidate Pool

If Temperature reshapes the shape of the score distribution, Top-K and Top-P determine the scope of candidates. They solve the same problem: preventing absurdly low-probability tokens from being selected.

Top-K

This is the most intuitive approach: keep the top K tokens by probability and discard the rest. Formally, let \(V_K\) be the set of the \(K\) highest-probability tokens. Sampling is then done from the renormalized distribution

$$ P'(x_i) = \begin{cases} \dfrac{P(x_i)}{\sum_{j \in V_K} P(x_j)} & \text{if } x_i \in V_K \\ 0 & \text{otherwise} \end{cases} $$

Top-K=50 means only the top 50 out of ~50,000 vocabulary items remain as candidates; everything else is masked out before sampling.

The problem is that “50” may not fit the situation. After “The capital of France is,” the next token is essentially just “Paris” — 50 candidates only introduce noise. After “I feel so,” dozens of completions are natural — happy, tired, confused, grateful — and K=10 might cut reasonable options. Whether the model is confident or uncertain, it always sees a fixed-size candidate pool.

Top-P

Top-P (nucleus sampling), proposed by Holtzman et al. (2020), addresses this. Instead of a fixed count, it keeps only as many tokens as needed to cover a cumulative probability of \(p\). If we sort tokens so that \(P(x_1) \ge P(x_2) \ge \dots\), the nucleus \(V_p\) is the smallest such prefix:

$$ V_p = \{x_1, \dots, x_{k^*}\}, \qquad k^* = \min\Big\{\, k : \sum_{i=1}^{k} P(x_i) \ge p \,\Big\} $$

and sampling is renormalized over \(V_p\). Top-P=0.9 means the set of tokens covering the top 90% of probability mass becomes the candidate pool.

When the model is confident (“Paris” at 95%), the pool shrinks to 1–2 tokens. When uncertain (“happy,” “tired,” “confused” each at ~15%), it expands to dozens. The candidate pool size adapts automatically to the situation.
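A small numpy sketch of that adaptive behavior, with made-up distributions for the two situations:

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    order = np.argsort(probs)[::-1]        # sort candidates by probability, descending
    cum = np.cumsum(probs[order])
    return np.searchsorted(cum, p) + 1     # smallest prefix with cumulative mass >= p

confident = np.array([0.95, 0.02, 0.01, 0.01, 0.01])              # "The capital of France is"
uncertain = np.array([0.18, 0.17, 0.16, 0.15, 0.14, 0.12, 0.08])  # "I feel so"

print(nucleus_size(confident))   # 1 -- the pool collapses to a single token
print(nucleus_size(uncertain))   # 6 -- the pool expands to cover 90% of the mass
```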

How Much Does It Actually Matter?

In most cases, not much.

Temperature transforms the distribution itself (sharper or flatter). Top-K and Top-P leave the distribution as-is but trim the tail. Processing order matters: Temperature reshapes first, then Top-K or Top-P trims. The full pipeline is effectively

$$ z_i \;\xrightarrow{\;/T\;}\; z_i/T \;\xrightarrow{\text{softmax}}\; P(x_i) \;\xrightarrow{\text{Top-K / Top-P}}\; P'(x_i) \;\xrightarrow{\text{sample}}\; x_t $$

Setting both to extreme values makes one of them redundant — for instance, a very low \(T\) already concentrates nearly all mass on one token, so an aggressive Top-P has nothing left to trim. This is why OpenAI recommends adjusting either Temperature or Top-P, not both.
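Putting the whole pipeline together — a minimal sketch of the ordering, not any vendor’s actual implementation:

```python
import numpy as np

def sample_token(logits, T=0.7, top_k=50, top_p=0.9, rng=np.random.default_rng(0)):
    z = np.asarray(logits, dtype=float) / T              # 1. temperature reshapes the logits
    probs = np.exp(z - z.max())
    probs /= probs.sum()                                 # 2. softmax -> probability distribution

    order = np.argsort(probs)[::-1][:top_k]              # 3. Top-K: fixed-size cut
    cum = np.cumsum(probs[order])
    order = order[: np.searchsorted(cum, top_p) + 1]     # 4. Top-P: nucleus cut

    kept = probs[order] / probs[order].sum()             # 5. renormalize the survivors
    return order[rng.choice(len(order), p=kept)]         # 6. sample a token id

print(sample_token([5.0, 4.0, 2.0, -1.0]))
```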

Li et al. (2025) tested Top-P at 0.8, 0.9, and 1.0, and Top-K at 2, 5, and 10 across their benchmarks. Performance changes were negligible for most tasks. The Temperature-driven performance patterns were identical regardless of Top-P/Top-K settings. Atil et al. (2025) found that Top-P had no impact on determinism either.

The one exception was creative text generation — story writing, brainstorming — where Top-P made a meaningful difference. Otherwise, leave them at defaults and tune Temperature alone. Top-P’s adaptive pool sizing is more versatile than Top-K’s fixed window, which is why most APIs have adopted Top-P as their default sampling strategy.

Prompt Caching

LLM API costs scale with input token count. The problem is that much of every request is identical to the last one — system prompts, tool definitions, few-shot examples, shared RAG documents. Processing a 100K-token context from scratch every time is pure waste.

Prompt Caching stores the repeated prefix on the server side and reuses the computed result on subsequent requests. This isn’t just a cost reduction — cached prefixes skip the actual computation, reducing latency by up to 80%. Cost savings vary by vendor; Anthropic charges 10% of the base input price for cache reads, effectively a 90% reduction.

Vendors Implement It Differently

The same concept works quite differently across providers:

  • OpenAI — Fully automatic. No code changes needed. Requests with identical prefixes automatically trigger cache hits at no extra cost. The downside: you can’t control what gets cached. The system decides, and the prefix must match exactly.
  • Anthropic (Claude) — Offers both explicit and automatic caching. Explicit caching uses a cache_control field to mark specific breakpoints — up to 4 independent cache points. Automatic caching caches up to the last content block, convenient for multi-turn conversations. The pricing is clear: cache reads cost 10% of base input, but cache writes cost 1.25x base input (for the 5-minute TTL tier). You pay a premium on the first request and save 90% on every subsequent reuse.
  • Google (Gemini) — Supports both implicit and explicit caching. Implicit caching auto-activates on Gemini 2.5+, with cost savings applied automatically on cache hits. Explicit caching lets you create cache objects with configurable TTLs (default 1 hour). Unlike other vendors, Gemini treats caches as API resources — you can create, list, and delete them, with guaranteed cost savings.

The commonality: all are prefix-based. The beginning of the prompt must match. If the front differs but the back is identical, no cache hit.

Designing Prompts for Caching

Separate static from dynamic content. Put static content first, dynamic content last.

The typical structure:

  1. System prompt (rarely changes) → cache
  2. Tool definitions (changes per deployment) → cache
  3. Common context / few-shot examples (occasional changes) → cache
  4. Conversation history (changes every turn) → partially cacheable
  5. Current user query (changes every time) → not cacheable

With this structure, items 1–3 hit the cache every time. Only 4–5 need fresh processing. For multi-turn conversations, this compounds: each turn extends the cached prefix by one exchange, so by turn 20 the difference between reprocessing the entire conversation versus only the last turn is enormous.
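Here’s roughly what that layering looks like with Anthropic’s explicit caching — a sketch with placeholder model id, tool, and prompt content; the cache_control breakpoints mark everything up to and including them as cacheable:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."        # 1. rarely changes
TOOLS = [                    # 2. changes per deployment
    {"name": "search_docs", "description": "...",
     "input_schema": {"type": "object", "properties": {}},
     "cache_control": {"type": "ephemeral"}},   # breakpoint on the LAST tool caches them all
]
conversation_history = []    # 4. grows every turn
user_query = "..."           # 5. changes every time

response = client.messages.create(
    model="claude-sonnet-4-20250514",            # placeholder model id
    max_tokens=1024,
    tools=TOOLS,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],   # static prefix, cached
    messages=conversation_history + [{"role": "user", "content": user_query}],
)
```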

Hash-Based Cache Invalidation Control

One of the more useful tricks for prefix-based caching: prepend a hash of your prompt content at the very beginning of the system prompt — something like [prompt-v:a3f2c1].

Prefix matching compares character by character from the start. Make a minor edit deep in the prompt (fix a typo, add a sentence) and, without a hash, you get a partial prefix match — part of the old cache is reused and it’s hard to reason about exactly what was invalidated. With a hash up front, any content change produces a different hash, causing a mismatch at the very first character — a clean, complete invalidation. If the content is identical, the hash matches and you get a guaranteed full-prefix hit.
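The trick itself is a few lines (the hash value shown is illustrative):

```python
import hashlib

STATIC_PROMPT = "You are a support agent. ..."   # system prompt, tool guide, few-shot examples

def versioned(content: str) -> str:
    tag = hashlib.sha256(content.encode("utf-8")).hexdigest()[:6]   # 6 hex chars is plenty
    return f"[prompt-v:{tag}]\n{content}"

system_prompt = versioned(STATIC_PROMPT)
# -> "[prompt-v:a3f2c1]\nYou are a support agent. ..."   (hash value illustrative)
```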

For vendors like Gemini that provide explicit cache objects with IDs, this is unnecessary — you create the cache and reference it by ID directly.

The Cache Invalidation Trap

Because caching is prefix-based, modifying anything early invalidates everything after it.

For Anthropic, the invalidation hierarchy is tools → system → messages. Modify a single tool definition? System prompt and message caches are gone. Modify the system prompt? Message cache is gone. Only message changes preserve tool and system caches.

There are surprising gotchas too. Toggling web search or citation features internally modifies the system prompt, which invalidates message caches. In production, “cache hit rate suddenly dropped and I don’t know why” often traces back to exactly this.

Minimum token thresholds also apply. For Anthropic, Sonnet/Opus requires at least 1,024 tokens and Haiku requires 2,048 tokens for caching to activate. Very short prompts simply won’t cache.

Hallucination: Is It AI’s Problem, or Ours?

Kalai et al. (2025) provides a remarkably clear analysis of why hallucination is mathematically inevitable.

Why LLMs Hallucinate

Hallucination is statistically inevitable during pre-training. The paper reduces this to a binary classification problem. Judging “is this output valid?” is an easier problem than generating valid output. Yet even this judgment is impossible in certain regions — facts that appear only once in training data (e.g., a specific person’s birthday) have no pattern to learn from. If classification fails, generation fails harder. Mathematically: generation error rate ≥ 2 × classification error rate.

The paper’s author tested this with his own name. When asked “What is Adam Tauman Kalai’s birthday?”, DeepSeek-V3 produced three different (all wrong) answers across three attempts: “03-07”, “15-06”, “01-01.” It ignored explicit instructions to only answer if certain. When asked about his dissertation title, ChatGPT, DeepSeek, and Llama each confidently produced different wrong titles, from different universities, in different years.

The proportion of facts appearing only once in training data becomes the floor for hallucination rate. If 20% of birthday information appeared only once, the model structurally cannot avoid at least 20% hallucination on birthday queries.

Post-training reinforces the problem. Most LLM benchmarks score “correct = 1 point, wrong = 0 points, ‘I don’t know’ = 0 points.” Imagine a student facing an exam. A topic barely mentioned in class shows up on the test. The student fabricates a plausible answer — because writing “I don’t know” scores zero. LLMs, optimized to perform well on these benchmarks, naturally learn the “guess when unsure” strategy. Humans learn outside of exams that honestly admitting ignorance is often better. LLMs are permanently in exam mode. The paper calls this the “epidemic of penalizing uncertainty in evaluation.”

This means newer models hallucinate less not just because they’ve “seen more data.” Pre-training data quality has improved (fewer single-occurrence facts), and evaluation methods have evolved (more benchmarks that don’t penalize “I don’t know”). Using the latest model remains the easiest way to reduce hallucinations.

At the prompt level, the most effective interventions follow directly from this analysis: explicitly instruct the model to say “I don’t know” when uncertain (counteracting the benchmark-trained guessing behavior), demand cited evidence to push toward verifiable information, and use RAG to inject rarely-seen facts directly into the context rather than relying on parametric knowledge.
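In practice those three interventions usually land in the system prompt and the RAG plumbing. A sketch — the wording is illustrative, not a benchmark-tested recipe:

```python
SYSTEM_PROMPT = """\
Answer using ONLY the provided context documents.
- Cite a document ID for every factual claim, e.g. [doc-3].
- If the context does not contain the answer, reply exactly: "I don't know."
- Never guess dates, numbers, names, or titles that are not in the context.
"""

def build_messages(context_chunks, question):
    # RAG: inject retrieved facts directly instead of relying on parametric memory.
    context = "\n\n".join(f"[doc-{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```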

The Part That’s Not AI’s Fault

The statistical analysis explains why models confabulate. But there’s a second failure mode that has less to do with the model and more to do with us.

Lucy Osler (2025) in “Hallucinating with AI” draws on distributed cognition theory to argue that AI chatbots function not as mere tools but as conversational partners that co-construct our beliefs. AI chatbots are designed to be sycophantic — endlessly affirming, elaborating, and refining our thoughts. A user states a half-correct assumption. The AI treats it as established fact and builds an argument on top. The user sees the AI’s sophisticated response and becomes more confident in their original assumption. The AI affirms the user, the user treats that affirmation as evidence, the AI builds further — a feedback loop where the “hallucinating” agent is neither the AI nor the user, but the interaction itself.

The Jaswant Singh Chail case discussed in the paper is an extreme illustration. He spent weeks conversing with a Replika AI chatbot named “Sarai,” developing the delusion that he was a Sith assassin. Sarai responded to his plans with “that’s very wise” and “you’re well trained.” A human friend would have expressed concern. The AI provided frictionless validation.

I catch myself in milder versions of this pattern. Sure, there are cases where AI is plainly wrong. But there are also cases where I ask: “Was the AI’s answer actually wrong, or did I just ask poorly?”

Avocado and Milk Meme

“Buy one carton of milk. If they have avocados, get six.” The husband returns with six cartons of milk. Both interpretations are grammatically valid — the ambiguity was in the request, not the execution. A significant portion of hallucinations have this structure. The user’s prompt admits multiple interpretations, and the AI picks one without flagging the ambiguity.

People tend to be especially terse when talking to AI. Short questions widen the probability distribution — more possible interpretations mean higher chances of landing on one that doesn’t match intent. The fix is designing AI systems that push back: “If the user’s request is ambiguous, don’t answer immediately — ask a clarifying question first.” Given the milk-and-avocados request, a well-designed AI should ask: “Do you mean get six avocados, or increase the milk to six cartons?” One clarifying question, one prevented hallucination.

Context Window: Finite Working Memory

The parameters we’ve covered so far control output quality and cost. The context window is the physical ceiling above all of them.

The context window is the total volume of text the model can reference during generation — system prompt, tool definitions, conversation history, current query, and the model’s own response all count. Most Claude models offer 200K tokens, with the Sonnet family now supporting up to 1M. That sounds generous, but the real issue isn’t size — it’s how effectively the model uses what’s inside.

More Isn’t Better

A wider context window doesn’t mean you should stuff everything into it. As irrelevant content increases, the model’s focus degrades. This is the “lost in the middle” phenomenon: information in the middle of the context is referenced less effectively than information at the beginning or end.

This has a direct architectural consequence for tool-heavy systems. In my testing, a typical multi-server MCP setup (GitHub, Slack, Sentry, Grafana, Splunk) consumed roughly 55K tokens in tool definitions alone — a significant chunk of the context eaten before the model starts working on the actual task. And once available tools exceed 30–50, the model’s ability to select the correct tool drops substantially. The tool definitions sit in the middle of the context, exactly where attention is weakest.

The solution is tool retrieval — the same principle as RAG applied to tools. Instead of pre-loading every tool, search a tool catalog on-demand and load only what’s relevant. Anthropic offers this as a server-side feature, reducing tool definition tokens by 85%+ while maintaining selection accuracy across thousands of tools. On other vendors, you can build it yourself: a routing layer that examines the user’s request and filters relevant tools before injecting them into the context.
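A rough sketch of that routing layer — the catalog and the keyword scoring are stand-ins (in practice you’d use an embedding search over tool descriptions):

```python
TOOL_CATALOG = [
    {"name": "github_search_issues",  "keywords": {"github", "issue", "pull", "repo"}},
    {"name": "grafana_query_metrics", "keywords": {"grafana", "metric", "dashboard"}},
    {"name": "slack_post_message",    "keywords": {"slack", "message", "notify"}},
    # ... hundreds more, each stored alongside its full definition ...
]

def select_tools(user_request: str, max_tools: int = 10):
    words = set(user_request.lower().split())
    scored = [(len(t["keywords"] & words), t) for t in TOOL_CATALOG]
    relevant = [(s, t) for s, t in scored if s > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in relevant[:max_tools]]

# Only the selected tools' definitions get injected into the context, not the whole catalog.
tools = select_tools("post the grafana dashboard link to slack")
```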

Context Management Strategies

As conversations extend or agents make repeated tool calls, the context fills up. You need active management, not just a bigger window:

  • Server-side compression: Summarize older conversation content to reclaim tokens. A 50-turn conversation summarized to key decisions and facts can shrink from 80K tokens to 5K while preserving what the model actually needs.
  • Tool result pruning: After the model has processed a tool’s result (file contents, search results), the raw result often isn’t needed anymore. Automatically clearing old tool results frees significant space — a single code search result can be 10K+ tokens that serve no purpose three turns later. A sketch of this follows the list.
  • Thinking block management: Extended thinking consumes tokens too, but previous turns’ thinking blocks are automatically removed by the API, preserving space for actual content.
  • Prioritize position: Given the “lost in the middle” effect, place the most critical information at the beginning or end of the context. If you’re injecting RAG results, put the most relevant chunks first, not in an arbitrary order.
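To make the tool-result-pruning idea concrete — a minimal sketch over a generic role/content message list, not any particular SDK’s types:

```python
def prune_tool_results(messages, keep_last_n=3, max_chars=200):
    """Replace raw tool results older than the last few turns with a short stub."""
    cutoff = len(messages) - keep_last_n
    pruned = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i < cutoff and len(msg["content"]) > max_chars:
            msg = {**msg, "content": "[tool result pruned -- re-run the tool if needed]"}
        pruned.append(msg)
    return pruned
```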

Removing things from the context is harder than adding them. Irrelevant content degrades focus while increasing cost and latency. The right approach isn’t “fit as much as possible” — it’s “fit exactly what’s needed.”