When the Agent Kept Repeating Itself

At first I thought a request had hung. Where a tool call should have been, the model was instead generating its way toward max_tokens and getting nowhere — sometimes repeating the same sentence over and over, other times just producing low-value filler that never resolved into the JSON the tool call needed. Either way it would burn through the token budget, occasionally time out, and take the whole agent loop down with it.

I was building a B2B business-intelligence AI and putting an agent on top of it. The data couldn’t leave the customer’s network, so hosted APIs were off the table: no OpenAI, no Anthropic. You serve the model yourself, on hardware you control, or you don’t ship the feature. That pushes you into territory the hosted APIs quietly handle for you — quantization, serving frameworks, speculative decoding. Every knob is yours to turn, and yours to break.

The backend model was Gemma 4 (google/gemma-4-26B-A4B-it) on vLLM, quantized to NVFP4 (it started on AWQ INT4), with MTP/speculative decoding turned on for throughput. Which part of that stack was the problem took me a long time to pin down.

When a call did come back, the JSON was sometimes malformed and wouldn’t parse. But that case was cheap — easy to detect, easy to retry. The timeouts and max-token blowups were the ones that hurt: there’s nothing to salvage from a response that burned its entire token budget going nowhere. The user waits a long time and gets nothing usable, which for a product is about the worst failure shape there is. And because this is a multi-agent system, the damage didn’t stay local — one sub-agent stuck in such a loop dragged down the latency of the entire run, even when every other agent finished cleanly. A single degenerate generation, anywhere in the graph, was enough to make the whole thing feel broken.

The confusing part: the benchmarks looked fine. Standard capability evals showed no regression. On paper the model could reason, write code, summarize — nothing in the aggregate scores flagged a problem. But agent behavior isn’t a benchmark. It’s a tight loop of structured generation, and that’s a different kind of pressure than the open-ended text those evals measure. Why the benchmarks miss it is something we’ll come back to later.

I patched the worst of it and shipped. But not knowing why the patch worked — or whether something more fundamental would carry over to the next model — bugged me enough to go dig. This post is that dig.

Why Repetition Happens — Neural Text Degeneration

In 2020, Holtzman et al. published “The Curious Case of Neural Text Degeneration” (ICLR 2020), which named and diagnosed a failure mode anyone who used early GPT-2 had seen: generated text that collapses into repetition, blandness, and incoherence. The name stuck — neural text degeneration — and the years since have filled in why it happens.

Degeneration takes a few forms: bland generic text, incoherent drift, and outright repetition. For an agent, repetition is the worst of them. Bland or slightly-off text still terminates and still parses; a repetition loop doesn’t terminate at all until it hits max_tokens, which is the timeout-and-burn-the-budget failure from the last section. The other forms at least return something quickly, whereas a loop makes the user wait out the entire token budget, the longest possible response time, only to get nothing usable. So that’s the form worth understanding in detail.

Repetition Is a Peaked Distribution

Every time a language model generates a token, it produces a probability distribution over the whole vocabulary — every possible next token, each with some likelihood. What matters is the shape of that distribution.

A flat distribution spreads probability across many plausible candidates; a peaked distribution dumps almost all of it onto a single token:

Flat:   [the: 0.12, a: 0.10, this: 0.09, one: 0.07, ...]  ← several viable
Peaked: [the: 0.78, a: 0.06, this: 0.03, one: 0.01, ...]  ← one dominates

Repetition is what a peaked distribution looks like over time. When one token dominates step after step, the output locks onto a phrase and circles it. So the real question is: what makes the distribution peak like that? It’s tempting to blame decoding — the strategy that picks the token — and decoding does play a role (the next section). But that’s the smaller, downstream half, and the only half you tune from the outside. The real source is upstream, in the model itself.
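One way to put a number on “shape” is entropy, alongside the probability of the top token. The sketch below is illustrative only: it uses just the four probabilities listed in the two examples and renormalizes them, which ignores the unlisted tail (an assumption made to keep the example small).

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, after renormalizing the given probabilities."""
    total = sum(probs)
    normalized = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in normalized if p > 0)

# Only the four listed probabilities from each example; the tail is dropped (assumption).
flat = [0.12, 0.10, 0.09, 0.07]
peaked = [0.78, 0.06, 0.03, 0.01]

for name, dist in (("flat", flat), ("peaked", peaked)):
    print(f"{name}: top={max(dist):.2f} entropy={entropy_bits(dist):.2f} bits")
```

Lower entropy and a higher top-token probability both point the same way: fewer viable candidates at that step.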

The Model Makes the Distribution Peak

Rather than one tidy chain of cause and effect, the research looks at this from a few different angles. They don’t all sit at the same level — and seeing where they don’t line up is more useful than pretending they do.

The distribution level. The most direct view is the probability distribution the model emits. It starts at training time: Holtzman et al. showed that models assign higher probability to repetitive text than to equally fluent human writing — repetitive context is statistically very predictable, so training loss is lower there, and the model picks up a quiet preference for it. Xu et al. (NeurIPS 2022) showed how that preference snowballs at inference: the more a sentence has already appeared, the higher the model rates the odds of producing it again, with high-probability sentences snowballing fastest. A single repetition starts the climb, each one making the next more likely. These two are the same story — a distribution biased toward repetition, getting worse as the loop feeds itself.

One level deeper. A flawed distribution is itself a symptom of something happening inside the model. Duan et al.’s “Circular Reasoning” is the angle I found most striking: in current reasoning models, the collapse begins before it reaches the output. Semantic repetition precedes textual repetition. The internal state starts circling one idea first, held there by a self-reinforcing attention pattern where the generated span becomes “a logical premise for its own recurrence,” and only later do the same words appear in the text. By the time you can see the loop, the internal state has already collapsed; the repeated text is the symptom, not the disease. (They can catch the precursor on that internal signal, before any text repeats, with a CUSUM detector.) This isn’t a different cause from the distribution view; it’s the same failure seen one layer down.
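For readers who haven’t met CUSUM, here is a minimal sketch of the generic one-sided version. It is not the paper’s detector: the signal (a per-step “how similar is the internal state to a few steps ago” score), the baseline, the slack, and the threshold are all hypothetical values chosen for illustration.

```python
def cusum_alarm(signal, baseline, slack, threshold):
    """Return the first index where the one-sided CUSUM statistic exceeds threshold, else None."""
    statistic = 0.0
    for index, value in enumerate(signal):
        # Accumulate only the drift above baseline + slack; never go below zero.
        statistic = max(0.0, statistic + (value - baseline - slack))
        if statistic > threshold:
            return index
    return None

# Hypothetical similarity scores: steady around 0.3, then drifting upward.
scores = [0.30, 0.32, 0.29, 0.31, 0.55, 0.60, 0.62, 0.65]
print(cusum_alarm(scores, baseline=0.30, slack=0.05, threshold=0.5))
```

The point of the accumulation is that a sustained small drift trips the alarm even when no single step looks alarming on its own.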

The dial that turns it all up: length. Both of the views above get worse the longer the generation runs — more prior tokens to reinforce, more room for a loop to entrench. And this isn’t specific to repetition. Chroma’s “context rot” study shows the broader pattern across 18 frontier models: as input grows, output reliability degrades unevenly, and one of the concrete ways it breaks is exactly our failure — on a repeated-words task, models drift into generating the same token until they hit the output limit. The authors are careful to say they don’t know the mechanism, only that longer context reliably makes things worse. That’s enough for our purposes. Repetition is one face of a length-driven degradation that every long-running model is subject to. And an agent is long-running by nature: it reasons through an extended chain of thought before each tool call, which puts it squarely in the regime where this degradation shows up.

Which Means the Real Fix Is Out of Reach

If the root cause is the weights, the clean fix is to change the weights — retrain the foundation model so it’s less prone to this collapse. But that’s not a knob anyone serving an off-the-shelf model gets to turn. Retraining Gemma 4 means a training pipeline, data, and a lot of GPUs; it’s the vendor’s job, and an expensive one.

So I went looking for the next-best thing: with the weights fixed, what can I do to make that internal collapse less likely — or, when it happens anyway, less damaging? That reframes the whole problem. I can’t stop the model from being repetition-prone, but I can avoid feeding the collapse. Some levers keep the distribution from getting worse than the weights already make it (the decoder, the quantization format, the router); others contain the blast radius when a generation degenerates regardless. None of them cure the bias — they just keep it from taking down a production agent. The rest of this post is what I found, one lever at a time, starting with the decoder since it’s the cheapest place to look.

The Decoding Knobs Are Stuck

The first place to reach when output degenerates is the decoder — temperature, top-p, repetition penalty. These are the cheapest levers, the ones you can change without touching the model or the serving stack. The problem is that for an agent, every one of them is boxed in: the setting that suppresses repetition is the same setting that breaks something else.

(I dug into the mechanics of temperature, top-p, and the myth that T=0 is deterministic in an earlier post; here I’ll keep the basics brief and focus on why these knobs are stuck.)

Temperature

Temperature is a scalar applied to the logits before the softmax during token sampling:

P(token_i) = exp(logit_i / T) / Σ exp(logit_j / T)

When T < 1, dividing by a small number amplifies logit differences — the highest-probability token’s lead widens, and the distribution spikes. When T > 1, those differences compress, and the distribution flattens:

| Temperature | Distribution shape | Risk |
|---|---|---|
| T → 0 (sampling limit) | Single-token spike | Repetition / degenerate loops |
| T < 1 | Sharp, peaked | Elevated repetition risk |
| T = 1 | Unmodified model distribution | Baseline |
| T > 1 | Flat, diffuse | Incoherence / hallucination |

Note: Greedy decoding bypasses temperature scaling entirely — T→0 is the sampling limit that approaches this behavior, not the same operation.
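The formula is easy to try by hand. A minimal sketch, with made-up logits (the values are hypothetical, not taken from any model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """P(token_i) = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; greedy decoding is argmax, not T=0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical logits for four tokens
for temperature in (0.2, 0.6, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As T drops, the first token’s share climbs toward 1; as T rises past 1, the four probabilities move toward each other.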

This is the knob that feeds the collapse from the last section, through a mechanism worth naming: maximization. The sharper the distribution, the closer decoding gets to always taking the highest-probability token — and over a long generation that’s a trap. Once the context is full of tokens the model already produced, the highest-probability next token is often just the safest one, a phrase already weighted by what came before; picking it weights it further, which makes it likelier still. Low temperature, by sharpening the distribution, walks decoding straight into that self-reinforcing loop. A flatter distribution leaves room to step around it.

So the clean fix seems obvious: raise the temperature. This is a well-established result — Holtzman et al. demonstrated it — but it was measured in open-ended generation (stories, dialogue, creative writing), and you might reasonably ask whether it carries over to a reasoning agent. It does. DeepSeek-R1’s own paper (arXiv:2501.12948) reports that greedy decoding inflates repetition rates on long reasoning outputs, which is exactly why they evaluate at temperature 0.6 rather than zero — the same finding, confirmed on the kind of model we’re actually running.

There’s a ceiling, though. DeepSeek’s usage guidance for R1 recommends keeping temperature between 0.5 and 0.7 (0.6 is the suggested value), warning that the alternative is endless repetition or incoherent output. So even just to manage repetition, the usable zone for a reasoning model is a narrow band, not “as high as possible.” And that’s the picture before tool calls enter. Once they do, the band gets squeezed from the other side too.

The squeeze comes from what the agent has to produce. A tool call isn’t free-form text — it’s structured output a parser has to accept, typically JSON like this:

{
  "tool": "search_database",
  "parameters": {
    "query": "Q3 revenue by region",
    "limit": 10
  }
}

Any deviation — a missing brace, an extra field, a string where an integer is expected — breaks the downstream parser.
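A strict validator makes those three failure classes concrete. This is a sketch, not the parser from my stack: the schema dictionary and the function name are hypothetical, written to match the example call above.

```python
import json

# Hypothetical schema for the example call: parameter name -> required type.
PARAMETER_TYPES = {"query": str, "limit": int}

def validate_tool_call(raw):
    """Return (ok, reason) for a raw tool-call string."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"malformed JSON: {exc.msg}"
    if not isinstance(call, dict) or set(call) != {"tool", "parameters"}:
        return False, "unexpected top-level fields"
    params = call["parameters"]
    if not isinstance(params, dict):
        return False, "parameters must be an object"
    extra = set(params) - set(PARAMETER_TYPES)
    if extra:
        return False, f"extra field(s): {sorted(extra)}"
    for name, expected in PARAMETER_TYPES.items():
        if name not in params:
            return False, f"missing field: {name}"
        value = params[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False, f"wrong type for {name}"
    return True, "ok"

good = '{"tool": "search_database", "parameters": {"query": "Q3 revenue by region", "limit": 10}}'
print(validate_tool_call(good))
print(validate_tool_call(good[:-1]))                 # missing closing brace
print(validate_tool_call(good.replace("10", '"10"')))  # string where an integer is expected
```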

Format-breaking tokens exist somewhere in the model’s distribution for nearly any generation. At low temperature the distribution is peaked enough that those tokens are rarely sampled; as temperature rises, the distribution broadens and lower-probability tokens — including ones that break the expected format — get sampled more often. Lower temperature makes format failures less likely, but it’s a probability shift, not a guarantee. Constrained decoding (grammar-guided sampling via tools like outlines or jsonformer) is what actually guarantees well-formed output; temperature only nudges the odds.

Agentic systems run low for exactly this reason — though there’s no single magic number. Recommended settings vary by model: some tool-use guides call for greedy decoding outright, others (GLM, Qwen, and friends) publish agentic temperature ranges nowhere near zero. The common thread isn’t a value, it’s the direction: format-critical work wants a sharper distribution. Which leaves temperature pulling two ways at once:

Format stability  →  wants low temperature (sharper distribution)
Repetition safety →  wants temperature closer to 1.0

One parameter can’t satisfy both pulls. Push it up to escape repetition and format reliability drops; push it down to protect format and repetition risk climbs. The two demands squeeze temperature into a thin band with almost no slack — and since format failures are non-negotiable for an agent, you can’t even trade toward the repetition-safe side. The first dial is effectively pinned: not at a value I chose, but by constraints I couldn’t move.

Top-p Won’t Rescue It

The natural objection is that temperature isn’t the only sampling knob. What about top-p? Nucleus sampling, introduced in the same Holtzman et al. paper, truncates the long, low-probability tail: keep the smallest set of tokens whose cumulative probability reaches p, and sample only from those. It’s good at the flat-distribution problem, trimming unreliable tail tokens.

But our problem is the opposite — a distribution that’s too sharp. When T is low and the top token already holds most of the mass, top-p collapses to near-greedy regardless of p. If p = 0.9 but the top token is at 0.92, the nucleus is a single token; the threshold is crossed before a second one is ever admitted. Top-p can’t widen a distribution that’s already collapsed, so it does nothing for the repetition we’re stuck with.
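The single-token nucleus is easy to reproduce. A minimal sketch of the truncation step, with hypothetical token probabilities (only the 0.92 figure comes from the text above):

```python
def nucleus(probs, p):
    """Smallest set of tokens, taken in descending probability, whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

sharp = {"the": 0.92, "a": 0.04, "this": 0.03, "one": 0.01}     # already collapsed
spread = {"the": 0.35, "a": 0.30, "this": 0.20, "one": 0.15}    # several viable tokens

print(nucleus(sharp, 0.9))   # the top token alone crosses the threshold
print(nucleus(spread, 0.9))  # here top-p has a real tail to trim or keep
```

With the sharp distribution, any p at or below 0.92 yields the same one-token nucleus, so sweeping p changes nothing.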

The Other Dial: Repetition Penalty

With temperature boxed into that narrow band (too low and it loops, too high and it breaks format), there’s little room left to move it, so the obvious next move is a repetition penalty. It helps, but only so far. A standard penalty looks only at whether a token has already appeared (repetition_penalty rescales its logit by a fixed factor) or at how often it has appeared (frequency-style penalties subtract in proportion to the count). Neither works well on long reasoning traces, where the question isn’t raw frequency but whether a token is extending a repeated span (the LZ Penalty work, Ginart et al., makes this case in detail). Push it hard enough to kill the loops and it also suppresses tokens that legitimately recur, such as JSON keys, variable names, and fixed phrasing, so structured output starts to degrade. There’s a usable middle setting (I landed on 1.05 after a sweep), but no value that buys repetition safety for free.
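To see why the penalty can’t tell a loop from a legitimately recurring JSON key, here is a sketch of the common multiplicative form. I’m assuming the convention where positive logits are divided by the penalty and negative ones multiplied; the logits and token ids are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Penalize every token id that has appeared at least once, regardless of why it recurs."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink a positive logit
        else:
            adjusted[token_id] *= penalty   # push a negative logit further down
    return adjusted

# Hypothetical: id 0 is a JSON key the schema requires again, id 1 is a looping filler word.
logits = [4.0, 3.5, 1.0]
history = [0, 1, 1, 1, 1]

for penalty in (1.05, 1.5):
    print(penalty, [round(x, 3) for x in apply_repetition_penalty(logits, history, penalty)])
```

Both the required key and the looping word take the same hit, and in this form a token seen once is penalized exactly as much as one seen four times.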

So every decoding knob is stuck. Temperature can’t go up without breaking format; the penalty can’t go up without breaking structured output; top-p does nothing against a too-sharp distribution. This is the practical consequence of what we saw earlier — the real source is the model’s distribution, and the decoder only reads it. So if the knobs that read the distribution are all boxed in, the question becomes what’s making that distribution worse than it has to be. The prime suspect sits one layer down, in the quantization.

Quantization: A Suspect, but Not the Culprit I Expected

The decoder is boxed in and the model’s bias is fixed in the weights. The next thing under my control was the quantization. It was a natural suspect — I was running the model in 4 bits, and 4 bits is lossy — and switching from AWQ INT4 to NVFP4 had seemed to help. Digging in was the most humbling part of this whole exercise: the more I read, the weaker the case for “quantization caused it” got.

What it is, and why I suspected it

A 26B model in BF16 needs roughly 52GB just for weights — more than fit comfortably on the GPUs I had. Quantization shrinks that by storing each weight in 4 bits instead of 16, a ~4x cut. It’s a storage compression done after training (post-training quantization, PTQ): each weight is rounded to the nearest point on a coarse 4-bit grid and saved.

One thing tripped me up at first. Weight-only methods like AWQ dequantize back to BF16 right before the matmul, so the math runs in full precision — then why is anything lost? Because the loss already happened at save time: rounding to the grid throws the difference away, and dequantizing just maps the grid point back up, it can’t recover what was discarded. (NVFP4 on Blackwell goes further and multiplies in 4 bits directly, but the lossy step is the same rounding.) So the weights the model runs on are permanently coarsened — and since the previous section pinned repetition on a fragile distribution, anything that perturbs the weights is a plausible way to make that distribution worse. That’s why I suspected it.
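A toy round trip shows both the memory arithmetic and where the loss happens. This is plain symmetric round-to-nearest on a signed 4-bit integer grid, used here only as an illustration: it is not AWQ’s activation-aware scaling or NVFP4’s floating-point format, and the weights are made up. The size estimate also ignores the small overhead of stored scales.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Approximate weight storage in GB, ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8 / 1e9

def quantize_int4(weights):
    """Round each weight to the nearest point on a signed 4-bit grid [-8, 7] with one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map grid points back to floats; the rounding error is not recoverable."""
    return [code * scale for code in codes]

print(weight_gigabytes(26e9, 16), weight_gigabytes(26e9, 4))

weights = [0.70, -0.31, 0.12, 0.04, -0.02]  # hypothetical weight group
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
print(codes)
print([round(w - r, 3) for w, r in zip(weights, restored)])  # what was thrown away at save time
```

The smallest weights round to zero entirely, which is the kind of loss no amount of full-precision arithmetic afterwards can undo.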

The literature mostly says quantization is fine

The trouble is that when I went looking for support, the evidence largely pointed the other way. On average, 4-bit quantization barely moves quality: NVFP4 and INT4 land in roughly the same place on standard benchmarks, and where there’s a difference it’s not in the mean. “Bridging the Gap” (ICLR 2026) even shows INT4 matching or beating NVFP4 on average accuracy with the right method — so the format gap is smaller than the marketing suggests.

It gets worse for my theory. Dong et al.’s “Can Compressed LLMs Truly Act?”, which tests agentic capability under compression directly, finds 4-bit quantization preserves tool use with only a 1–3% drop. And Liu et al.’s “Quantization Hurts Reasoning?” (COLM 2025) reports flatly that quantized models do not generate longer outputs — the exact opposite of what you’d see if quantization were driving runaway loops. Taken together: there is no clean published evidence that quantization, on its own, causes the repetition and broken JSON I was seeing. If I’m honest, the literature mostly told me to look elsewhere.

But our BI agent benchmark showed a small, stubborn gap

Here’s the part I can’t dismiss, though. We keep an in-house BI agent benchmark: a task suite built from the product’s real agentic workloads, not a standard eval like MMLU. On it, the quantized model scored a little below the unquantized one, but not evenly. The gap showed up almost entirely on the long, many-tool-call tasks, and it was intermittent: most runs were fine, a handful fell apart. That shape matches the one thing the format papers agree on. INT4’s average quality is fine; what’s worse is its consistency. It fails rarely, but harder when it does.

And rare-but-hard is exactly the wrong failure mode for an agent. A normal chatbot turn is one shot; if it’s good on average, you’re fine. A long agentic workflow is dozens of generations chained together, so it doesn’t get the average — it gets the worst of those dozens, and one bad generation can sink the whole run. The longer the chain, the more chances to hit the bad one. So my suspicion narrowed from “quantization makes the model worse” to something more specific: low-bit quantization, INT4 in particular, makes the occasional severe failure a little more likely — and long tool-use chains run into “occasional” far more often than a single chat turn does. Still a suspicion, not a result: we’re mid-build, and I haven’t run the controlled comparison that would settle it.
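The arithmetic behind “the longer the chain, the more chances” is short. A sketch, assuming each generation fails independently with the same probability (real failures are probably correlated, so treat this as a rough guide); the failure rates below are hypothetical, not measurements from my stack.

```python
def chain_failure_probability(per_call_failure, num_calls):
    """Probability that at least one of num_calls independent generations fails."""
    return 1.0 - (1.0 - per_call_failure) ** num_calls

# Hypothetical per-generation failure rates for two formats.
for label, rate in (("format A", 0.005), ("format B", 0.01)):
    for calls in (1, 10, 40):
        print(label, calls, round(chain_failure_probability(rate, calls), 3))
```

A half-point difference that is invisible on a single chat turn becomes a gap of many points once forty generations are chained.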

Why I switched anyway — and it wasn’t mainly for quality

I still had to quantize: memory and throughput left no choice. So the real question wasn’t whether to quantize but which format, and the variance finding gave me a reason to try NVFP4 over AWQ. After the switch, the max-token blowups did get a little less frequent — consistent with the tail-narrowing story, though far too small and noisy to call proof.

But I’ll be honest about what actually moved the needle: it was speed, not quality. NVFP4 on Blackwell was meaningfully faster, and that did more for the user experience than the marginal quality bump. Even when a generation still ran long, getting through it faster softened the worst of the pain. The quality improvement was a maybe; the latency improvement was real and immediate. If you’re choosing a format under the same ambiguity I was, that’s the honest decision criterion — pick the one whose certain benefit you actually need, and treat the quality difference as a bonus you can’t yet bank on. (The controlled INT4-vs-NVFP4-vs-FP16 quality comparison nobody has published is the gap my next post is meant to fill.)

A Wrinkle for MoE: Quantization Can Shift Which Experts Fire

One more factor is specific to my model. Gemma 4 26B is a Mixture-of-Experts model: instead of every token passing through the same feed-forward network, a small router scores all 128 experts per layer and sends the token to the top 8. That routing step doesn’t exist in a dense model, and it adds a way for quantization to go wrong that’s easy to miss.

The intuitive worry — “quantizing the router corrupts its scores” — turns out to be the wrong worry, because nobody quantizes the router. It’s a tiny fraction of the parameters, and the standard practice (in vLLM, in production quantized builds) is to leave it in full precision while quantizing the experts. So the router’s own math is fine.

The subtler problem is that the router can be fed bad input. EAC-MoE (ACL 2025) traces the actual chain: quantizing the attention and expert weights perturbs the hidden state before it reaches the router, so an unquantized router scoring a slightly-off input can still flip which experts win — the paper measures this “expert-shift” and finds preserving the original expert selection recovers much of the lost quality. EAQuant (arXiv:2506.13329) reports the same failure under the name “routing instability.” Both argue that keeping the router in full precision isn’t sufficient on its own; you have to recalibrate its outputs against the quantized inputs. And it should bite hardest on a fine-grained model like mine — with 128 experts and a thin margin between ranks 8 and 9, it takes less perturbation to flip the cutoff than in an 8-expert model.
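A toy version of the flip, to make the mechanism concrete. The router scores are invented, the example uses 6 experts and top-2 rather than 128 and top-8, and the `expert_shift` function is my own simplified measure, not the metric defined in either paper.

```python
def top_k_experts(scores, k):
    """Indices of the k highest router scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def expert_shift(scores_before, scores_after, k):
    """Fraction of originally selected experts that drop out of the top k."""
    before = top_k_experts(scores_before, k)
    after = top_k_experts(scores_after, k)
    return len(before - after) / k

# Hypothetical scores from the same full-precision router on two inputs:
# the original hidden state, and one slightly perturbed by quantized layers upstream.
clean = [2.10, 1.21, 1.19, 0.40, 0.35, 0.10]
perturbed = [2.08, 1.18, 1.20, 0.41, 0.36, 0.11]

print(top_k_experts(clean, 2), top_k_experts(perturbed, 2))
print(expert_shift(clean, perturbed, 2))
```

No score moved by more than 0.03, and the router itself was untouched, yet one of the two selected experts changed because the margin at the cutoff was only 0.02.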

I’ll be upfront that this is the most speculative link in the chain for my case. I haven’t run EAC-MoE’s calibration or measured expert-shift on my own deployment — I’m taking the mechanism on the papers’ word. But it’s theoretically coherent and points at a concrete lever if the symptoms ever warrant it: recalibrate routing for the quantized model rather than assuming an untouched router is enough. For now I’m noting it as a plausible contributor, not a confirmed cause.

MTP: The One Knob That Wasn’t a Suspect

The last piece of the stack was MTP (Multi-Token Prediction), used here for speculative decoding — the MTP module drafts several tokens ahead and the main model verifies them in parallel (EAGLE-style draft-and-verify). And unlike the others, this one I can mostly clear.

The reason is that speculative decoding is lossless in a specific, meaningful sense: it’s constructed to reproduce the target model’s exact output distribution. It changes how fast tokens come out, not which tokens come out. So in theory it can’t be the thing degrading quality — and in practice, on Gemma 4 26B, I couldn’t tell the difference in output quality with it on versus off. I keep it on.
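The “lossless” claim can be checked on paper for a single token. The sketch below computes the exact output distribution of the standard speculative-sampling rule: accept a drafted token x with probability min(1, target[x] / draft[x]), otherwise resample from the normalized positive part of target minus draft. Whether a given MTP implementation uses exactly this rule is an assumption on my part, and the two distributions are made up.

```python
def speculative_output_distribution(target, draft):
    """Exact distribution of one token emitted by draft-then-verify under the standard rule."""
    # Probability of proposing x and accepting it: draft[x] * min(1, target[x] / draft[x]).
    accepted = [min(t, d) for t, d in zip(target, draft)]
    reject_mass = 1.0 - sum(accepted)
    residual = [max(0.0, t - d) for t, d in zip(target, draft)]
    residual_total = sum(residual)
    if residual_total == 0:  # draft equals target, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / residual_total for a, r in zip(accepted, residual)]

target = [0.7, 0.2, 0.1]  # hypothetical target-model distribution
draft = [0.4, 0.4, 0.2]   # hypothetical (worse) draft distribution

print([round(p, 6) for p in speculative_output_distribution(target, draft)])
```

A poor draft lowers the acceptance rate, and therefore the speedup, but the distribution that comes out is still the target’s.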

There’s one caveat worth stating so the “lossless” label doesn’t lull you: lossless means matching the target, and the target is the quantized model. Speculative decoding faithfully reproduces whatever the NVFP4 model would have produced on its own — including its bad days. It won’t add degradation, but it won’t save you from the quantization issues in the previous sections either. If you ever do suspect the draft path, it’s a clean lever to test: run the draft unquantized (--speculative-draft-model-quantization unquant in SGLang) and compare.

If anything, MTP earned its keep in the exact situation this whole post is about. When a generation does fall into a repetition loop and runs to max_tokens, MTP gets through those wasted tokens faster — the loop still happens, but the user waits less for it to play out and the system recovers sooner. It doesn’t fix the failure, but it softens the cost of one, which on a long agentic workload is worth a lot.

What to Suspect, What to Change — A Comprehensive Diagnostic Guide

Every factor covered in this post carries a trade-off. The question was never “how do I eliminate degeneration for free”; it was always “what am I willing to give up to get stability?”

The Five Factors at a Glance

| Factor | How it relates to the failure | My confidence it mattered here | Can I change it? |
|---|---|---|---|
| The model’s bias | Root cause: weights are repetition-prone before anything else | High (well-established) | ✗ Can’t retrain a vendor model |
| Temperature / penalty | Decoding can tip a fragile distribution over; agent format pins it low | High | ✗ Boxed in both directions |
| Quantization format | INT4’s higher variance may widen the rare-failure tail | Low–medium (suspected, not measured) | ○ Switched to NVFP4, mainly for speed |
| MoE routing | Quantizing experts can shift expert selection via corrupted router input | Theoretical only (didn’t test) | △ Router recalibration exists |
| MTP / speculative | Lossless w.r.t. the target; reproduces but doesn’t add degradation | Cleared: not a cause | ○ Kept on; it speeds recovery |

What to actually do, cheapest first

You usually can’t ablate everything at once, so order the moves by cost — cheap and reversible first, fundamental last:

  1. Nudge decoding, minimally. Tweak repetition_penalty and temperature — but judge the result against your benchmark, not the repetition rate alone. In my case repetition fell and the quality score fell with it, so I stopped at the smallest value that helped.
  2. Toggle one serving option at a time. MTP off, KV-cache quantization off, draft unquantized — separately, so you can attribute any change.
  3. Measure perplexity and tool-call success rate as two numbers. Perplexity fine but tool calls failing → quantization damage on structured output. Both bad → over-compressed or wrong model. Both fine → it’s your prompts and schemas, not quantization.
  4. Switch format, then escalate. INT4 → NVFP4 for the variance; if it’s MoE and you can, recalibrate routing; if nothing holds, a QAT’d checkpoint is the last resort — and usually the vendor’s job, since QAT means retraining.
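Step 3’s two-number readout can be written down as a small triage function. This is a sketch of the decision table only: what counts as “fine” for perplexity or tool-call success is a threshold you have to pick for your own model and benchmark, and the function names are hypothetical.

```python
def tool_call_success_rate(outcomes):
    """Fraction of tool calls that parsed and validated; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def triage(perplexity_ok, tool_calls_ok):
    """Map the two measurements from step 3 onto the suspects named there."""
    if perplexity_ok and not tool_calls_ok:
        return "quantization damage on structured output"
    if not perplexity_ok and not tool_calls_ok:
        return "over-compressed or wrong model"
    if perplexity_ok and tool_calls_ok:
        return "prompts and schemas, not quantization"
    return "not covered by the checklist: perplexity bad, tool calls fine"

rate = tool_call_success_rate([True, True, False, True])
print(rate, triage(perplexity_ok=True, tool_calls_ok=rate >= 0.95))  # 0.95 is an arbitrary cutoff
```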

Closing Thoughts

The honest summary is that I never found a single culprit. The repetition was rooted in the model’s own distribution, which I can’t retrain; the decoding knobs that might suppress it are pinned by the agent’s format requirements; quantization probably widens the tail but the evidence is mixed and I haven’t measured it on my own stack; MoE routing is a plausible contributor I’m taking on faith; and MTP, the thing in the title, turned out to be innocent. What I gained wasn’t a fix so much as a map — which knobs are stuck, which are levers, and which suspect is worth chasing next time the same symptoms show up on a different model.

That reframes the whole problem as a choice rather than a fix: separate what you can’t change from what you can, then give up whatever costs least against your actual goal. For me the goal was latency and tool-call reliability, so I took the certain win (NVFP4’s speed) and treated the quality difference as a bonus I couldn’t yet prove. Someone optimizing for raw accuracy would trade differently. None of these calls are universally right; they’re right relative to a target.

The last lever is the one I underrated going in: the model isn’t the only layer you can fix. Some generations will degenerate, so alongside trying to prevent that I built two operational layers around it.

The first catches trouble before it turns into a dead request. The shape of this is borrowed from how Kubernetes shuts down a pod: it doesn’t hard-kill the process outright, it sends SIGTERM first, waits out a grace period, and only then sends SIGKILL. Each agent runs against a budget (wall-clock time, tokens, tool-call count), and when it crosses a threshold (say, 80% of the budget), I inject a nudge that plays the role of SIGTERM, the warning before the kill: you’re near your limit; wrap up. Summarize what you tried, what you found, what failed, and what a next attempt should do differently. Hitting the budget ceiling itself is the SIGKILL. Instead of dying mid-loop with nothing to show, the agent leaves a structured hand-off, and the orchestrator can read it to decide whether to retry, and in which direction, rather than blindly re-running the same thing.
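The budget check itself is small. A minimal sketch under stated assumptions: the class and field names are hypothetical, the 80% soft threshold is the figure from the paragraph above, and the most-consumed resource is what decides the state.

```python
from dataclasses import dataclass
from enum import Enum

class BudgetState(Enum):
    OK = "ok"
    NUDGE = "nudge"  # soft threshold crossed: plays the role of SIGTERM
    KILL = "kill"    # ceiling reached: plays the role of SIGKILL

WRAP_UP_NUDGE = (
    "You are near your limit; wrap up. Summarize what you tried, what you found, "
    "what failed, and what a next attempt should do differently."
)

@dataclass
class AgentBudget:
    max_tokens: int
    max_tool_calls: int
    max_seconds: float
    soft_fraction: float = 0.8

    def check(self, tokens_used, tool_calls_used, seconds_elapsed):
        """Return the state implied by whichever resource is closest to its limit."""
        usage = max(
            tokens_used / self.max_tokens,
            tool_calls_used / self.max_tool_calls,
            seconds_elapsed / self.max_seconds,
        )
        if usage >= 1.0:
            return BudgetState.KILL
        if usage >= self.soft_fraction:
            return BudgetState.NUDGE
        return BudgetState.OK

budget = AgentBudget(max_tokens=8000, max_tool_calls=20, max_seconds=120.0)
if budget.check(tokens_used=7000, tool_calls_used=5, seconds_elapsed=30.0) is BudgetState.NUDGE:
    print(WRAP_UP_NUDGE)  # in a real loop this would be injected into the agent's context
```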

The second layer contains the damage when a generation degenerates anyway: sub-agents with their own token budgets and timeouts, so one runaway loop becomes a local failure the orchestrator can retry or route around instead of a stalled run. It’s a bulkhead, not a cure, but together with the nudge it’s what actually rescued the user experience the loops (back at the very start) were wrecking. Sometimes the cheapest robust fix isn’t a better distribution — it’s a system that degrades gracefully and survives a bad draw.
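The bulkhead can be sketched with nothing more than asyncio timeouts. This is an illustration, not my orchestrator: `fake_agent` is a hypothetical stand-in for a real sub-agent, and the timeouts are shortened so the example finishes quickly.

```python
import asyncio

async def run_with_bulkhead(sub_agent, timeout_seconds):
    """Run one sub-agent coroutine; turn a timeout or crash into a local failure record."""
    try:
        result = await asyncio.wait_for(sub_agent, timeout=timeout_seconds)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        return {"ok": False, "error": "timeout"}
    except Exception as exc:
        return {"ok": False, "error": repr(exc)}

async def fake_agent(delay, answer):
    """Hypothetical sub-agent: sleeps to simulate generation time, then answers."""
    await asyncio.sleep(delay)
    return answer

async def orchestrate():
    # One healthy sub-agent and one stuck in a runaway generation.
    return await asyncio.gather(
        run_with_bulkhead(fake_agent(0.01, "revenue table"), timeout_seconds=0.5),
        run_with_bulkhead(fake_agent(5.0, "never returned"), timeout_seconds=0.05),
    )

print(asyncio.run(orchestrate()))
```

The stuck sub-agent is cancelled at its own timeout and reported as a failure, while the healthy one’s result is still available for the orchestrator to use or to retry around.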

There’s still a gap I’d like to close: nobody has publicly measured repetition rate, tool-call success, and expert-shift for INT4 vs. NVFP4 vs. FP16 on the same model with the same decoding. That’s the controlled comparison my next post is meant to run.