Hunting the Repetition Loop in a Self-Hosted LLM Agent

When the Agent Kept Repeating Itself: At first I thought a request had hung. Where a tool call should have been, the model was instead generating its way toward max_tokens and getting nowhere: sometimes repeating the same sentence over and over, other times producing low-value filler that never resolved into the JSON the tool call needed. Either way it would burn through the token budget, occasionally time out, and take the whole agent loop down with it. ...

June 23, 2026 · nbdawn

LLMs: The Things We've Been Overlooking

“What temperature are you using?” If someone asks, what do you say? “The default.” “0.7.” “I don’t know. Does it matter?” Most answers fall into one of those three. And if you try to justify the answer, you run out of words fast. That’s how we use LLMs. We call the APIs every day: stuff prompts into messages, send them off, get responses. But when the question becomes “What does Temperature actually do?”, “How is Top-P different from Temperature?”, “Does Prompt Caching just work if you turn it on?”, or “Will hallucinations go away with a better model?”, the answers get fuzzy. ...

April 12, 2026 · nbdawn

Why My Second GPU Is Lazy: From PCIe to NVLink, Understanding x86 I/O Bottlenecks

Introduction: Put two identical GPUs in the same machine, run the same workload on both, and the second one will often lag. Same model, same driver, same data, yet different throughput. It is not thermals or a bad BIOS profile. The second GPU is being starved at the bus level, and the reason has nothing to do with the card itself. Most of us live on top of drivers and kernel modules and never need to look down at how x86 systems actually move bytes between the CPU, RAM, and PCIe devices. But the moment you start debugging throughput asymmetry, tuning interrupt affinity, or wondering why irqaffinity matters, hardware topology stops being an abstraction. ...

January 2, 2026 · nbdawn