Benchmarks Are Breaking: Evaluate LLMs in Your Harness, Not Theirs
LLM benchmark scores are failing under real-world conditions, so choose and tune models by testing them in your own harness with controlled tool and web access. Recent releases keep resetting the leaderboard, yet no model wins everything; the rapid release cadence and diverging costs demand context-specific picks, as seen in the side-by-side analyses from the Kilo team and Clarifai's 2026 guide ([Benchmarking the Benchmarks](https://blog.kilo.ai/p/benchmarking-the-benchmarks-new-gpt), [MiniMax M2.5 vs GPT‑5.2 vs Claude Opus 4.6 vs Gemini 3.1 Pro](https://www.clarifai.com/blog/minimax-m2.5-vs-gpt-5.2-vs-claude-opus-4.6-vs-gemini-3.1-pro)).

Anthropic's engineers documented real contamination and even eval awareness on BrowseComp, where Claude Opus 4.6 identified the benchmark itself, undermining the reliability of static, web-enabled tests ([Eval awareness in Claude Opus 4.6's BrowseComp performance](https://www.anthropic.com/engineering/eval-awareness-browsecomp)). Harness design changes outcomes dramatically: one report saw the same model swing from 78% to 42% when moved between environments, reinforcing that tool access, memory, and isolation drive results, not just the model label ([harness audit and prompts](https://natesnewsletter.substack.com/p/same-model-78-vs-42-the-harness-made)).

Bigger models will not fix flaky terminal agents either; reliability depends on constrained execution, telemetry, and task design, pushing teams toward bespoke evals like SWE‑rebench and pragmatic agent limits ([Bigger Models Won't Fix Terminal Agents](https://hackernoon.com/bigger-models-wont-fix-terminal-agents), [JetBrains Research podcast with SWE‑rebench discussion](https://www.youtube.com/watch?v=-G3e0qffIPE&t=2020s&pp=ygURU1dFLWJlbmNoIHJlc3VsdHM%3D)).
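The "your harness, not theirs" advice can be made concrete. Below is a minimal sketch of a bespoke eval harness: the same fixed task set is run against interchangeable model adapters under identical conditions, with a per-task tool allowlist and a telemetry log, so score differences reflect the model rather than the environment. Everything here is hypothetical scaffolding, not any vendor's API: `ModelFn`, `Task`, `Harness`, and the stub models are assumed names, and real adapters would wrap actual SDK calls and a real grader.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical model interface: takes a prompt plus the tools the harness
# permits, returns an answer string. Real adapters would wrap vendor SDKs.
ModelFn = Callable[[str, frozenset], str]

@dataclass
class Task:
    prompt: str
    expected: str                    # graded by exact match, for simplicity
    tools: frozenset = frozenset()   # tool allowlist: the harness, not the
                                     # model, decides what is reachable

@dataclass
class Harness:
    """Runs one task set against any model under identical conditions."""
    tasks: list
    log: list = field(default_factory=list)

    def evaluate(self, name: str, model: ModelFn) -> float:
        passed = 0
        for task in self.tasks:
            answer = model(task.prompt, task.tools)
            ok = answer.strip() == task.expected
            passed += ok
            self.log.append((name, task.prompt, ok))  # per-task telemetry
        return passed / len(self.tasks)

# Stub "models" standing in for real API adapters.
def model_a(prompt, tools):
    return {"2+2?": "4", "capital of France?": "Paris"}.get(prompt, "?")

def model_b(prompt, tools):
    return {"2+2?": "4"}.get(prompt, "?")

tasks = [Task("2+2?", "4"), Task("capital of France?", "Paris")]
harness = Harness(tasks)
print(harness.evaluate("model_a", model_a))  # 1.0
print(harness.evaluate("model_b", model_b))  # 0.5
```

Because the task set, grader, and tool allowlist are pinned in one place, swapping the model callable is the only variable that changes between runs, which is exactly the isolation the 78%-vs-42% harness audit argues benchmarks lack.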