SWE-BENCH PUB_DATE: 2026.04.15

YOUR AGENT BENCHMARKS ARE PROBABLY HACKABLE — TREAT EVALUATION AS A SECURITY SURFACE

Researchers show top AI agent benchmarks can be gamed to near-perfect scores without solving tasks, and propose better auditing and behavior standards. UC Berk...

Your Agent Benchmarks Are Probably Hackable — Treat Evaluation as a Security Surface

Researchers show top AI agent benchmarks can be gamed to near-perfect scores without solving tasks, and propose better auditing and behavior standards.

UC Berkeley’s RDI team detailed working exploits that achieve near-100% on major agent benchmarks without real task work, including SWE-bench, WebArena, OSWorld, GAIA, and Terminal-Bench, with a public tool to reproduce issues (blog, code). They highlight real-world score inflation and grader hacks already observed in recent evaluations.

The broader picture matches the 2026 Stanford AI Index: capability reporting races ahead while responsible benchmarking and transparency lag report. Meanwhile, a new BotConduct standard scored 108 bots on concrete behavior and found 30% acted hostile under test results.

Related signals: a workshop paper probes how AGENTS.md guidance can shift coding-agent efficiency ICSE listing, and media coverage notes prompt phrasing can bias chatbots toward agreement over accuracy TechRadar.

[ WHY_IT_MATTERS ]
01.

Model and agent choices based on leaderboards may be wrong if evaluation harnesses are exploitable.

02.

Production agents and crawlers need measurable behavior guarantees to pass audits and avoid abuse complaints.

[ WHAT_TO_TEST ]
  • terminal

    Run the RDI trustworthy-env exploit scans against your internal eval harnesses and any third-party agent runs you publish.

  • terminal

    Stage your bots and agents behind BotConduct tests to measure robots compliance, throttling, and opt-out handling under adversarial scenarios.

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Re-audit past evaluation results and re-score with patched, hermetic sandboxes and randomized graders; share deltas with stakeholders.

  • 02.

    Lock down CI/eval hosts: remove access to gold data, use read-only datasets and tmpfs workdirs, and log attempts to tamper with graders.

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design evals as security-critical from day one: ephemeral containers, strict namespaces, taint-checked I/O, and secrets isolated from runtimes.

  • 02.

    Bake behavior tests into agent SLOs using a standard like BotConduct and publish a conduct page with a working contact endpoint.

SUBSCRIBE_FEED
Get the digest delivered. No spam.