howtonotcode.com

SWE-bench

Term

SWE-bench is a benchmark that evaluates language models and coding agents on real-world software engineering tasks, asking them to resolve actual GitHub issues and validating the patches against the repository's test suite.

13 stories · First seen: 2026-01-06 · Last seen: 2026-03-03 · Website · Wikipedia

Stories


Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check

Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination and flawed tests that can mislead real-world adoption. Alibaba’s Qwen 3.5 family uses a sparse MoE design (397B total/17B active), ships open weights under Apache 2.0, and shows strong instruction following and competitive coding scores in public benchmarks, with setup guidance and comparisons to frontier models detailed in this deep-dive guide [Qwen 3.5: The Complete Guide](https://techie007.substack.com/p/qwen-35-the-complete-guide-benchmarks). MiniMax’s latest model claims state-of-the-art coding and agentic performance, faster task completion, and ultra-low runtime cost (about $1/hour at 100 tok/s), alongside reported scores on coding and browsing evaluations [MiniMax-M2.5 on Hugging Face](https://huggingface.co/unsloth/MiniMax-M2.5). OpenAI, however, reports that many SWE-bench Verified tasks have broken tests and that major models were trained on benchmark solutions, halting its use of the metric and urging caution in interpreting scores [OpenAI Abandons SWE-bench Verified](https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests). For quick, low-cost trials of multiple “top models,” a short explainer points to an Alibaba Cloud coding plan bundling popular options [This $3 AI Coding Plan Gives You Every Top Model You Need](https://www.youtube.com/watch?v=Qnz7S-5fzWo&pp=ygUXbmV3IEFJIG1vZGVsIGZvciBjb2RpbmfSBwkJrgoBhyohjO8%3D).

2026-03-03
qwen-35 alibaba alibaba-cloud minimax-m25 openai

Agentic coding meets reality: benchmarks expose gaps, runtime tracing narrows them

New evidence shows LLMs still struggle with production-grade observability and cross-cutting tasks, but agentic workflows augmented with runtime facts significantly improve reliability and speed. An independent SRE benchmark, [OTelBench](https://www.freep.com/press-release/story/145971/quesma-releases-otelbench-independent-benchmark-reveals-frontier-llms-struggle-with-real-world-sre-tasks/), finds frontier models pass only 29% of OpenTelemetry instrumentation tasks across 11 languages, with context propagation as a key failure mode despite much higher scores on coding-only tests. In contrast, Syncause boosted SWE-bench Verified fixes to 83.4% by adding dynamic tracing “Runtime Facts” to the Live-SWE-agent with Gemini 3 Pro, detailing methods and open-sourcing trajectories and code in their [blog](https://syn-cause.com/blog/swe-bench-verified-83) and [repo](https://github.com/Syncause/syncause-swebench). Complementing this, new research on cross-domain workflow generation proposes a decompose–recompose–decide method that surpasses 20-iteration refinement baselines in a single pass, reducing latency and cost for agentic orchestration ([paper](https://arxiv.org/html/2602.11114v1)). For hands-on adoption, the open-source [DeepCode](https://github.com/HKUDS/DeepCode) project provides multi-agent “Text2Backend” capabilities to prototype structured, telemetry-aware coding agents.
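The core idea behind "Runtime Facts" is to feed observed execution behavior into the agent's context rather than static source alone; Syncause's actual harness is in their linked repo. Purely as an illustration of collecting runtime facts, here is a minimal Python sketch using the standard library's `sys.settrace` (the function and field names are hypothetical):

```python
import sys

def collect_runtime_facts(fn, *args, **kwargs):
    """Run fn under a tracer, recording each function entered and its
    argument bindings -- a toy version of 'runtime facts' that an agent
    could read alongside the static source."""
    facts = []

    def tracer(frame, event, arg):
        if event == "call":
            facts.append({
                "function": frame.f_code.co_name,
                "locals": dict(frame.f_locals),  # argument bindings at entry
            })
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args, **kwargs)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, facts

def apply_discount(price, rate):
    return price - price * rate

result, facts = collect_runtime_facts(apply_discount, 100.0, 0.2)
```

In a real agent loop, a digest of `facts` would be appended to the model's prompt next to the failing test output, which is the kind of signal static-only agents lack.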

2026-02-12
quesma otelbench opentelemetry google-gemini-3-pro syncause

GLM-5 and MiniMax M2.5 push low-cost, agentic coding into production range

Two Chinese releases—Zhipu AI’s GLM-5 and MiniMax M2.5—signal a shift toward affordable, agentic coding models that challenge frontier systems on practical benchmarks. Zhipu AI’s GLM-5 is positioned as an MIT-licensed open model with a native Agent Mode that rivals proprietary leaders on multiple benchmarks, with a deep-dive detailing its pre-launch appearance under a pseudonym and hints from vLLM pull requests ([official overview](https://z.ai/blog/glm-5?_bhlid=d84a093754c9e11cb0d2e9ff416fd99cb5f0e2da), [leak analysis](https://medium.com/reading-sh/glm-5-chinas-745b-parameter-open-source-model-that-leaked-before-it-launched-b2cfbafe99ef?source=rss-8af100df272------2), [weights claim](https://medium.com/ai-software-engineer/glm-5-arrive-with-a-bang-from-vibe-coding-to-agentic-engineering-disrupts-opus-b2b13f02b819)). MiniMax’s M2.5 posts strong results on coding and agentic tasks—80.2% SWE-Bench Verified, 51.3% Multi-SWE-Bench, 76.3% BrowseComp—while running 37% faster than M2.1 and costing roughly $1/hour at 100 tokens/sec (or $0.30/hour at 50 tps), with speed reportedly matching Claude Opus 4.6 ([release details](https://www.minimax.io/news/minimax-m25)). For developer workflows, quick-start videos show GLM-5 (and similarly Kimi K2.5) slotting into Claude Code with minimal setup, lowering trial friction inside existing IDEs ([GLM-5 with Claude Code](https://www.youtube.com/watch?v=Ey-HW-nJBiw&pp=ygURQ3Vyc29yIElERSB1cGRhdGU%3D), [Kimi K2.5 with Claude Code](https://www.youtube.com/watch?v=yZtLwOhmHps&pp=ygURQ3Vyc29yIElERSB1cGRhdGU%3D)).
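The quoted hourly prices convert into the more comparable dollars-per-million-tokens figure with simple arithmetic (inputs taken from the release details above):

```python
def dollars_per_million_tokens(dollars_per_hour, tokens_per_second):
    """Convert an hourly serving price at a given throughput into a
    $-per-million-tokens figure."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

fast = dollars_per_million_tokens(1.00, 100)  # ~$2.78 per M tokens
slow = dollars_per_million_tokens(0.30, 50)   # ~$1.67 per M tokens
```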

2026-02-12
zhipu-ai glm-5 minimax minimax-m25 openrouter

OpenAI ships GPT-5.3-Codex into IDEs, terminals, web, and a macOS app

OpenAI launched GPT-5.3-Codex, a faster coding model now embedded in IDEs, the terminal, web, and a macOS app, with early claims it assisted in building itself. OpenAI details ~25% faster runs, stronger SWE-Bench/Terminal-Bench results, and broad distribution via CLI, IDE extensions, web, and a new macOS app in the announcement [Introducing GPT‑5.3‑Codex](https://openai.com/index/introducing-gpt-5-3-codex/)[^1]. Coverage notes all paid ChatGPT plans can access it now, API access is coming, and the team used Codex to debug, manage deployment, and evaluate results during its own development [TechRadar report](https://www.techradar.com/pro/openai-unveils-gpt-5-3-codex-which-can-tackle-more-advanced-and-complex-coding-tasks)[^2], with additional workflow and positioning details on distribution and SDLC scope [AI News Hub](https://www.chatai.com/posts/openai-pushes-codex-deeper-into-developer-workflows-with-gpt-5-3-codex-release)[^3]. [^1]: Adds: Official feature, performance, and distribution overview. [^2]: Adds: Access paths (paid ChatGPT plans), benchmarks, and "built itself" context. [^3]: Adds: Deeper coverage of IDE/CLI/macOS integration, speedup figure, and API timing.

2026-02-07
openai gpt-53-codex chatgpt codex-macos-app gpt-5-3-codex

Mixture-of-Models router tops single LLMs on SWE-Bench Verified (75.6%)

A lightweight router that clusters tasks and selects the historically best model per cluster hit 75.6% on SWE-Bench Verified, narrowly outperforming top single-model baselines (~74%). Details and methodology are outlined in Nordlys Labs' write-up, including semantic clustering and per-cluster success routing without test-time search or repo execution [Nordlys Labs blog](https://nordlyslabs.com/blog/hypernova)[^1]. The open-source framework implementing this mixture-of-models approach is available here [Nordlys GitHub](https://github.com/Nordlys-Labs/nordlys)[^2]. [^1]: Adds: methodology, routing design, and reported benchmark results. [^2]: Adds: production-ready code for the router and integrations.
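As a rough illustration of the routing idea the write-up describes (semantic clustering plus per-cluster success statistics, no test-time search), here is a toy Python sketch; the class, the stub embedding, and all names are invented for illustration, not Nordlys' implementation:

```python
from collections import defaultdict

class MixtureOfModelsRouter:
    """Toy router: bucket a task into its nearest cluster, then send it
    to the model with the best historical success rate there."""

    def __init__(self, embed, centroids):
        self.embed = embed           # task text -> feature vector
        self.centroids = centroids   # cluster name -> centroid vector
        # cluster -> model -> [solved, attempted]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def _cluster(self, task):
        v = self.embed(task)
        return min(self.centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(v, self.centroids[c])))

    def record(self, task, model, solved):
        tally = self.stats[self._cluster(task)][model]
        tally[0] += int(solved)
        tally[1] += 1

    def route(self, task, default):
        per_model = self.stats[self._cluster(task)]
        if not per_model:
            return default  # no history for this cluster yet
        return max(per_model, key=lambda m: per_model[m][0] / per_model[m][1])

# Stand-in for a semantic embedding; a real system would use sentence vectors.
embed = lambda task: [len(task) % 7, task.count("test")]
router = MixtureOfModelsRouter(embed, {"bugfix": [0, 2], "refactor": [6, 0]})
router.record("fix failing test test", "model-x", True)
router.record("fix failing test test", "model-y", False)
```

The design choice worth noting is that routing uses only historical outcomes, so inference cost is one model call per task, unlike test-time search over multiple candidates.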

2026-02-07
nordlys-labs nordlys swe-bench swe-bench-verified llm-routing

Reports on Claude Sonnet 5’s SWE-bench leap and the rising value of context engines

Early reports put Anthropic’s new Claude Sonnet 5 at 82.1% on SWE-bench with a 1M-token context, positioning it as a top coding agent for multi-repo workstreams [Vertu review](https://vertu.com/ai-tools/claude-sonnet-5-released-the-fennec-leak-antigravity-support-and-the-new-swe-bench-sota/?srsltid=AfmBOootYl50lkFfR364PidEU5-t-oscjkVho1kk36G3wJVnw2snSoQG)[^1] and drawing early hands-on validation from the community [early test video](https://www.youtube.com/watch?v=_87CirMQ1FM&pp=ygUXbmV3IEFJIG1vZGVsIGZvciBjb2Rpbmc%3D)[^2]. Independent evals also show the context layer matters as much as the model: a Claude Sonnet 4.5 agent augmented with Bito’s AI Architect context engine hit 60.8% on SWE-Bench Pro vs. 43.6% baseline (a 39% relative gain) [AI-Tech Park](https://ai-techpark.com/bitos-ai-architect-achieves-highest-success-rate-of-60-8-on-swe-bench-pro/)[^3]. Meanwhile, Anthropic committed to keeping Claude ad-free, underscoring enterprise trust and reducing incentive risks in assistant-driven workflows [Anthropic announcement](https://www.anthropic.com/news/claude-is-a-space-to-think)[^4]. [^1]: Roundup of Sonnet 5 claims (SWE-bench score, long context) and deployment notes. [^2]: Practitioner-level early testing and impressions on capabilities/cost. [^3]: Third-party evaluation showing large gains from a codebase knowledge graph context engine. [^4]: Official policy stance on ad-free Claude, relevant for compliance and procurement.
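The "39% relative gain" figure follows directly from the two reported pass rates:

```python
baseline = 43.6      # Claude Sonnet 4.5 agent alone, SWE-Bench Pro (%)
with_context = 60.8  # same agent plus Bito's context engine (%)

absolute_gain = with_context - baseline         # 17.2 percentage points
relative_gain = absolute_gain / baseline * 100  # ~39.4% relative improvement
```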

2026-02-04
anthropic claude claude-sonnet-5 bito ai-architect

AI coding agents: benchmarks mislead—separate generation from review

Benchmarks like SWE-bench reward pass/fail test outcomes, not maintainability or security, creating a false sense of readiness for AI-generated code; leaders should decouple "bookkeeping" (generation) from "auditing" with independent review gates and specialized tooling [Benchmarks Are Making AI Coding Look Safer Than It Is](https://deepengineering.substack.com/p/benchmarks-are-making-ai-coding-look)[^1]. In practice, agents already excel at tireless refactors and boilerplate, shifting the bottleneck from typing to ideation—use them for bulk fixes while tightening review policies and prompts [Six reasons to use coding agents](https://www.infoworld.com/article/4126558/six-reasons-to-use-coding-agents.html)[^2]. Practitioners also advocate simple, bash-first harnesses to contain agent workflows and reduce risk in CI/CD, avoiding “agent sprawl” and keeping orchestration deterministic [Pi – The AI Harness That Powers OpenClaw](https://www.youtube.com/watch?v=AEmHcFH1UgQ&pp=ygUYQUkgY29kaW5nIGFnZW50IHdvcmtmbG93)[^3]. [^1]: Explains why SWE-bench over-indexes on code generation, highlights review fatigue/quality rot, and argues for independent auditing (includes Qodo perspective). [^2]: Details concrete strengths of coding agents (repetitive tasks, speed, idea throughput) and how they change developer workflows. [^3]: Discusses risks of agents, “Bash is all you need,” and harnessed workflows to adapt safely within CI/CD.
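The decoupling argument can be made concrete: generation proposes a patch, and an independent gate with its own checks decides whether it merges, so the generator never approves its own output. A minimal Python sketch (the check names and patch fields are hypothetical, not any specific tool's API):

```python
def audit_gate(patch, checks):
    """Independent review gate: every named check must pass on the
    generated patch, or the patch is blocked with the failing names."""
    failures = [name for name, check in checks if not check(patch)]
    return len(failures) == 0, failures

# Checks a reviewer-side harness might enforce (illustrative only):
checks = [
    ("no_hardcoded_secrets", lambda p: "AKIA" not in p["diff"]),
    ("touches_tests", lambda p: p["touched_tests"]),
    ("reviewably_small", lambda p: p["lines_changed"] <= 400),
]

approved, reasons = audit_gate(
    {"diff": "fix: handle None input", "touched_tests": True, "lines_changed": 12},
    checks,
)
```

Keeping the gate's checks outside the generation loop is what makes this an audit rather than self-grading, which is the article's central point.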

2026-02-04
qodo ai-coding-agents code-quality ci-cd bash

Mixture-of-Models routing tops single LLMs on SWE-Bench via task specialization

A lightweight Mixture-of-Models router that assigns issues to semantic clusters and routes to the historically strongest model per cluster hit 75.6% on SWE-Bench, edging past single-model baselines (~74%) by exploiting complementary strengths rather than defaulting to the top aggregate model [Reddit summary](https://www.reddit.com/r/LocalLLaMA/comments/1qvm0ft/mixtureofmodels_routing_beats_single_llms_on/)[^1]. The authors share a methodology write-up and an open-source framework so teams can reproduce the gating approach without test-time search or repo execution [methodology blog](https://nordlyslabs.com/blog/hypernova)[^2] and [framework code](https://github.com/Nordlys-Labs/nordlys)[^3]. [^1]: Highlights task-level specialization on SWE-Bench and the routing approach with reported results. [^2]: Details the clustering, per-model success statistics, and routing mechanism. [^3]: Provides the open-source implementation for building a MoM router.

2026-02-04
nordlys nordlys-labs swe-bench mixture-of-models model-routing

Coding agents: smarter context and sequential planning beat model-only upgrades

Third‑party tests show Bito’s AI Architect lifted a Claude Sonnet 4.5 agent to 60.8% on SWE‑Bench Pro by adding MCP‑delivered codebase intelligence—up from 43.6% without it—with large gains across UI/UX, performance, critical, and security bugs ([Bito’s results](https://www.tipranks.com/news/private-companies/bitos-ai-architect-sets-new-swe-bench-pro-high-underscoring-strategic-edge-in-enterprise-coding-agents)[^1]). In parallel, a sequential plan‑reflection research agent (“Deep Researcher”) outperformed peers on DeepResearch Bench, indicating orchestration and iterative context refinement can outpace parallel scaling alone ([Deep Researcher](https://quantumzeitgeist.com/deep-researcher-achieves-phd-level-reports/)[^2]). [^1]: Independent evaluation by The Context Lab holding the model constant; details on SWE‑Bench Pro lift and task‑level gains via MCP-based context. [^2]: Explains sequential plan‑reflection and candidates crossover, with benchmark results vs. other research agents.

2026-02-03
bito bito-ai-architect claude-sonnet-45 the-context-lab deep-researcher

E2E coding agents: 27% pass, cheaper scaling, and safer adoption

A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports only 27.38% acceptance for agent-built repos, highlighting gaps in system design, complexity, and resource management. Efficiency is improving: [SWE-Replay](https://quantumzeitgeist.com/17-4-percent-performance-swe-replay-achieves-gain-efficient/)[^3] recycles prior agent trajectories to cut test-time compute by up to 17.4% while maintaining or slightly improving fix rates. For evaluation and safety, Together AI shows open LLM judges can beat GPT‑5.2 on preference alignment ([post](https://www.together.ai/blog/fine-tuning-open-llm-judges-to-outperform-gpt-5-2at/))[^4], Java teams get a pragmatic path via [ASTRA‑LangChain4j](https://quantumzeitgeist.com/ai-astra-langchain4j-achieves-llm-integration/)[^5], and an open‑weight coding LM targets agentic/local dev ([Qwen3‑Coder‑Next](https://www.youtube.com/watch?v=UwVi2iu-xyA&pp=ygURU1dFLWJlbmNoIHJlc3VsdHM%3D))[^6]. [^1]: Adds: defines an E2E agent benchmark with architecture, correctness, and refinement criteria plus pass-rate findings. [^2]: Adds: benchmark repository for tasks, harnesses, and evaluation assets. [^3]: Adds: test-time scaling via trajectory replay with up to 17.4% cost reduction and small performance gains on SWE-Bench variants. [^4]: Adds: DPO-tuned open "LLM-as-judge" models outperform GPT‑5.2 on RewardBench 2 preference alignment, with code/how-to. [^5]: Adds: Java integration pattern for agent+LLM via ASTRA modules and LangChain4J, including BeliefRAG and Maven packaging. [^6]: Adds: open-weight coding model positioned for agentic workflows and local development.
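Trajectory replay, as described for SWE-Replay, amounts to reusing the action sequence from a prior agent run on a sufficiently similar issue instead of planning every step from scratch. As a toy illustration only (the similarity key and API here are invented, not the paper's method):

```python
class TrajectoryReplayCache:
    """Toy trajectory reuse: remember the action sequence from a prior
    run, and replay it for similar new issues instead of paying for
    full re-planning."""

    def __init__(self):
        self._cache = {}

    @staticmethod
    def signature(issue_text):
        # Stand-in similarity key; a real system might bucket embeddings.
        return frozenset(w for w in issue_text.lower().split() if len(w) > 3)

    def store(self, issue_text, actions):
        self._cache[self.signature(issue_text)] = list(actions)

    def lookup(self, issue_text, min_overlap=0.5):
        sig = self.signature(issue_text)
        best, best_score = None, 0.0
        for cached_sig, actions in self._cache.items():
            union = sig | cached_sig
            score = len(sig & cached_sig) / len(union) if union else 0.0
            if score >= min_overlap and score > best_score:
                best, best_score = actions, score
        return best  # None means: fall back to full planning

cache = TrajectoryReplayCache()
cache.store("fix null pointer in parser module",
            ["open parser.py", "apply patch", "run tests"])
```

The compute saving comes from the cache hit path skipping model calls entirely; a miss degrades gracefully to the normal agent loop.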

2026-02-03
projdevbench swe-replay swe-bench-verified swe-bench-pro astra

Agentic AI: architecture patterns and what to measure before you ship

A new survey consolidates how LLM-based agents are built—policy/LLM core, memory, planners, tool routers, and critics—plus orchestration choices (single vs multi-agent) and deployment modes. It highlights practical trade-offs (latency vs accuracy, autonomy vs control) and evaluation pitfalls like hidden costs from retries and context growth, and the need for guardrails around tool actions. Benchmarks such as WebArena, ToolBench, SWE-bench, and GAIA illustrate task design and measurement under real constraints.
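The components the survey enumerates compose naturally into a bounded single-agent loop. A minimal Python sketch with stand-in callables (all interfaces here are hypothetical):

```python
def run_agent(task, llm, tools, max_steps=5):
    """Minimal single-agent loop wiring together the survey's parts:
    an LLM policy core, episodic memory, a tool router, and a step
    budget as a crude guardrail around tool actions."""
    memory = []                               # episodic memory (grows per step)
    for _ in range(max_steps):                # bounded autonomy
        action = llm(task, memory)            # policy core proposes next step
        if action["type"] == "finish":
            return action["answer"]
        observation = tools[action["tool"]](action["input"])  # tool router
        memory.append((action, observation))  # retries and context growth cost tokens
    return None                               # budget exhausted: fail closed

# Stub policy: call a calculator once, then answer from memory.
def stub_llm(task, memory):
    if not memory:
        return {"type": "tool", "tool": "calc", "input": "6*7"}
    return {"type": "finish", "answer": memory[-1][1]}

answer = run_agent("what is 6*7?", stub_llm, {"calc": lambda expr: eval(expr)})
```

The `max_steps` cap and the fail-closed return are where the survey's latency-vs-accuracy and autonomy-vs-control trade-offs surface in practice: every extra loop iteration replays the grown context through the model.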

2026-01-06
llm-agents tool-calling rag swe-bench webarena