SWE-bench
SWE-bench is a benchmark that evaluates how well AI models and coding agents resolve real-world software engineering issues drawn from GitHub repositories.
10 stories
First: 2026-01-06
Last: 2026-04-17
Stories
Completed digest stories linked to this service.
- Your Agent Benchmarks Are Probably Hackable — Treat Evaluation as a Security Sur... (2026-04-15)
  Researchers show top AI agent benchmarks can be gamed to near-perfect scores without solving tasks, and propos...
- SWE-bench scores are spiking, but variant mix-ups make the leaderboard noisy for... (2026-04-12)
  Vendors are touting big SWE-bench jumps, but versions differ and scores alone won’t pick your coding copilot. ...
- Oracle-SWE dissects the “oracle hints” behind SWE-bench wins, challenging headli... (2026-04-10)
  New research isolates which “oracle” hints actually move SWE-bench agent scores, explaining why headline resul...
- Grounding, Sandboxing, and Streaming: Making AI Agents Production-Ready for Back... (2026-04-08)
  Agentic dev is getting real: context-grounded workflows and faster sandboxes make backend AI agents more relia...
- Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals... (2026-04-08)
  Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public ...
- OpenRouter’s coding leaderboard: free Qwen 3.6 Plus tops usage with 1M context a... (2026-04-06)
  OpenRouter’s latest usage data shows Qwen 3.6 Plus (free) leading coding workloads, with big context, solid re...
- Code agents grow up: CI-scale benchmarking, structured patch checks, and cheaper... (2026-04-02)
  Code agent evaluation is shifting to long-run maintainability, execution-free patch checks, and leaner, cheape...
- Coding-agent benchmarks are wobbling—trust results only after your own cross-con... (2026-03-24)
  SWE-Bench-style coding scores are spiking, but contamination and self-reported leaderboards mean you should tr...
- Coding LLMs, March 2026: default to Sonnet 4.6, escalate to GPT-5.4, watch scaff... (2026-03-22)
  March 2026 coding LLM benchmarks show mid-tier models rival flagships, but scaffolding and cost drive real-wor...
- Cursor ships Composer 2: a cheaper, stronger coding model with a fast default — ... (2026-03-21)
  Cursor launched Composer 2, a cheaper coding model that claims big quality gains and a new fast default varian...