SWE-Bench Verified
Term: A framework for evaluating software engineering tools.
8 stories
First: 2026-02-03
Last: 2026-04-17
Stories
Completed digest stories linked to this service.
- Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals... (2026-04-08)
  Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public ...
- OpenRouter’s coding leaderboard: free Qwen 3.6 Plus tops usage with 1M context a... (2026-04-06)
  OpenRouter’s latest usage data shows Qwen 3.6 Plus (free) leading coding workloads, with big context, solid re...
- Coding LLMs, March 2026: default to Sonnet 4.6, escalate to GPT-5.4, watch scaff... (2026-03-22)
  March 2026 coding LLM benchmarks show mid-tier models rival flagships, but scaffolding and cost drive real-wor...
- New long-horizon benchmarks say coding agents regress under maintenance; treat t... (2026-03-11)
  A new wave of long-horizon benchmarks shows most coding agents ship regressions over time, not just fixes. A ...
- Agents ace one-shot coding, but most break your code over months: time to harden ... (2026-03-10)
  New results say most coding agents cause regressions during long-term CI, and a new MassGen release adds built...
- MiniMax-M2.5 launches with SOTA coding claims; verify SWE-bench results (2026-03-04)
  MiniMax launched MiniMax-M2.5, a fast, low-cost coding and agentic model, but teams should validate its headli...
- Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Chec... (2026-03-03)
  Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of S...
- E2E agentic benchmarks replace SWE-bench; Gemini 3.1 favors deliberation (2026-02-24)
  Agentic coding benchmarks are shifting toward end-to-end app-building tests as SWE-bench Verified is being pha...