SWE-Bench Verified

Term

A framework for evaluating software engineering tools.

article 8 storys calendar_today First: 2026-02-03 update Last: 2026-04-17 open_in_new Website menu_book Wikipedia

Stories

Completed digest stories linked to this service.

Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals...

2026-04-08

Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public ...
OpenRouter’s coding leaderboard: free Qwen 3.6 Plus tops usage with 1M context a...

2026-04-06

OpenRouter’s latest usage data shows Qwen 3.6 Plus (free) leading coding workloads, with big context, solid re...
Coding LLMs, March 2026: default to Sonnet 4.6, escalate to GPT-5.4, watch scaff...

2026-03-22

March 2026 coding LLM benchmarks show mid-tier models rival flagships, but scaffolding and cost drive real-wor...
New long-horizon benchmarks say coding agents regress under maintenance; treat t...

2026-03-11

A new wave of long-horizon benchmarks shows most coding agents ship regressions over time, not just fixes. A ...
Agents ace one-shot coding, but most break your code over months—time to harden ...

2026-03-10

New results say most coding agents cause regressions during long-term CI, and a new MassGen release adds built...
MiniMax-M2.5 launches with SOTA coding claims; verify SWE-bench results

2026-03-04

MiniMax launched MiniMax-M2.5, a fast, low-cost coding and agentic model, but teams should validate its headli...
Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Chec...

2026-03-03

Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of S...
E2E agentic benchmarks replace SWE-bench; Gemini 3.1 favors deliberation

2026-02-24

Agentic coding benchmarks are shifting toward end-to-end app-building tests as SWE-bench Verified is being pha...