SWE-Bench Pro
Term: A framework for software engineering benchmarking and evaluation.
10 stories
First: 2026-02-03
Last: 2026-06-17
Stories
Completed digest stories linked to this service.
- Harder, real‑world benchmarks land for coding agents (2026-06-17): Terminal-Bench 2.0 and new SWE-Bench variants push coding-agent evaluation toward harder, real-world tasks. T...
- Terminal-Bench 2.0 shows coding agents still stumble on real CLI work (2026-06-04): Terminal-Bench 2.0 introduced a tougher CLI benchmark and found frontier agents still score under 65% on real ...
- Context beats model: a cheap agent tops SWE-bench Verified (2026-05-09): A low-cost model paired with richer repo-aware context just topped SWE-bench Verified, showing agent wiring ca...
- Eval-Ops gets concrete: Snowflake DARE-Bench and Terminal-Bench 2.0 make agent r... (2026-05-08): New deterministic agent benchmarks, Snowflake's DARE-Bench and Terminal-Bench 2.0, are shifting model select...
- Agentic coding is moving from hype to practice: design for reliability, governanc... (2026-04-22): Agentic coding is leaving the demo phase, forcing teams to engineer for reliability, governance, and real resu...
- Anthropic ships Claude Opus 4.7: stronger agentic coding, stricter prompts, and ... (2026-04-20): Anthropic released Claude Opus 4.7 with big gains in agent coding, tighter instruction-following, and a resear...
- SWE-bench scores are spiking, but variant mix-ups make the leaderboard noisy for... (2026-04-12): Vendors are touting big SWE-bench jumps, but versions differ and scores alone won’t pick your coding copilot. ...
- Anthropic’s Mythos and Project Glasswing push AI into real-world vuln discovery,... (2026-04-09): Anthropic launched Project Glasswing and a Mythos Preview model that finds serious software bugs, pairing indu...
- Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals... (2026-04-08): Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public ...
- SWE-Bench Pro leaderboard: small gains at the top, big contexts, and mostly self... (2026-04-04): A new SWE-Bench Pro leaderboard shows top code models clustered around 0.55–0.58, with large contexts and self...