SWE-Bench Pro

Term

A framework for software engineering benchmarking and evaluation.

10 stories. First: 2026-02-03. Last: 2026-06-17.

Stories

Completed digest stories linked to this service.

Harder, real‑world benchmarks land for coding agents

2026-06-17

Terminal-Bench 2.0 and new SWE-Bench variants push coding-agent evaluation toward harder, real-world tasks. T...

Terminal-Bench 2.0 shows coding agents still stumble on real CLI work

2026-06-04

Terminal-Bench 2.0 introduced a tougher CLI benchmark and found frontier agents still score under 65% on real ...

Context beats model: a cheap agent tops SWE-bench Verified

2026-05-09

A low-cost model paired with richer repo-aware context just topped SWE-bench Verified, showing agent wiring ca...

Eval-Ops gets concrete: Snowflake DARE-Bench and Terminal-Bench 2.0 make agent r...

2026-05-08

New deterministic agent benchmarks, Snowflake's DARE-Bench and Terminal-Bench 2.0, are shifting model select...

Agentic coding is moving from hype to practice: design for reliability, governanc...

2026-04-22

Agentic coding is leaving the demo phase, forcing teams to engineer for reliability, governance, and real resu...

Anthropic ships Claude Opus 4.7: stronger agentic coding, stricter prompts, and ...

2026-04-20

Anthropic released Claude Opus 4.7 with big gains in agent coding, tighter instruction-following, and a resear...

SWE-bench scores are spiking, but variant mix-ups make the leaderboard noisy for...

2026-04-12

Vendors are touting big SWE-bench jumps, but versions differ and scores alone won’t pick your coding copilot. ...

Anthropic’s Mythos and Project Glasswing push AI into real-world vuln discovery,...

2026-04-09

Anthropic launched Project Glasswing and a Mythos Preview model that finds serious software bugs, pairing indu...

Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals...

2026-04-08

Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public ...

SWE-Bench Pro leaderboard: small gains at the top, big contexts, and mostly self...

2026-04-04

A new SWE-Bench Pro leaderboard shows top code models clustered around 0.55–0.58, with large contexts and self...