SWE-Bench Verified
Term: A framework for evaluating software engineering tools.
10 stories
First: 2026-02-03
Last: 2026-06-17
Stories
Completed digest stories linked to this service.
- Context beats model: a cheap agent tops SWE-bench Verified (2026-05-09): A low-cost model paired with richer repo-aware context just topped SWE-bench Verified, showing agent wiring ca...
- Promptfoo joins OpenAI with a practical playbook for evaluating coding agents (2026-04-29): Promptfoo is now part of OpenAI and published a hands-on guide that reframes how to evaluate coding agents in ...
- DeepSeek V4 shows up near the top of SWE‑Bench Verified at lower cost (2026-04-26): DeepSeek V4 preview models landed high on SWE-Bench Verified, offering near-SOTA scores with 1M context at a l...
- Agentic coding grows up: domain-grounded agents and verifiable training move fro... (2026-04-24): Agentic coding is shifting from generic code suggestions to domain-verified systems that generate validated, p...
- Open-weight coding models surge: Kimi K2.6 hype, Qwen3.6-27B runs local, Meta po... (2026-04-23): Open-weight coding models jumped forward this week, with Kimi K2.6 hype, a practical Qwen3.6-27B local setup, ...
- Agentic coding is moving from hype to practice—design for reliability, governanc... (2026-04-22): Agentic coding is leaving the demo phase, forcing teams to engineer for reliability, governance, and real resu...
- Anthropic ships Claude Opus 4.7: stronger agentic coding, stricter prompts, and ... (2026-04-20): Anthropic released Claude Opus 4.7 with big gains in agent coding, tighter instruction-following, and a resear...
- Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals... (2026-04-08): Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public ...
- OpenRouter’s coding leaderboard: free Qwen 3.6 Plus tops usage with 1M context a... (2026-04-06): OpenRouter’s latest usage data shows Qwen 3.6 Plus (free) leading coding workloads, with big context, solid re...
- Coding LLMs, March 2026: default to Sonnet 4.6, escalate to GPT-5.4, watch scaff... (2026-03-22): March 2026 coding LLM benchmarks show mid-tier models rival flagships, but scaffolding and cost drive real-wor...