howtonotcode.com

SWE-bench

Term

SWE-bench is a benchmark that evaluates language models and coding agents on real-world software engineering tasks, asking them to resolve actual GitHub issues and validating the patches against the repository's test suite.

13 stories · First seen: 2026-01-06 · Last seen: 2026-03-03 · Website · Wikipedia

Stories


Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check

Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination and flawed tests that can mislead real-world adoption. Alibaba’s Qwen 3.5 family uses a sparse MoE design (397B total/17B active), ships open weights under Apache 2.0, and shows strong instruction following and competitive coding scores in public benchmarks, with setup guidance and comparisons to frontier models detailed in this deep-dive guide [Qwen 3.5: The Complete Guide](https://techie007.substack.com/p/qwen-35-the-complete-guide-benchmarks). MiniMax’s latest model claims state-of-the-art coding and agentic performance, faster task completion, and ultra-low runtime cost (about $1/hour at 100 tok/s), alongside reported scores on coding and browsing evaluations [MiniMax-M2.5 on Hugging Face](https://huggingface.co/unsloth/MiniMax-M2.5). OpenAI, however, reports that many SWE-bench Verified tasks have broken tests and that major models were trained on benchmark solutions, halting its use of the metric and urging caution in interpreting scores [OpenAI Abandons SWE-bench Verified](https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests). For quick, low-cost trials of multiple “top models,” a short explainer points to an Alibaba Cloud coding plan bundling popular options [This $3 AI Coding Plan Gives You Every Top Model You Need](https://www.youtube.com/watch?v=Qnz7S-5fzWo&pp=ygUXbmV3IEFJIG1vZGVsIGZvciBjb2RpbmfSBwkJrgoBhyohjO8%3D).

2026-03-03
qwen-35 alibaba alibaba-cloud minimax-m25 openai

Agentic coding meets reality: benchmarks expose gaps, runtime tracing narrows them

New evidence shows LLMs still struggle with production-grade observability and cross-cutting tasks, but agentic workflows augmented with runtime facts significantly improve reliability and speed. An independent SRE benchmark, [OTelBench](https://www.freep.com/press-release/story/145971/quesma-releases-otelbench-independent-benchmark-reveals-frontier-llms-struggle-with-real-world-sre-tasks/), finds frontier models pass only 29% of OpenTelemetry instrumentation tasks across 11 languages, with context propagation as a key failure mode despite much higher scores on coding-only tests. In contrast, Syncause boosted SWE-bench Verified fixes to 83.4% by adding dynamic tracing “Runtime Facts” to the Live-SWE-agent with Gemini 3 Pro, detailing methods and open-sourcing trajectories and code in their [blog](https://syn-cause.com/blog/swe-bench-verified-83) and [repo](https://github.com/Syncause/syncause-swebench). Complementing this, new research on cross-domain workflow generation proposes a decompose–recompose–decide method that surpasses 20-iteration refinement baselines in a single pass, reducing latency and cost for agentic orchestration ([paper](https://arxiv.org/html/2602.11114v1)). For hands-on adoption, the open-source [DeepCode](https://github.com/HKUDS/DeepCode) project provides multi-agent “Text2Backend” capabilities to prototype structured, telemetry-aware coding agents.
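The core idea behind "Runtime Facts" is to feed observed execution behavior into the agent's context rather than static source alone; Syncause's actual harness is in their linked repo. Purely as an illustration of collecting runtime facts, here is a minimal Python sketch using the standard library's `sys.settrace` (the function and field names are hypothetical):

```python
import sys

def collect_runtime_facts(fn, *args, **kwargs):
    """Run fn under a tracer, recording each function entered and its
    argument bindings -- a toy version of 'runtime facts' that an agent
    could read alongside the static source."""
    facts = []

    def tracer(frame, event, arg):
        if event == "call":
            facts.append({
                "function": frame.f_code.co_name,
                "locals": dict(frame.f_locals),  # argument bindings at entry
            })
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args, **kwargs)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, facts

def apply_discount(price, rate):
    return price - price * rate

result, facts = collect_runtime_facts(apply_discount, 100.0, 0.2)
```

In a real agent loop, a digest of `facts` would be appended to the model's prompt next to the failing test output, which is the kind of signal static-only agents lack.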

2026-02-12
quesma otelbench opentelemetry google-gemini-3-pro syncause

GLM-5 and MiniMax M2.5 push low-cost, agentic coding into production range

Two Chinese releases—Zhipu AI’s GLM-5 and MiniMax M2.5—signal a shift toward affordable, agentic coding models that challenge frontier systems on practical benchmarks. Zhipu AI’s GLM-5 is positioned as an MIT-licensed open model with a native Agent Mode that rivals proprietary leaders on multiple benchmarks, with a deep-dive detailing its pre-launch appearance under a pseudonym and hints from vLLM pull requests ([official overview](https://z.ai/blog/glm-5?_bhlid=d84a093754c9e11cb0d2e9ff416fd99cb5f0e2da), [leak analysis](https://medium.com/reading-sh/glm-5-chinas-745b-parameter-open-source-model-that-leaked-before-it-launched-b2cfbafe99ef?source=rss-8af100df272------2), [weights claim](https://medium.com/ai-software-engineer/glm-5-arrive-with-a-bang-from-vibe-coding-to-agentic-engineering-disrupts-opus-b2b13f02b819)). MiniMax’s M2.5 posts strong results on coding and agentic tasks—80.2% SWE-Bench Verified, 51.3% Multi-SWE-Bench, 76.3% BrowseComp—while running 37% faster than M2.1 and costing roughly $1/hour at 100 tokens/sec (or $0.30/hour at 50 tps), with speed reportedly matching Claude Opus 4.6 ([release details](https://www.minimax.io/news/minimax-m25)). For developer workflows, quick-start videos show GLM-5 (and similarly Kimi K2.5) slotting into Claude Code with minimal setup, lowering trial friction inside existing IDEs ([GLM-5 with Claude Code](https://www.youtube.com/watch?v=Ey-HW-nJBiw&pp=ygURQ3Vyc29yIElERSB1cGRhdGU%3D), [Kimi K2.5 with Claude Code](https://www.youtube.com/watch?v=yZtLwOhmHps&pp=ygURQ3Vyc29yIElERSB1cGRhdGU%3D)).
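The quoted hourly prices convert into the more comparable dollars-per-million-tokens figure with simple arithmetic (inputs taken from the release details above):

```python
def dollars_per_million_tokens(dollars_per_hour, tokens_per_second):
    """Convert an hourly serving price at a given throughput into a
    $-per-million-tokens figure."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

fast = dollars_per_million_tokens(1.00, 100)  # ~$2.78 per M tokens
slow = dollars_per_million_tokens(0.30, 50)   # ~$1.67 per M tokens
```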

2026-02-12
zhipu-ai glm-5 minimax minimax-m25 openrouter

OpenAI ships GPT-5.3-Codex into IDEs, terminals, web, and a macOS app

OpenAI launched GPT-5.3-Codex, a faster coding model now embedded in IDEs, the terminal, web, and a macOS app, with early claims it assisted in building itself. OpenAI details ~25% faster runs, stronger SWE-Bench/Terminal-Bench results, and broad distribution via CLI, IDE extensions, web, and a new macOS app in the announcement [Introducing GPT‑5.3‑Codex](https://openai.com/index/introducing-gpt-5-3-codex/)[^1]. Coverage notes all paid ChatGPT plans can access it now, API access is coming, and the team used Codex to debug, manage deployment, and evaluate results during its own development [TechRadar report](https://www.techradar.com/pro/openai-unveils-gpt-5-3-codex-which-can-tackle-more-advanced-and-complex-coding-tasks)[^2], with additional workflow and positioning details on distribution and SDLC scope [AI News Hub](https://www.chatai.com/posts/openai-pushes-codex-deeper-into-developer-workflows-with-gpt-5-3-codex-release)[^3]. [^1]: Adds: Official feature, performance, and distribution overview. [^2]: Adds: Access paths (paid ChatGPT plans), benchmarks, and "built itself" context. [^3]: Adds: Deeper coverage of IDE/CLI/macOS integration, speedup figure, and API timing.

2026-02-07
openai gpt-53-codex chatgpt codex-macos-app gpt-5-3-codex

Mixture-of-Models router tops single LLMs on SWE-Bench Verified (75.6%)

A lightweight router that clusters tasks and selects the historically best model per cluster hit 75.6% on SWE-Bench Verified, narrowly outperforming top single-model baselines (~74%). Details and methodology are outlined in Nordlys Labs' write-up, including semantic clustering and per-cluster success routing without test-time search or repo execution [Nordlys Labs blog](https://nordlyslabs.com/blog/hypernova)[^1]. The open-source framework implementing this mixture-of-models approach is available here [Nordlys GitHub](https://github.com/Nordlys-Labs/nordlys)[^2]. [^1]: Adds: methodology, routing design, and reported benchmark results. [^2]: Adds: production-ready code for the router and integrations.
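As a rough illustration of the routing idea the write-up describes (semantic clustering plus per-cluster success statistics, no test-time search), here is a toy Python sketch; the class, the stub embedding, and all names are invented for illustration, not Nordlys' implementation:

```python
from collections import defaultdict

class MixtureOfModelsRouter:
    """Toy router: bucket a task into its nearest cluster, then send it
    to the model with the best historical success rate there."""

    def __init__(self, embed, centroids):
        self.embed = embed           # task text -> feature vector
        self.centroids = centroids   # cluster name -> centroid vector
        # cluster -> model -> [solved, attempted]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def _cluster(self, task):
        v = self.embed(task)
        return min(self.centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(v, self.centroids[c])))

    def record(self, task, model, solved):
        tally = self.stats[self._cluster(task)][model]
        tally[0] += int(solved)
        tally[1] += 1

    def route(self, task, default):
        per_model = self.stats[self._cluster(task)]
        if not per_model:
            return default  # no history for this cluster yet
        return max(per_model, key=lambda m: per_model[m][0] / per_model[m][1])

# Stand-in for a semantic embedding; a real system would use sentence vectors.
embed = lambda task: [len(task) % 7, task.count("test")]
router = MixtureOfModelsRouter(embed, {"bugfix": [0, 2], "refactor": [6, 0]})
router.record("fix failing test test", "model-x", True)
router.record("fix failing test test", "model-y", False)
```

The design choice worth noting is that routing uses only historical outcomes, so inference cost is one model call per task, unlike test-time search over multiple candidates.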

2026-02-07
nordlys-labs nordlys swe-bench swe-bench-verified llm-routing

Reports on Claude Sonnet 5’s SWE-bench leap and the rising value of context engines

Early reports put Anthropic’s new Claude Sonnet 5 at 82.1% on SWE-bench with a 1M-token context, positioning it as a top coding agent for multi-repo workstreams [Vertu review](https://vertu.com/ai-tools/claude-sonnet-5-released-the-fennec-leak-antigravity-support-and-the-new-swe-bench-sota/?srsltid=AfmBOootYl50lkFfR364PidEU5-t-oscjkVho1kk36G3wJVnw2snSoQG)[^1] and drawing early hands-on validation from the community [early test video](https://www.youtube.com/watch?v=_87CirMQ1FM&pp=ygUXbmV3IEFJIG1vZGVsIGZvciBjb2Rpbmc%3D)[^2]. Independent evals also show the context layer matters as much as the model: a Claude Sonnet 4.5 agent augmented with Bito’s AI Architect context engine hit 60.8% on SWE-Bench Pro vs. 43.6% baseline (a 39% relative gain) [AI-Tech Park](https://ai-techpark.com/bitos-ai-architect-achieves-highest-success-rate-of-60-8-on-swe-bench-pro/)[^3]. Meanwhile, Anthropic committed to keeping Claude ad-free, underscoring enterprise trust and reducing incentive risks in assistant-driven workflows [Anthropic announcement](https://www.anthropic.com/news/claude-is-a-space-to-think)[^4]. [^1]: Roundup of Sonnet 5 claims (SWE-bench score, long context) and deployment notes. [^2]: Practitioner-level early testing and impressions on capabilities/cost. [^3]: Third-party evaluation showing large gains from a codebase knowledge graph context engine. [^4]: Official policy stance on ad-free Claude, relevant for compliance and procurement.
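The "39% relative gain" figure follows directly from the two reported pass rates:

```python
baseline = 43.6      # Claude Sonnet 4.5 agent alone, SWE-Bench Pro (%)
with_context = 60.8  # same agent plus Bito's context engine (%)

absolute_gain = with_context - baseline         # 17.2 percentage points
relative_gain = absolute_gain / baseline * 100  # ~39.4% relative improvement
```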

2026-02-04
anthropic claude claude-sonnet-5 bito ai-architect

AI coding agents: benchmarks mislead—separate generation from review

Benchmarks like SWE-bench reward pass/fail test outcomes, not maintainability or security, creating a false sense of readiness for AI-generated code; leaders should decouple "bookkeeping" (generation) from "auditing" with independent review gates and specialized tooling [Benchmarks Are Making AI Coding Look Safer Than It Is](https://deepengineering.substack.com/p/benchmarks-are-making-ai-coding-look)[^1]. In practice, agents already excel at tireless refactors and boilerplate, shifting the bottleneck from typing to ideation—use them for bulk fixes while tightening review policies and prompts [Six reasons to use coding agents](https://www.infoworld.com/article/4126558/six-reasons-to-use-coding-agents.html)[^2]. Practitioners also advocate simple, bash-first harnesses to contain agent workflows and reduce risk in CI/CD, avoiding “agent sprawl” and keeping orchestration deterministic [Pi – The AI Harness That Powers OpenClaw](https://www.youtube.com/watch?v=AEmHcFH1UgQ&pp=ygUYQUkgY29kaW5nIGFnZW50IHdvcmtmbG93)[^3]. [^1]: Explains why SWE-bench over-indexes on code generation, highlights review fatigue/quality rot, and argues for independent auditing (includes Qodo perspective). [^2]: Details concrete strengths of coding agents (repetitive tasks, speed, idea throughput) and how they change developer workflows. [^3]: Discusses risks of agents, “Bash is all you need,” and harnessed workflows to adapt safely within CI/CD.
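The decoupling argument can be made concrete: generation proposes a patch, and an independent gate with its own checks decides whether it merges, so the generator never approves its own output. A minimal Python sketch (the check names and patch fields are hypothetical, not any specific tool's API):

```python
def audit_gate(patch, checks):
    """Independent review gate: every named check must pass on the
    generated patch, or the patch is blocked with the failing names."""
    failures = [name for name, check in checks if not check(patch)]
    return len(failures) == 0, failures

# Checks a reviewer-side harness might enforce (illustrative only):
checks = [
    ("no_hardcoded_secrets", lambda p: "AKIA" not in p["diff"]),
    ("touches_tests", lambda p: p["touched_tests"]),
    ("reviewably_small", lambda p: p["lines_changed"] <= 400),
]

approved, reasons = audit_gate(
    {"diff": "fix: handle None input", "touched_tests": True, "lines_changed": 12},
    checks,
)
```

Keeping the gate's checks outside the generation loop is what makes this an audit rather than self-grading, which is the article's central point.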

2026-02-04
qodo ai-coding-agents code-quality ci-cd bash

Mixture-of-Models routing tops single LLMs on SWE-Bench via task specialization

A lightweight Mixture-of-Models router that assigns issues to semantic clusters and routes to the historically strongest model per cluster hit 75.6% on SWE-Bench, edging past single-model baselines (~74%) by exploiting complementary strengths rather than defaulting to the top aggregate model [Reddit summary](https://www.reddit.com/r/LocalLLaMA/comments/1qvm0ft/mixtureofmodels_routing_beats_single_llms_on/)[^1]. The authors share a methodology write-up and an open-source framework so teams can reproduce the gating approach without test-time search or repo execution [methodology blog](https://nordlyslabs.com/blog/hypernova)[^2] and [framework code](https://github.com/Nordlys-Labs/nordlys)[^3]. [^1]: Highlights task-level specialization on SWE-Bench and the routing approach with reported results. [^2]: Details the clustering, per-model success statistics, and routing mechanism. [^3]: Provides the open-source implementation for building a MoM router.

2026-02-04
nordlys nordlys-labs swe-bench mixture-of-models model-routing

Coding agents: smarter context and sequential planning beat model-only upgrades

Third‑party tests show Bito’s AI Architect lifted a Claude Sonnet 4.5 agent to 60.8% on SWE‑Bench Pro by adding MCP‑delivered codebase intelligence—up from 43.6% without it—with large gains across UI/UX, performance, critical, and security bugs ([Bito’s results](https://www.tipranks.com/news/private-companies/bitos-ai-architect-sets-new-swe-bench-pro-high-underscoring-strategic-edge-in-enterprise-coding-agents)[^1]). In parallel, a sequential plan‑reflection research agent (“Deep Researcher”) outperformed peers on DeepResearch Bench, indicating orchestration and iterative context refinement can outpace parallel scaling alone ([Deep Researcher](https://quantumzeitgeist.com/deep-researcher-achieves-phd-level-reports/)[^2]). [^1]: Independent evaluation by The Context Lab holding the model constant; details on SWE‑Bench Pro lift and task‑level gains via MCP-based context. [^2]: Explains sequential plan‑reflection and candidates crossover, with benchmark results vs. other research agents.

2026-02-03
bito bito-ai-architect claude-sonnet-45 the-context-lab deep-researcher

E2E coding agents: 27% pass, cheaper scaling, and safer adoption

A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports only 27.38% acceptance for agent-built repos, highlighting gaps in system design, complexity, and resource management. Efficiency is improving: [SWE-Replay](https://quantumzeitgeist.com/17-4-percent-performance-swe-replay-achieves-gain-efficient/)[^3] recycles prior agent trajectories to cut test-time compute by up to 17.4% while maintaining or slightly improving fix rates. For evaluation and safety, Together AI shows open LLM judges can beat GPT‑5.2 on preference alignment ([post](https://www.together.ai/blog/fine-tuning-open-llm-judges-to-outperform-gpt-5-2at/))[^4], Java teams get a pragmatic path via [ASTRA‑LangChain4j](https://quantumzeitgeist.com/ai-astra-langchain4j-achieves-llm-integration/)[^5], and an open‑weight coding LM targets agentic/local dev ([Qwen3‑Coder‑Next](https://www.youtube.com/watch?v=UwVi2iu-xyA&pp=ygURU1dFLWJlbmNoIHJlc3VsdHM%3D))[^6]. [^1]: Adds: defines an E2E agent benchmark with architecture, correctness, and refinement criteria plus pass-rate findings. [^2]: Adds: benchmark repository for tasks, harnesses, and evaluation assets. [^3]: Adds: test-time scaling via trajectory replay with up to 17.4% cost reduction and small performance gains on SWE-Bench variants. [^4]: Adds: DPO-tuned open "LLM-as-judge" models outperform GPT‑5.2 on RewardBench 2 preference alignment, with code/how-to. [^5]: Adds: Java integration pattern for agent+LLM via ASTRA modules and LangChain4J, including BeliefRAG and Maven packaging. [^6]: Adds: open-weight coding model positioned for agentic workflows and local development.
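Trajectory replay, as described for SWE-Replay, amounts to reusing the action sequence from a prior agent run on a sufficiently similar issue instead of planning every step from scratch. As a toy illustration only (the similarity key and API here are invented, not the paper's method):

```python
class TrajectoryReplayCache:
    """Toy trajectory reuse: remember the action sequence from a prior
    run, and replay it for similar new issues instead of paying for
    full re-planning."""

    def __init__(self):
        self._cache = {}

    @staticmethod
    def signature(issue_text):
        # Stand-in similarity key; a real system might bucket embeddings.
        return frozenset(w for w in issue_text.lower().split() if len(w) > 3)

    def store(self, issue_text, actions):
        self._cache[self.signature(issue_text)] = list(actions)

    def lookup(self, issue_text, min_overlap=0.5):
        sig = self.signature(issue_text)
        best, best_score = None, 0.0
        for cached_sig, actions in self._cache.items():
            union = sig | cached_sig
            score = len(sig & cached_sig) / len(union) if union else 0.0
            if score >= min_overlap and score > best_score:
                best, best_score = actions, score
        return best  # None means: fall back to full planning

cache = TrajectoryReplayCache()
cache.store("fix null pointer in parser module",
            ["open parser.py", "apply patch", "run tests"])
```

The compute saving comes from the cache hit path skipping model calls entirely; a miss degrades gracefully to the normal agent loop.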

2026-02-03
projdevbench swe-replay swe-bench-verified swe-bench-pro astra

Agentic AI: architecture patterns and what to measure before you ship

A new survey consolidates how LLM-based agents are built—policy/LLM core, memory, planners, tool routers, and critics—plus orchestration choices (single vs multi-agent) and deployment modes. It highlights practical trade-offs (latency vs accuracy, autonomy vs control) and evaluation pitfalls like hidden costs from retries and context growth, and the need for guardrails around tool actions. Benchmarks such as WebArena, ToolBench, SWE-bench, and GAIA illustrate task design and measurement under real constraints.
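The components the survey enumerates compose naturally into a bounded single-agent loop. A minimal Python sketch with stand-in callables (all interfaces here are hypothetical):

```python
def run_agent(task, llm, tools, max_steps=5):
    """Minimal single-agent loop wiring together the survey's parts:
    an LLM policy core, episodic memory, a tool router, and a step
    budget as a crude guardrail around tool actions."""
    memory = []                               # episodic memory (grows per step)
    for _ in range(max_steps):                # bounded autonomy
        action = llm(task, memory)            # policy core proposes next step
        if action["type"] == "finish":
            return action["answer"]
        observation = tools[action["tool"]](action["input"])  # tool router
        memory.append((action, observation))  # retries and context growth cost tokens
    return None                               # budget exhausted: fail closed

# Stub policy: call a calculator once, then answer from memory.
def stub_llm(task, memory):
    if not memory:
        return {"type": "tool", "tool": "calc", "input": "6*7"}
    return {"type": "finish", "answer": memory[-1][1]}

answer = run_agent("what is 6*7?", stub_llm, {"calc": lambda expr: eval(expr)})
```

The `max_steps` cap and the fail-closed return are where the survey's latency-vs-accuracy and autonomy-vs-control trade-offs surface in practice: every extra loop iteration replays the grown context through the model.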

2026-01-06
llm-agents tool-calling rag swe-bench webarena