SWE-BENCH PRO LEADERBOARD: SMALL GAINS AT THE TOP, BIG CONTEXTS, AND MOSTLY SELF-REPORTED RESULTS
A new SWE-Bench Pro leaderboard shows top code models clustered around 0.55–0.58, with large contexts and self-reported scores.
The updated SWE-Bench Pro leaderboard lists 10 models, all self-reported, with GPT-5.4 leading at 0.577 and an average of 0.551. It highlights big contexts (up to 1.0M tokens) and per-token prices (e.g., GPT-5.4 shown as $2.50 / $15.00). Gemini 3.1 Pro appears at 0.542; several MiniMax and Qwen entries sit near the top.
Dataset and code are marked “Soon,” and verification shows 0 of 10 results confirmed. Treat this as directional signal for code agents and long-context reasoning, and validate on your repos.
A related Medium piece on market split could not be loaded, and an X post failed to render, so this brief focuses on the leaderboard data only.
If you’re picking a model for code agents or automated bug fixing, this is a current snapshot of who’s competitive.
Large contexts and price spread affect how you architect routing, caching, and budget for developer automation.
- Run a head-to-head on your own bug-fix tasks: GPT-5.4 vs a cheaper option (e.g., GPT-5.4 mini) using identical prompts and tools.
- Stress-test long-context workflows (500K–1M tokens) with real repos to check latency, truncation behavior, and tool-use stability.
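Both checks can be driven by one small harness that feeds identical prompts to every candidate and tracks resolve rate and latency. A minimal sketch, assuming each model is exposed as a plain callable and `check` is your own accept/reject function (e.g., run the repo's tests against the returned patch); both names are hypothetical, not a real provider API:

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EvalResult:
    model: str
    resolved: int = 0
    total: int = 0
    latency_s: float = 0.0

    @property
    def resolve_rate(self) -> float:
        return self.resolved / self.total if self.total else 0.0

def head_to_head(
    tasks: List[str],
    models: Dict[str, Callable[[str], str]],
    check: Callable[[str, str], bool],
) -> Dict[str, EvalResult]:
    """Run every task through every model with the identical prompt; score with `check`."""
    results = {name: EvalResult(model=name) for name in models}
    for task in tasks:
        for name, call in models.items():
            start = time.perf_counter()
            patch = call(task)  # same prompt and tools for every model
            elapsed = time.perf_counter() - start
            r = results[name]
            r.total += 1
            r.latency_s += elapsed
            r.resolved += int(check(task, patch))
    return results
```

Swap the lambdas for real API clients and point `tasks` at your own bug-fix tickets; the per-model latency totals double as a crude long-context stress signal.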
Legacy codebase integration strategies...
1. Pilot a router: send easy tickets to a cheaper model, escalate complex multi-file changes to a top model; measure cost per accepted fix.
2. Add guardrails: pinned-tool prompts, diff validation, and test-first patches to reduce bad merges from aggressive agents.
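The routing idea in step 1 can be prototyped as a heuristic gate plus cost accounting. A sketch under loose assumptions: tickets carry a `files_changed` estimate, and per-call prices are rough placeholders you'd replace with real token pricing (all names hypothetical):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_call: float  # rough $/request; substitute real token-based pricing

def route(ticket: Dict[str, int], cheap: Tier, top: Tier, threshold: int = 3) -> Tier:
    """Escalate multi-file changes to the top model; keep easy tickets on the cheap one."""
    return top if ticket.get("files_changed", 0) >= threshold else cheap

def cost_per_accepted_fix(outcomes: List[Tuple[Tier, bool]]) -> float:
    """outcomes: one (tier used, fix accepted?) pair per ticket."""
    spent = sum(tier.cost_per_call for tier, _ in outcomes)
    accepted = sum(ok for _, ok in outcomes)
    return spent / accepted if accepted else float("inf")
```

Start with a crude gate like this, then replace `files_changed` with whatever complexity signal your triage pipeline already produces.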
Fresh architecture paradigms...
1. Design for long context from day one: chunking, retrieval, and plan-and-execute loops that exploit 400K–1M token windows.
2. Build an eval harness mirroring SWE-Bench-style tasks on your codebase; automate weekly regression checks across candidate models.
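For item 1, a greedy packer over retrieval-ranked files is enough to stay inside a fixed window. A sketch assuming a crude 4-characters-per-token estimate (not a real tokenizer) and a `ranked_files` list of (path, text) pairs already sorted by relevance:

```python
from typing import List, Tuple

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token; swap in a real tokenizer for production.
    return max(1, len(text) // 4)

def pack_context(
    ranked_files: List[Tuple[str, str]], budget_tokens: int
) -> Tuple[List[str], int]:
    """Greedily include highest-ranked files until the token budget is spent."""
    included, used = [], 0
    for path, text in ranked_files:
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip files that would overflow; smaller files may still fit
        included.append(path)
        used += cost
    return included, used
```

The same function works whether your window is 400K or 1M tokens; only `budget_tokens` changes, which makes it easy to A/B the long-context models in the table above.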