SWE-BENCH PRO LEADERBOARD: SMALL GAINS AT THE TOP, BIG CONTEXTS, AND MOSTLY SELF-REPORTED RESULTS
A new SWE-Bench Pro leaderboard shows top code models clustered around 0.55–0.58, with large contexts and self-reported scores.
The updated SWE-Bench Pro leaderboard lists 10 models, all self-reported, with GPT-5.4 leading at 0.577 and an average of 0.551. It highlights big contexts (up to 1.0M tokens) and per-token prices (e.g., GPT-5.4 shown as $2.50 / $15.00). Gemini 3.1 Pro appears at 0.542; several MiniMax and Qwen entries sit near the top.
Dataset and code are marked “Soon,” and verification shows 0 of 10 results confirmed. Treat this as directional signal for code agents and long-context reasoning, and validate on your repos.
A related Medium piece on market split could not be loaded, and an X post failed to render, so this brief focuses on the leaderboard data only.
If you’re picking a model for code agents or automated bug fixing, this is a current snapshot of who’s competitive.
Large contexts and price spread affect how you architect routing, caching, and budget for developer automation.
- Run a head-to-head on your own bug-fix tasks: GPT-5.4 vs a cheaper option (e.g., GPT-5.4 mini) using identical prompts and tools.
- Stress-test long-context workflows (500K–1M tokens) with real repos to check latency, truncation behavior, and tool-use stability.
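Both checks can be driven by one small harness that feeds identical prompts to every candidate and tracks resolve rate and latency. A minimal sketch, assuming each model is exposed as a plain callable and `check` is your own accept/reject function (e.g., run the repo's tests against the returned patch); both names are hypothetical, not a real provider API:

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EvalResult:
    model: str
    resolved: int = 0
    total: int = 0
    latency_s: float = 0.0

    @property
    def resolve_rate(self) -> float:
        return self.resolved / self.total if self.total else 0.0

def head_to_head(
    tasks: List[str],
    models: Dict[str, Callable[[str], str]],
    check: Callable[[str, str], bool],
) -> Dict[str, EvalResult]:
    """Run every task through every model with the identical prompt; score with `check`."""
    results = {name: EvalResult(model=name) for name in models}
    for task in tasks:
        for name, call in models.items():
            start = time.perf_counter()
            patch = call(task)  # same prompt and tools for every model
            elapsed = time.perf_counter() - start
            r = results[name]
            r.total += 1
            r.latency_s += elapsed
            r.resolved += int(check(task, patch))
    return results
```

Swap the lambdas for real API clients and point `tasks` at your own bug-fix tickets; the per-model latency totals double as a crude long-context stress signal.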
Legacy codebase integration strategies...
1. Pilot a router: send easy tickets to a cheaper model, escalate complex multi-file changes to a top model; measure cost per accepted fix.
2. Add guardrails: pinned-tool prompts, diff validation, and test-first patches to reduce bad merges from aggressive agents.
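The routing idea in step 1 can be prototyped as a heuristic gate plus cost accounting. A sketch under loose assumptions: tickets carry a `files_changed` estimate, and per-call prices are rough placeholders you'd replace with real token pricing (all names hypothetical):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_call: float  # rough $/request; substitute real token-based pricing

def route(ticket: Dict[str, int], cheap: Tier, top: Tier, threshold: int = 3) -> Tier:
    """Escalate multi-file changes to the top model; keep easy tickets on the cheap one."""
    return top if ticket.get("files_changed", 0) >= threshold else cheap

def cost_per_accepted_fix(outcomes: List[Tuple[Tier, bool]]) -> float:
    """outcomes: one (tier used, fix accepted?) pair per ticket."""
    spent = sum(tier.cost_per_call for tier, _ in outcomes)
    accepted = sum(ok for _, ok in outcomes)
    return spent / accepted if accepted else float("inf")
```

Start with a crude gate like this, then replace `files_changed` with whatever complexity signal your triage pipeline already produces.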
Fresh architecture paradigms...
1. Design for long context from day one: chunking, retrieval, and plan-and-execute loops that exploit 400K–1M token windows.
2. Build an eval harness mirroring SWE-Bench-style tasks on your codebase; automate weekly regression checks across candidate models.
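For item 1, a greedy packer over retrieval-ranked files is enough to stay inside a fixed window. A sketch assuming a crude 4-characters-per-token estimate (not a real tokenizer) and a `ranked_files` list of (path, text) pairs already sorted by relevance:

```python
from typing import List, Tuple

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token; swap in a real tokenizer for production.
    return max(1, len(text) // 4)

def pack_context(
    ranked_files: List[Tuple[str, str]], budget_tokens: int
) -> Tuple[List[str], int]:
    """Greedily include highest-ranked files until the token budget is spent."""
    included, used = [], 0
    for path, text in ranked_files:
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip files that would overflow; smaller files may still fit
        included.append(path)
        used += cost
    return included, used
```

The same function works whether your window is 400K or 1M tokens; only `budget_tokens` changes, which makes it easy to A/B the long-context models in the table above.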