SWE-BENCH-PRO PUB_DATE: 2026.04.04

SWE-Bench Pro leaderboard: small gains at the top, big contexts, and mostly self-reported results

A new SWE-Bench Pro leaderboard shows top code models clustered around 0.55–0.58, with large contexts and self-reported scores.

The updated SWE-Bench Pro leaderboard lists 10 models, all self-reported, with GPT-5.4 leading at 0.577 and an average of 0.551. It highlights big contexts (up to 1.0M tokens) and per-token prices (e.g., GPT-5.4 shown as $2.50 / $15.00). Gemini 3.1 Pro appears at 0.542; several MiniMax and Qwen entries sit near the top.

Dataset and code are marked “Soon,” and verification shows 0 of 10 results confirmed. Treat this as directional signal for code agents and long-context reasoning, and validate on your repos.

A related Medium piece on market split could not be loaded, and an X post failed to render, so this brief focuses on the leaderboard data only.

[ WHY_IT_MATTERS ]
01.

If you’re picking a model for code agents or automated bug fixing, this is a current snapshot of who’s competitive.

02.

Large contexts and price spread affect how you architect routing, caching, and budget for developer automation.

[ WHAT_TO_TEST ]
  • Run a head-to-head on your own bug-fix tasks: GPT-5.4 vs a cheaper option (e.g., GPT-5.4 mini) using identical prompts and tools.

  • Stress-test long-context workflows (500K–1M tokens) with real repos to check latency, truncation behavior, and tool-use stability.
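A head-to-head like the one above can be scaffolded with a small harness that runs identical tasks through two backends and compares pass rates. This is a minimal sketch: the model callables here are stubs standing in for real API clients, the task format is simplified to a substring check, and nothing here is a confirmed API for the models named on the leaderboard.

```python
# Minimal head-to-head harness: run the same tasks through two model
# backends and compare pass rates. The model functions are stand-ins --
# swap in real API calls; the pass check is a simplified substring match.

from typing import Callable, Dict, List

Task = Dict[str, str]  # {"prompt": ..., "expected": ...}

def run_suite(model: Callable[[str], str], tasks: List[Task]) -> float:
    """Return the fraction of tasks whose output contains the expected fix."""
    passed = sum(1 for t in tasks if t["expected"] in model(t["prompt"]))
    return passed / len(tasks)

def compare(model_a, model_b, tasks):
    """Run both models on identical tasks and return their pass rates."""
    return run_suite(model_a, tasks), run_suite(model_b, tasks)

# Stub backends for illustration only; replace with real completion calls.
def premium_model(prompt: str) -> str:
    return "apply patch: fix_off_by_one"

def cheap_model(prompt: str) -> str:
    return "apply patch: unknown"

tasks = [
    {"prompt": "fix off-by-one in pager", "expected": "fix_off_by_one"},
    {"prompt": "fix null deref in parser", "expected": "fix_null_deref"},
]

a_rate, b_rate = compare(premium_model, cheap_model, tasks)
print(f"premium: {a_rate:.2f}, cheap: {b_rate:.2f}")  # premium: 0.50, cheap: 0.00
```

Keeping prompts and tools identical across both backends is the point: any gap in pass rate then reflects the model, not the harness.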

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Pilot a router: send easy tickets to a cheaper model, escalate complex multi-file changes to a top model; measure cost per accepted fix.

  • 02.

    Add guardrails: pinned-tool prompts, diff validation, and test-first patches to reduce bad merges from aggressive agents.
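The router pilot above can be sketched in a few lines: route on a cheap complexity signal, escalate the rest, and track cost per accepted fix per route. The thresholds, the cheap model's pricing, and the routing signal (files touched) are illustrative assumptions; only the GPT-5.4 price pair comes from the leaderboard.

```python
# Complexity-based router sketch: easy tickets go to a cheaper model,
# multi-file changes escalate to a top model. Thresholds and the cheap
# model's pricing are illustrative assumptions, not leaderboard figures.

from dataclasses import dataclass

@dataclass
class Route:
    model: str
    usd_per_mtok_in: float
    usd_per_mtok_out: float

CHEAP = Route("cheap-model", 0.25, 1.00)   # hypothetical pricing
TOP = Route("gpt-5.4", 2.50, 15.00)        # prices as shown on the board

def route_ticket(files_touched: int, has_failing_test: bool) -> Route:
    """Escalate multi-file changes, or tickets with no failing test to anchor on."""
    if files_touched > 2 or not has_failing_test:
        return TOP
    return CHEAP

def cost_per_accepted_fix(total_usd: float, accepted: int) -> float:
    """The metric to track per route; guard against zero accepted fixes."""
    return float("inf") if accepted == 0 else total_usd / accepted

print(route_ticket(1, True).model)   # cheap-model
print(route_ticket(5, True).model)   # gpt-5.4
```

Measuring cost per *accepted* fix rather than cost per call is the key design choice: a cheap model that produces rejected patches can easily be more expensive end to end.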

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design for long context from day one: chunking, retrieval, and plan-and-execute loops that exploit 400K–1M token windows.

  • 02.

    Build an eval harness mirroring SWE-Bench-style tasks on your codebase; automate weekly regression checks across candidate models.
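The weekly regression check above can start as a simple tracker: record each model's pass rate per run and flag drops beyond a tolerance. The storage, tolerance, and model names are illustrative; in practice this would hang off your CI scheduler and eval harness.

```python
# Minimal regression tracker for an in-house SWE-Bench-style eval: store
# weekly pass rates per model and flag drops beyond a tolerance. The
# tolerance and in-memory storage are illustrative placeholders.

from collections import defaultdict

TOLERANCE = 0.03  # flag drops larger than 3 points

history = defaultdict(list)  # model name -> list of weekly pass rates

def record(model: str, pass_rate: float) -> None:
    history[model].append(pass_rate)

def regression(model: str):
    """Return (prev, curr) if the latest run regressed beyond tolerance, else None."""
    runs = history[model]
    if len(runs) >= 2 and runs[-2] - runs[-1] > TOLERANCE:
        return runs[-2], runs[-1]
    return None

record("candidate-a", 0.57)
record("candidate-a", 0.51)
print(regression("candidate-a"))  # (0.57, 0.51)
```

Running this weekly against pinned tasks on your own repos gives the validation step the brief recommends, since the leaderboard itself is self-reported and unverified.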
