HUGGING-FACE PUB_DATE: 2026.03.19

SWE-CI SHIFTS AGENT EVALUATION FROM ONE-SHOT BUG FIXES TO CI-DRIVEN MAINTAINABILITY

A new CI-loop benchmark, SWE-CI, measures whether AI coding agents can maintain real repositories over time, not just pass one-off tests.

SWE-CI packages 100 real repositories into tasks that average 233 active days and 71 commits each, then scores agents across dozens of iterations to track how functional correctness evolves. The dataset is on Hugging Face, and the code is on GitHub.

For teams building agentic workflows, this gives a maintainability yardstick that single-shot benchmarks miss. Pair it with production patterns such as verification loops and tool delegation from Agentic Patterns Developers Should Steal and stronger context engineering from Daily Dose of Data Science, and keep an eye on the open evaluation issues called out in The AI Report.

[ WHY_IT_MATTERS ]

01.

We finally have a benchmark that pressures agents on long-horizon code quality, not just isolated test passes.

02.

It informs how to gate agentic PRs in CI with real drift and regression signals.
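
One way to turn a regression signal into a CI gate is to diff per-test results before and after an agent's PR. The sketch below is illustrative only: `gate_agent_pr`, `GateDecision`, and the `max_regressions` threshold are hypothetical names, not part of SWE-CI, and the test-result dictionaries are assumed to come from whatever test runner the pipeline already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    regressions: list[str]  # tests that passed before the PR and fail (or vanish) after it
    fixed: list[str]        # tests that pass after the PR and did not pass before it


def gate_agent_pr(before: dict[str, bool], after: dict[str, bool],
                  max_regressions: int = 0) -> GateDecision:
    """Compare per-test results before and after an agent PR.

    `before` and `after` map test IDs to True (pass) or False (fail).
    A test missing from `after` is treated as failing, so deleting a
    test cannot hide a regression.
    """
    regressions = sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )
    fixed = sorted(
        name for name, passed in after.items()
        if passed and not before.get(name, False)
    )
    return GateDecision(len(regressions) <= max_regressions, regressions, fixed)
```

Treating a missing test as a failure is a deliberate design choice here: it keeps an agent from "fixing" a red build by removing the test that exposes the problem.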

[ WHAT_TO_TEST ]

01.
Mirror an internal service into a SWE-CI–style track: replay a commit history, run your agent behind CI, and measure pass-rate drift across iterations.
02.
A/B test an agent that uses verification loops and deterministic tools against one that makes free-form edits; compare regression count, review time, and compute cost.
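
Pass-rate drift across iterations can be summarized in a few lines. This is a minimal sketch under stated assumptions, not SWE-CI's own scoring: each iteration is assumed to yield a list of pass/fail booleans, and drift is approximated as the least-squares slope of pass rate against iteration index. The function names are hypothetical.

```python
from statistics import mean


def pass_rates(iterations: list[list[bool]]) -> list[float]:
    """Fraction of passing tests at each iteration (0.0 if no tests ran)."""
    return [sum(r) / len(r) if r else 0.0 for r in iterations]


def drift_slope(rates: list[float]) -> float:
    """Least-squares slope of pass rate against iteration index.

    A negative value means correctness erodes as the agent keeps editing.
    """
    n = len(rates)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(rates)
    num = sum((i - x_mean) * (r - y_mean) for i, r in enumerate(rates))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den
```

For an A/B comparison, the same two functions can be run on each arm's iteration results and the slopes compared alongside regression count, review time, and compute cost.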

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

01.
Introduce an agentic PR bot for low-risk chores (docs, lint, small refactors) behind feature flags and human review gates.
02.
Add context budgets, personas, and guardrails; log traces and outcomes to catch anchoring bias and regressions early.
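
A context budget and a trace log can start out very small. The sketch below assumes a crude word-count proxy for tokens and a JSON-lines trace format; `TraceRecord`, `trim_to_budget`, and `to_json_line` are hypothetical names chosen for illustration, and a real deployment would swap in the tokenizer and log sink it already uses.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceRecord:
    task_id: str
    persona: str
    context_tokens: int
    budget_tokens: int
    outcome: str  # e.g. "merged", "rejected", "reverted"
    timestamp: float = field(default_factory=time.time)

    @property
    def over_budget(self) -> bool:
        return self.context_tokens > self.budget_tokens


def trim_to_budget(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent chunks that fit within the budget.

    Token cost is approximated by whitespace-separated word count.
    """
    kept: list[str] = []
    used = 0
    for chunk in reversed(chunks):
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return list(reversed(kept))


def to_json_line(record: TraceRecord) -> str:
    """Serialize one trace as a single JSON line for append-only logs."""
    return json.dumps(asdict(record))
```

Logging the persona, context size, and outcome together makes it possible to check later whether regressions cluster around particular personas or oversized contexts.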

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

01.
Design agent-first services with plan → act → verify loops, context stores, and observable state machines from day one.
02.
Prototype with multi-agent orchestration (e.g., planner, implementer, reviewer) and measure maintainability using SWE-CI–like tracks.
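
The plan → act → verify loop can be modeled as an explicit state machine whose visited states are recorded for observability. This is a sketch under stated assumptions: `State` and `run_loop` are hypothetical names, and the `plan`, `act`, and `verify` callables stand in for whatever planner, implementer, and reviewer agents a team actually wires up.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    DONE = "done"
    FAILED = "failed"


def run_loop(plan: Callable[[str], str],
             act: Callable[[str], str],
             verify: Callable[[str], bool],
             task: str,
             max_attempts: int = 3) -> tuple[State, list[State]]:
    """Drive a plan -> act -> verify loop and record every state visited."""
    history: list[State] = []
    state = State.PLAN
    attempts = 0
    current_plan = ""
    artifact = ""
    while state not in (State.DONE, State.FAILED):
        history.append(state)
        if state is State.PLAN:
            current_plan = plan(task)
            state = State.ACT
        elif state is State.ACT:
            artifact = act(current_plan)
            attempts += 1
            state = State.VERIFY
        elif verify(artifact):
            state = State.DONE
        elif attempts < max_attempts:
            state = State.PLAN  # re-plan after a failed check
        else:
            state = State.FAILED
    history.append(state)  # include the terminal state in the trace
    return state, history
```

Returning the state history alongside the final state keeps the loop observable: the trace can be logged per task and compared across SWE-CI–like iterations.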
