PROJDEVBENCH PUB_DATE: 2026.02.03

E2E CODING AGENTS: 27% PASS, CHEAPER SCALING, AND SAFER ADOPTION

A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports only 27.38% acceptance for agent-built repos, highlighting gaps in system design, complexity handling, and resource management. Efficiency is improving: SWE-Replay[^3] recycles prior agent trajectories to cut test-time compute by up to 17.4% while maintaining or slightly improving fix rates. For evaluation and safety, Together AI shows open LLM judges can beat GPT‑5.2 on preference alignment[^4], a security analysis maps self-propagating "prompt worms" across agent networks[^5], Java teams get a pragmatic path via ASTRA‑LangChain4j[^6], and an open-weight coding LM, Qwen3‑Coder‑Next[^7], targets agentic workflows and local development.

[^1]: Adds: defines an E2E agent benchmark with architecture, correctness, and refinement criteria, plus pass-rate findings.

[^2]: Adds: benchmark repository for tasks, harnesses, and evaluation assets.

[^3]: Adds: test-time scaling via trajectory replay, with up to 17.4% cost reduction and small performance gains on SWE-Bench variants.

[^4]: Adds: DPO-tuned open "LLM-as-judge" models that outperform GPT‑5.2 on RewardBench 2 preference alignment, with code and a how-to.

[^5]: Adds: security analysis of self-propagating adversarial prompts ("prompt worms"), with the OpenClaw agent network as an example.

[^6]: Adds: Java integration pattern for agent+LLM via ASTRA modules and LangChain4j, including BeliefRAG and Maven packaging.

[^7]: Adds: open-weight coding model positioned for agentic workflows and local development.
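SWE-Replay's exact mechanism is described in the paper; the general shape of trajectory recycling can be sketched as a cache of prior action sequences keyed by a task fingerprint, replayed (and re-verified) before any new, expensive rollout is paid for. All names below are illustrative, not the SWE-Replay API:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class TrajectoryArchive:
    """Caches successful action sequences per task fingerprint."""
    _store: dict = field(default_factory=dict)

    @staticmethod
    def fingerprint(task: str) -> str:
        # Stable key for "the same task seen before".
        return hashlib.sha256(task.encode()).hexdigest()[:16]

    def record(self, task: str, actions: list) -> None:
        self._store.setdefault(self.fingerprint(task), []).append(actions)

    def candidates(self, task: str) -> list:
        return self._store.get(self.fingerprint(task), [])

def solve(task, archive, fresh_rollout, passes):
    """Replay cached trajectories before paying for a fresh rollout."""
    for actions in archive.candidates(task):
        if passes(task, actions):       # re-verify: the repo may have drifted
            return actions, "replayed"
    actions = fresh_rollout(task)       # expensive path: new model calls
    if passes(task, actions):
        archive.record(task, actions)   # only keep verified trajectories
    return actions, "fresh"
```

The cost saving comes from the replay branch never touching the model; the re-verification step is what keeps fix rates from degrading when the codebase has moved.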

[ WHY_IT_MATTERS ]
01.

E2E success is still low, so teams need realistic benchmarks and cost-aware scaling to avoid overpromising agent capabilities.

02.

Better judges and security patterns reduce regressions and mitigate risks from autonomous, networked agents.

[ WHAT_TO_TEST ]
  • terminal

    Run ProjDevBench tasks on your agent stack and gate outputs with an open LLM judge to quantify quality and drift.

  • terminal

    Add a trajectory archive (SWE-Replay style) to agent retries and measure cost/latency vs. pass-rate deltas in CI.
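One minimal shape for the judge gate above, with the judge call stubbed (in practice it would hit an open judge model's endpoint); the rubric, threshold, and vote count are assumptions, not values from the linked post:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class JudgeVerdict:
    score: float       # 0.0-1.0 rubric score from the judge model
    rationale: str     # keep for audit trails and drift debugging

def gate(diff, judge, threshold=0.7, votes=3):
    """Query the judge several times and accept on the mean score,
    which damps single-sample noise; logging the score over time
    gives the drift metric."""
    verdicts = [judge(diff) for _ in range(votes)]
    score = mean(v.score for v in verdicts)
    return score >= threshold, score, verdicts
```

Wiring this into CI means failing the build when `gate` returns `False` and exporting the mean score as a time series per repo.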

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Wrap existing CI/CD with judge-based checks, tool allowlists, and sandboxed execution before enabling agent autonomy.

  • 02.

    For Java services, integrate LLM calls via ASTRA‑LangChain4j behind feature flags with audit logging and rollback.
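The allowlist-plus-audit pattern from point 01 is language-agnostic (behind a LangChain4j tool interface it would look the same in Java); a generic sketch with hypothetical names:

```python
import time

class ToolGateway:
    """Checks every tool call an agent proposes against an explicit
    allowlist and appends an audit entry before anything executes."""

    def __init__(self, allowlist, audit_log):
        self.allowlist = set(allowlist)
        self.audit_log = audit_log      # any append()-able sink

    def call(self, tool_name, tool_fn, *args):
        allowed = tool_name in self.allowlist
        self.audit_log.append({
            "ts": time.time(),
            "tool": tool_name,
            "args": list(args),
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"tool {tool_name!r} is not allowlisted")
        return tool_fn(*args)
```

The important design choice is that the audit entry is written before execution, so denied and crashed calls still leave a trace for rollback analysis.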

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design agent-first workflows with ephemeral environments, secrets isolation, and human-in-the-loop checkpoints.

  • 02.

    Prototype locally with open-weight coding LMs and plug in judge models early for PR review and regression scoring.
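A human-in-the-loop checkpoint from point 01 can be sketched as named gates in the pipeline: execution only proceeds past a gate when an injected approver (a review UI, or a CI policy) says yes. The structure below is a minimal illustration, not a prescribed framework:

```python
def run_pipeline(steps, approver):
    """steps: list of (name, fn, needs_approval) tuples. Execution stops
    at the first checkpoint the approver rejects; completed step names
    are returned so the run can be resumed after a human signs off."""
    done = []
    for name, fn, needs_approval in steps:
        if needs_approval and not approver(name):
            break                     # halt at the checkpoint, keep state
        fn()
        done.append(name)
    return done
```

Keeping the approver as a plain callable makes it trivial to swap a real review queue in for the auto-approve stub used during local prototyping.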