PROJDEVBENCH PUB_DATE: 2026.02.03

E2E CODING AGENTS: 27% PASS, CHEAPER SCALING, AND SAFER ADOPTION

A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports only 27.38% acceptance for agent-built repos, highlighting gaps in system design, complexity handling, and resource management. Efficiency is improving: SWE-Replay[^3] recycles prior agent trajectories to cut test-time compute by up to 17.4% while maintaining or slightly improving fix rates. For evaluation and safety, Together AI shows open LLM judges can beat GPT‑5.2 on preference alignment[^4], a security analysis maps self-propagating "prompt worms" across agent networks[^5], Java teams get a pragmatic path via ASTRA‑LangChain4j[^6], and an open-weight coding LM, Qwen3‑Coder‑Next[^7], targets agentic workflows and local development.

[^1]: Adds: defines an E2E agent benchmark with architecture, correctness, and refinement criteria, plus pass-rate findings.

[^2]: Adds: benchmark repository for tasks, harnesses, and evaluation assets.

[^3]: Adds: test-time scaling via trajectory replay, with up to 17.4% cost reduction and small performance gains on SWE-Bench variants.

[^4]: Adds: DPO-tuned open "LLM-as-judge" models that outperform GPT‑5.2 on RewardBench 2 preference alignment, with code and a how-to.

[^5]: Adds: security analysis of self-propagating adversarial prompts ("prompt worms"), with the OpenClaw agent network as an example.

[^6]: Adds: Java integration pattern for agent+LLM via ASTRA modules and LangChain4j, including BeliefRAG and Maven packaging.

[^7]: Adds: open-weight coding model positioned for agentic workflows and local development.
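SWE-Replay's exact mechanism is described in the paper; the general shape of trajectory recycling can be sketched as a cache of prior action sequences keyed by a task fingerprint, replayed (and re-verified) before any new, expensive rollout is paid for. All names below are illustrative, not the SWE-Replay API:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class TrajectoryArchive:
    """Caches successful action sequences per task fingerprint."""
    _store: dict = field(default_factory=dict)

    @staticmethod
    def fingerprint(task: str) -> str:
        # Stable key for "the same task seen before".
        return hashlib.sha256(task.encode()).hexdigest()[:16]

    def record(self, task: str, actions: list) -> None:
        self._store.setdefault(self.fingerprint(task), []).append(actions)

    def candidates(self, task: str) -> list:
        return self._store.get(self.fingerprint(task), [])

def solve(task, archive, fresh_rollout, passes):
    """Replay cached trajectories before paying for a fresh rollout."""
    for actions in archive.candidates(task):
        if passes(task, actions):       # re-verify: the repo may have drifted
            return actions, "replayed"
    actions = fresh_rollout(task)       # expensive path: new model calls
    if passes(task, actions):
        archive.record(task, actions)   # only keep verified trajectories
    return actions, "fresh"
```

The cost saving comes from the replay branch never touching the model; the re-verification step is what keeps fix rates from degrading when the codebase has moved.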

[ WHY_IT_MATTERS ]
01.

E2E success is still low, so teams need realistic benchmarks and cost-aware scaling to avoid overpromising agent capabilities.

02.

Better judges and security patterns reduce regressions and mitigate risks from autonomous, networked agents.

[ WHAT_TO_TEST ]
  • terminal

    Run ProjDevBench tasks on your agent stack and gate outputs with an open LLM judge to quantify quality and drift.

  • terminal

    Add a trajectory archive (SWE-Replay style) to agent retries and measure cost/latency vs. pass-rate deltas in CI.
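One minimal shape for the judge gate above, with the judge call stubbed (in practice it would hit an open judge model's endpoint); the rubric, threshold, and vote count are assumptions, not values from the linked post:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class JudgeVerdict:
    score: float       # 0.0-1.0 rubric score from the judge model
    rationale: str     # keep for audit trails and drift debugging

def gate(diff, judge, threshold=0.7, votes=3):
    """Query the judge several times and accept on the mean score,
    which damps single-sample noise; logging the score over time
    gives the drift metric."""
    verdicts = [judge(diff) for _ in range(votes)]
    score = mean(v.score for v in verdicts)
    return score >= threshold, score, verdicts
```

Wiring this into CI means failing the build when `gate` returns `False` and exporting the mean score as a time series per repo.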

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Wrap existing CI/CD with judge-based checks, tool allowlists, and sandboxed execution before enabling agent autonomy.

  • 02.

    For Java services, integrate LLM calls via ASTRA‑LangChain4j behind feature flags with audit logging and rollback.
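The allowlist-plus-audit pattern from point 01 is language-agnostic (behind a LangChain4j tool interface it would look the same in Java); a generic sketch with hypothetical names:

```python
import time

class ToolGateway:
    """Checks every tool call an agent proposes against an explicit
    allowlist and appends an audit entry before anything executes."""

    def __init__(self, allowlist, audit_log):
        self.allowlist = set(allowlist)
        self.audit_log = audit_log      # any append()-able sink

    def call(self, tool_name, tool_fn, *args):
        allowed = tool_name in self.allowlist
        self.audit_log.append({
            "ts": time.time(),
            "tool": tool_name,
            "args": list(args),
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"tool {tool_name!r} is not allowlisted")
        return tool_fn(*args)
```

The important design choice is that the audit entry is written before execution, so denied and crashed calls still leave a trace for rollback analysis.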

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design agent-first workflows with ephemeral environments, secrets isolation, and human-in-the-loop checkpoints.

  • 02.

    Prototype locally with open-weight coding LMs and plug in judge models early for PR review and regression scoring.
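A human-in-the-loop checkpoint from point 01 can be sketched as named gates in the pipeline: execution only proceeds past a gate when an injected approver (a review UI, or a CI policy) says yes. The structure below is a minimal illustration, not a prescribed framework:

```python
def run_pipeline(steps, approver):
    """steps: list of (name, fn, needs_approval) tuples. Execution stops
    at the first checkpoint the approver rejects; completed step names
    are returned so the run can be resumed after a human signs off."""
    done = []
    for name, fn, needs_approval in steps:
        if needs_approval and not approver(name):
            break                     # halt at the checkpoint, keep state
        fn()
        done.append(name)
    return done
```

Keeping the approver as a plain callable makes it trivial to swap a real review queue in for the auto-approve stub used during local prototyping.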