QWEN3 PUB_DATE: 2026.01.20

ABC-Bench: End-to-end benchmark for agentic backend coding

ABC-Bench evaluates LLM agents on real backend tasks, from repo exploration through Dockerization, service deployment, and end-to-end API testing. It includes 224 tasks spanning 8 languages and 19 frameworks, and shows that current models underperform on full-lifecycle work. The dataset and two Qwen3-based agent variants are open-sourced for experimentation.

[ WHY_IT_MATTERS ]
01.

It measures agent reliability on practical, end-to-end tasks rather than isolated code snippets, aligning with how backend teams actually build and ship services.

02.

It offers a reproducible, containerized benchmark to compare agents and guide where to invest tooling or fine-tuning.

[ WHAT_TO_TEST ]
  • 01.

    Baseline your code agent against ABC-Bench and track failure modes in env setup, Dockerfiles, dependency management, and API contract handling.

  • 02.

    Test tool augmentation (e.g., shell, package managers, Docker) or fine-tuning on a subset to see if pass rates and deployment robustness improve.
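Tracking failure modes can start as a tally of failed tasks by lifecycle stage. A minimal sketch in Python, assuming a hypothetical per-task result schema (ABC-Bench's actual output format may differ):

```python
from collections import Counter

# Hypothetical per-task results; field names are illustrative, not ABC-Bench's schema.
results = [
    {"task": "flask-todo", "passed": False, "failure_stage": "dockerfile_build"},
    {"task": "spring-auth", "passed": True, "failure_stage": None},
    {"task": "express-blog", "passed": False, "failure_stage": "env_setup"},
    {"task": "gin-api", "passed": False, "failure_stage": "api_contract"},
]

def failure_breakdown(records):
    """Tally failed tasks by lifecycle stage to see where the agent breaks down."""
    return Counter(r["failure_stage"] for r in records if not r["passed"])

def pass_rate(records):
    """Fraction of tasks that completed the full lifecycle."""
    return sum(r["passed"] for r in records) / len(records)

print(failure_breakdown(results))
print(f"pass rate: {pass_rate(results):.2f}")  # pass rate: 0.25
```

Re-running this breakdown after each tooling or fine-tuning change shows whether improvements land in the stage you targeted.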

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Use ABC-Bench tasks that match your stack to validate agents before granting repo access, and gate changes via PRs that surface Dockerfile and dependency diffs.

  • 02.

    Check agent compatibility with your CI/CD pipelines and container images, and measure API regressions against existing contract tests.
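The PR gate from point 01 can be sketched as a filter over changed file paths (e.g., the output of `git diff --name-only`). A minimal sketch; the manifest list is illustrative, not exhaustive:

```python
# Hypothetical PR gate: surface agent changes to Dockerfiles or dependency
# manifests for human review before merge.
DEPENDENCY_MANIFESTS = {
    "requirements.txt", "pyproject.toml", "package.json", "go.mod",
    "pom.xml", "Gemfile", "Cargo.toml",
}

def needs_review(changed_paths):
    """Return the subset of changed files that should surface in PR review."""
    flagged = []
    for path in changed_paths:
        name = path.rsplit("/", 1)[-1]
        if name == "Dockerfile" or name.startswith("Dockerfile."):
            flagged.append(path)
        elif name in DEPENDENCY_MANIFESTS:
            flagged.append(path)
    return flagged

# e.g. feed this the changed files from an agent's branch
print(needs_review(["api/Dockerfile", "src/app.py", "package.json"]))
# ['api/Dockerfile', 'package.json']
```

A CI job could fail (or request an extra reviewer) whenever this returns a non-empty list, keeping agent-authored infrastructure changes visible in the diff.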

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Structure new services with clear Dockerfiles, Makefiles, and API tests so agents can automate setup, deployment, and verification.

  • 02.

    Prefer frameworks represented in ABC-Bench to leverage ready-made tasks for continuous agent benchmarking.
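The API-test leg of that layout can start as a plain contract check that an agent or CI job runs after deployment. A minimal sketch, with hypothetical field names:

```python
# Hypothetical contract for a JSON endpoint: required fields and their types.
CONTRACT = {"id": int, "title": str, "done": bool}

def check_contract(response_body, contract=CONTRACT):
    """Verify a response carries every contracted field with the right type."""
    errors = []
    for field, expected_type in contract.items():
        if field not in response_body:
            errors.append(f"missing field: {field}")
        elif not isinstance(response_body[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors

print(check_contract({"id": 1, "title": "ship it", "done": False}))  # []
```

Keeping checks like this in the repo gives an agent a concrete verification target for the deploy-and-test loop, mirroring how ABC-Bench scores end-to-end API behavior.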