ABC-BENCH: END-TO-END BENCHMARK FOR AGENTIC BACKEND CODING
ABC-Bench evaluates LLM agents on real backend tasks, from repo exploration through Dockerization, service deployment, and end-to-end API testing. It includes 224 tasks across 8 languages and 19 frameworks, and shows that current models underperform on full-lifecycle work. The dataset and two Qwen3-based agent variants are open-sourced for experimentation.
It measures agent reliability on practical tasks beyond code snippets, aligning with how backend teams actually build and ship services.
It offers a reproducible, containerized benchmark to compare agents and guide where to invest tooling or fine-tuning.
- Baseline your code agent against ABC-Bench and track failure modes in environment setup, Dockerfiles, dependency management, and API contract handling.
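A failure-mode breakdown like this can be tallied from per-task results. The record schema below (`task`, `passed`, `stage` fields and the stage names) is a hypothetical sketch, not ABC-Bench's actual output format:

```python
from collections import Counter

# Hypothetical per-task result records; ABC-Bench's real schema may differ.
results = [
    {"task": "flask-api-01", "passed": False, "stage": "env_setup"},
    {"task": "spring-svc-02", "passed": False, "stage": "docker_build"},
    {"task": "express-api-03", "passed": True, "stage": "done"},
    {"task": "gin-api-04", "passed": False, "stage": "api_contract"},
]

def failure_modes(records):
    """Tally which lifecycle stage each failed task died in."""
    return Counter(r["stage"] for r in records if not r["passed"])

print(failure_modes(results))  # one failure each in env_setup, docker_build, api_contract
```

Tracking the tally over successive agent versions shows whether fixes (e.g., better Dockerfile generation) actually shift the distribution.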
- Test tool augmentation (e.g., shell, package managers, Docker) or fine-tuning on a subset to see whether pass rates and deployment robustness improve.
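One minimal way to experiment with tool augmentation is a registry that exposes shell and Docker commands to the agent loop. The registry shape and tool names here are illustrative assumptions, not part of ABC-Bench:

```python
import subprocess

# Illustrative tool registry: each tool maps a name to a callable the agent
# can invoke. Real agent frameworks use richer schemas (JSON tool specs, etc.).
TOOLS = {
    "shell": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
    "docker_build": lambda path: subprocess.run(
        ["docker", "build", path], capture_output=True, text=True
    ).returncode,
}

def call_tool(name, arg):
    """Dispatch an agent tool call; unknown names are rejected."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](arg)

print(call_tool("shell", "echo hello"))
```

Comparing pass rates with and without individual tools enabled isolates which capability (shell access, package managers, Docker) the agent actually needs.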
Legacy codebase integration strategies...
1. Use ABC-Bench tasks that match your stack to validate agents before granting repo access, and gate changes via PRs that surface Dockerfile and dependency diffs.
2. Check agent compatibility with your CI/CD and container images, and measure API regressions against existing contract tests.
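The contract-test gate in the steps above can be as simple as checking each response against recorded field names and types. The contract and field names below are hypothetical examples:

```python
# Hypothetical API contract: expected field name -> expected Python type.
CONTRACT = {"id": int, "name": str, "active": bool}

def violates_contract(response: dict) -> list:
    """Return a list of contract violations for one API response."""
    problems = []
    for field, ftype in CONTRACT.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], ftype):
            problems.append(
                f"wrong type for {field}: {type(response[field]).__name__}"
            )
    return problems

print(violates_contract({"id": 7, "name": "svc", "active": "yes"}))
```

Running this over responses from the agent-modified service, compared against the pre-change baseline, surfaces contract regressions before merge.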
Fresh architecture paradigms...
1. Structure new services with clear Dockerfiles, Makefiles, and API tests so agents can automate setup, deployment, and verification.
2. Prefer frameworks represented in ABC-Bench to leverage ready-made tasks for continuous agent benchmarking.
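An agent-friendly service layout keeps each lifecycle step explicit and pinned. A minimal sketch, assuming a Python service on port 8000 (base image, paths, and entrypoint are illustrative):

```dockerfile
# Illustrative Dockerfile: pinned base image, cached dependency install,
# explicit port and entrypoint so an agent can build, run, and test it.
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "app.py"]
```

Pairing this with Makefile targets such as `build`, `run`, and `test` gives an agent (or CI) one obvious command per lifecycle stage.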