AGENT EVALS ARE NOW SYSTEM TESTS, NOT MODEL TESTS
Coding AI moved from single-shot prompts to agents you must evaluate as full systems.
The new Promptfoo agent eval guide (Promptfoo is now part of OpenAI) reframes testing around three runtime tiers (plain LLM, SDK-based agent, and rich client/server), where tool access, safety posture, and state drive outcomes. It pushes teams to log intermediate steps, cost, and latency, not just final answers.
Benchmarks are also splitting. A short GLM-5.1 SWE-Bench explainer shows why “Verified” vs “Pro” scores diverge, while vendor videos (e.g., for MiMo V2.5 Pro) tout wins on assorted suites. The takeaway: fix your runtime boundary and scoring rubric before you compare anything.
Evaluating the system (tools, state, safety) reveals cost, latency, and failure modes hidden by final-answer scoring.
SWE-Bench variants score differently; locked-down, reproducible harnesses stop apples-to-oranges comparisons.
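One way to keep comparisons honest is to fingerprint the harness configuration and refuse to compare scores whose fingerprints differ. The sketch below is only an illustration: the config keys (`benchmark`, `seed`, `runtime_tier`, `model`) and the helper names are assumptions, not part of Promptfoo or SWE-Bench.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a short, order-independent hash of a harness config."""
    # sort_keys gives a canonical form, so key order does not change the hash.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(run_a: dict, run_b: dict, ignore=("model",)) -> bool:
    """Two runs are comparable if their configs match on every key
    except the ones we intend to vary (hypothetically, the model)."""
    def strip(cfg: dict) -> dict:
        return {k: v for k, v in cfg.items() if k not in ignore}

    return config_fingerprint(strip(run_a)) == config_fingerprint(strip(run_b))
```

Storing the fingerprint next to every score makes an accidental Verified-vs-Pro comparison show up as a mismatch instead of a misleading delta.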
- Run the same model as plain LLM vs SDK-based agent on the same patch set; compare success, steps, tool calls, cost, and wall time.
- Reproduce a SWE-Bench Verified run, then switch to Pro and document the delta; pin seeds and runtime boundary for both.
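The tier comparison above can be sketched as a small aggregation over per-task run records. This is a minimal illustration, assuming you already export one record per task and tier; the `RunRecord` fields and tier labels are hypothetical names, not a Promptfoo schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    tier: str            # illustrative label, e.g. "plain_llm" or "sdk_agent"
    task_id: str
    resolved: bool       # did the patch pass the task's checks?
    steps: int
    tool_calls: int
    cost_usd: float
    wall_seconds: float


def same_task_set(records) -> bool:
    """True only if every tier ran exactly the same set of task ids."""
    tasks_by_tier = {}
    for r in records:
        tasks_by_tier.setdefault(r.tier, set()).add(r.task_id)
    task_sets = list(tasks_by_tier.values())
    return all(s == task_sets[0] for s in task_sets)


def summarize(records) -> dict:
    """Aggregate success, steps, tool calls, cost, and wall time per tier."""
    by_tier = {}
    for r in records:
        by_tier.setdefault(r.tier, []).append(r)
    return {
        tier: {
            "tasks": len(runs),
            "success_rate": sum(r.resolved for r in runs) / len(runs),
            "mean_steps": mean(r.steps for r in runs),
            "mean_tool_calls": mean(r.tool_calls for r in runs),
            "total_cost_usd": sum(r.cost_usd for r in runs),
            "mean_wall_seconds": mean(r.wall_seconds for r in runs),
        }
        for tier, runs in by_tier.items()
    }
```

Checking `same_task_set` before `summarize` guards against comparing tiers that silently ran different patch sets.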
Legacy codebase integration strategies...
01. Add an agent-eval job to CI with a read-only sandbox and explicit tool allowlists; fail on cost/latency regressions or policy violations.
02. Standardize on one agent tier (SDK vs app-server) per pipeline and store full traces for diffing across upgrades.
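A CI gate along these lines could compare a candidate run against a stored baseline. The following is a rough sketch under stated assumptions: the `EvalSummary` fields, the 10% regression tolerance, and the tool names used in any allowlist are all hypothetical and would come from your own harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    cost_usd: float
    p95_latency_s: float
    tools_used: frozenset  # names of tools the agent actually invoked


def gate(baseline: EvalSummary, candidate: EvalSummary,
         allowed_tools: frozenset, max_regression: float = 0.10) -> list:
    """Return a list of failure messages; an empty list means the job passes."""
    failures = []
    # Cost and latency may grow by at most max_regression over the baseline.
    if candidate.cost_usd > baseline.cost_usd * (1 + max_regression):
        failures.append(
            f"cost regression: {candidate.cost_usd:.2f} vs {baseline.cost_usd:.2f} USD")
    if candidate.p95_latency_s > baseline.p95_latency_s * (1 + max_regression):
        failures.append(
            f"latency regression: {candidate.p95_latency_s:.1f}s "
            f"vs {baseline.p95_latency_s:.1f}s")
    # Any tool outside the allowlist counts as a policy violation.
    disallowed = sorted(candidate.tools_used - allowed_tools)
    if disallowed:
        failures.append(f"policy violation: disallowed tools {disallowed}")
    return failures
```

In a CI script, the job would exit non-zero whenever `gate(...)` returns a non-empty list, and print the messages so the regression is visible in the build log.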
Fresh architecture paradigms...
01. Design agents eval-first: step budgets, cost caps, tool permissions, and required traces baked in from day one.
02. Choose the minimal runtime tier that meets needs; simpler boundaries reduce variance and blast radius.
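Eval-first design can be approximated with a small guard object that every agent step must pass through. This is a minimal sketch, not any framework's API: `RunGuard`, its fields, and the exception names are invented for illustration, and the per-step cost is assumed to be reported by your runtime.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a step would exceed the step budget or cost cap."""


class ToolNotAllowed(Exception):
    """Raised when the agent requests a tool outside its permissions."""


@dataclass
class RunGuard:
    max_steps: int
    max_cost_usd: float
    allowed_tools: frozenset
    steps: int = 0
    cost_usd: float = 0.0
    trace: list = field(default_factory=list)  # required trace, one entry per step

    def record_step(self, tool: str, cost_usd: float) -> None:
        """Check permissions and budgets, then append the step to the trace."""
        if tool not in self.allowed_tools:
            raise ToolNotAllowed(tool)
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} exhausted")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap of {self.max_cost_usd} USD exceeded")
        self.steps += 1
        self.cost_usd += cost_usd
        self.trace.append({"step": self.steps, "tool": tool, "cost_usd": cost_usd})
```

Because rejected steps never reach the trace, the stored trace reflects only actions the agent was actually permitted to take, which keeps diffs across upgrades easier to read.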