AGENTIC-QE SHIPS RUNTIME “ORACLE” EVALS, DURABLE-FIRST TESTS, AND A STABILITY LAYER
Agentic-QE now grades generated tests by running them against both real and deliberately broken code, and locks down its CLI/API behavior.
The new release of Agentic-QE adds runtime “oracle evals,” durable-first test generation, mutation scoring, cheaper model lanes, and a “conservation” layer that stabilizes commands and outputs, all outlined in the v3.11.3 notes.
This lines up with field reports that agent reliability is a variance and tail problem, not a single-step accuracy problem: you have to design for deadlines, budgets, and retries rather than hope for good averages. See the production lessons in Tail Control.
If you’re also wrangling specs and change management, OpenSpec’s early “Stores” beta v1.5.0 is a parallel attempt to bring order (with breaking changes likely), and the context lock-in brief explains why reliability plus stable interfaces tend to anchor model choices over time.
Runtime oracle evals surface silent failures and flaky tests that keyword-based checks miss, improving agent reliability under real deadlines.
A stabilized CLI/API reduces integration drift in CI/CD and scripting, cutting maintenance on test infra for agents.
- Run AQE oracle evals on a service you suspect is flaky; compare mutation score vs. coverage and see which tests survive refactors.
- Benchmark the cheaper model lanes vs. your frontier default using eval:live on your codebase; track pass rate, latency tails, and cost.
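For the lane benchmark, one way to compare results is to reduce each lane's runs to pass rate, latency percentiles, and cost per passing run. The sketch below assumes you have already collected per-run records yourself; the `Run` record and the lane labels are hypothetical and do not reflect the actual output format of eval:live.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Run:
    lane: str         # hypothetical lane label, e.g. "cheap" or "frontier"
    passed: bool      # did the eval pass for this run?
    latency_s: float  # wall-clock latency in seconds
    cost_usd: float   # cost attributed to this run

def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]

def summarize(runs: List[Run]) -> Dict[str, dict]:
    """Group runs by lane and report pass rate, latency tails, and cost per pass."""
    by_lane: Dict[str, List[Run]] = {}
    for r in runs:
        by_lane.setdefault(r.lane, []).append(r)

    summary = {}
    for lane, rs in by_lane.items():
        latencies = [r.latency_s for r in rs]
        n_pass = sum(r.passed for r in rs)
        total_cost = sum(r.cost_usd for r in rs)
        summary[lane] = {
            "pass_rate": n_pass / len(rs),
            "p50_s": percentile(latencies, 50),
            "p95_s": percentile(latencies, 95),
            # Cost per *passing* run penalizes lanes that are cheap but fail often.
            "cost_per_pass_usd": total_cost / n_pass if n_pass else float("inf"),
        }
    return summary
```

Reporting p95 next to the median keeps the comparison honest about tails: a lane can look faster on average and still blow a deadline on its slowest runs.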
Legacy codebase integration strategies...
- 01. Gate merges with oracle evals in canary mode first; pin AQE v3.11.3 and audit scripts against the conservation layer before flipping the default.
- 02. Instrument tail latency budgets from the edge (gateway timeout) down; fail fast and retry upstream rather than timing out inside an agent loop.
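The budget-from-the-edge advice in the second item can be sketched as a deadline object that is created once at the gateway and passed down. Before each agent step, the loop checks whether enough budget remains and fails fast if not, so the retry decision stays with the caller. All names here (`Deadline`, `run_agent_loop`, `est_step_s`) are hypothetical.

```python
import time
from typing import Callable, List

class DeadlineExceeded(Exception):
    """Raised when the remaining budget cannot cover the next step."""

class Deadline:
    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic):
        # The clock is injectable so the behavior can be tested without sleeping.
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return self._expires_at - self._clock()

    def require(self, needed_s: float) -> None:
        """Fail fast if less than needed_s of budget is left."""
        left = self.remaining()
        if left < needed_s:
            raise DeadlineExceeded(f"need {needed_s:.2f}s, only {left:.2f}s left")

def run_agent_loop(
    steps: List[Callable[[float], object]],
    deadline: Deadline,
    est_step_s: float,
) -> List[object]:
    """Run steps in order; refuse to start a step the budget cannot cover."""
    results = []
    for step in steps:
        deadline.require(est_step_s)
        # Each step receives the remaining budget so it can set its own timeouts.
        results.append(step(deadline.remaining()))
    return results
```

In this sketch the loop never sleeps or retries on its own. A caller that catches `DeadlineExceeded` can retry with a fresh budget or return a degraded response before the gateway timeout fires.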
Fresh architecture paradigms...
- 01. Start with durable tests (invariants, contracts, properties) and golden datasets as first-class assets; wire oracle evals into CI from day one.
- 02. If adopting OpenSpec Stores beta, isolate its usage behind a thin adapter to absorb breaking changes while the model stabilizes.
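To make the "durable tests" in the first item concrete: a durable test asserts properties that any correct implementation must satisfy, so it keeps passing across refactors that leave behavior unchanged. The sketch below uses only the standard library; `dedupe_keep_order` and `check_invariants` are hypothetical names chosen for illustration.

```python
import random
from typing import Callable, List

def dedupe_keep_order(items: List[int]) -> List[int]:
    """Remove duplicates, keeping the first occurrence of each value."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_invariants(fn: Callable[[List[int]], List[int]], trials: int = 200, seed: int = 0) -> bool:
    """Check properties of the result on random inputs, not specific outputs."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible in CI
    for _ in range(trials):
        data = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        result = fn(data)
        # Invariant 1: no duplicates remain.
        assert len(result) == len(set(result))
        # Invariant 2: nothing is lost and nothing is invented.
        assert set(result) == set(data)
        # Invariant 3: applying the function twice changes nothing (idempotence).
        assert fn(result) == result
        # Invariant 4: values appear in first-occurrence order.
        assert result == sorted(set(data), key=data.index)
    return True
```

Because the assertions describe the contract rather than one implementation's internals, swapping the loop for a different algorithm would leave the test untouched, while a behavior change such as returning sorted output would trip the ordering invariant.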