AGENTIC-WORKFLOWS PUB_DATE: 2026.06.29

AGENTIC-QE SHIPS RUNTIME “ORACLE” EVALS, DURABLE-FIRST TESTS, AND A STABILITY LAYER

Agentic-QE now grades generated tests by running them against real and deliberately-broken code, and locks down its CLI/API behavior.

The new release of Agentic-QE adds runtime “oracle evals,” durable-first test generation, mutation scoring, cheaper model lanes, and a “conservation” layer that stabilizes commands and outputs, all outlined in the v3.11.3 release notes.
This lines up with field reports that agent reliability is a variance and tail problem rather than a single-step accuracy problem: you have to design for deadlines, budgets, and retries instead of hoping the averages hold, as the production lessons in Tail Control argue.
If you’re also wrangling specs and change management, OpenSpec’s early “Stores” beta (v1.5.0) is a parallel attempt to bring order, with breaking changes likely, and a related brief on context lock-in explains why reliability plus stable interfaces tend to anchor model choices over time.

[ WHY_IT_MATTERS ]

01.

Runtime oracle evals surface silent failures and flaky tests that keyword-based checks miss, improving agent reliability under real deadlines.

02.

A stabilized CLI/API reduces integration drift in CI/CD and scripting, cutting maintenance on test infra for agents.

[ WHAT_TO_TEST ]

Run AQE oracle evals on a service you suspect is flaky; compare mutation score vs. coverage and see which tests survive refactors.
Benchmark the cheaper model lanes vs. your frontier default using eval:live on your codebase; track pass rate, latency tails, and cost.
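
One way to compare lanes on pass rate, latency tails, and cost is to summarize per-run records yourself. The sketch below assumes you have already exported one record per eval run; the `Run` fields and the `summarize` helper are hypothetical and are not part of AQE's output format.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Run:
    passed: bool
    latency_s: float
    cost_usd: float


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Report the tail (p95, p99) next to the median, not just an average."""
    latencies = [r.latency_s for r in runs]
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        "total_cost_usd": sum(r.cost_usd for r in runs),
    }
```

Running `summarize` once per lane on the same task set gives directly comparable rows. A cheaper lane that matches the median but doubles p95 is a different trade-off from one that matches both.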

[ BROWNFIELD_PERSPECTIVE ]

01.
Run oracle evals as a non-blocking canary gate on merges first; pin AQE v3.11.3 and audit existing scripts against the conservation layer before making the gate the default.
02.
Instrument tail latency budgets from the edge (gateway timeout) down; fail fast and retry upstream rather than timing out inside an agent loop.
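
The budget advice above can be sketched as a single deadline derived from the edge timeout and checked before every agent step. This is an illustrative pattern, not code from AQE or the Tail Control write-up; `Deadline`, `run_agent_steps`, and `retry_upstream` are hypothetical names.

```python
import time
from typing import Callable, List, Optional


class BudgetExceeded(Exception):
    """Raised when the remaining budget cannot cover the next step."""


class Deadline:
    """Absolute deadline computed once from the outermost (edge) timeout."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return self._expires_at - self._clock()

    def check(self, needed_s: float) -> None:
        left = self.remaining()
        if left <= needed_s:
            raise BudgetExceeded(f"{left:.2f}s left, next step needs {needed_s:.2f}s")


def run_agent_steps(steps: List[Callable[[float], object]],
                    deadline: Deadline, min_step_s: float) -> List[object]:
    """Fail fast: refuse to start a step that cannot finish inside the budget."""
    results = []
    for step in steps:
        deadline.check(min_step_s)
        results.append(step(deadline.remaining()))  # each step sees what is left
    return results


def retry_upstream(attempt: Callable[[Deadline], object],
                   new_deadline: Callable[[], Deadline], max_attempts: int) -> object:
    """Retry outside the agent loop, with a fresh budget per attempt."""
    last_error: Optional[BudgetExceeded] = None
    for _ in range(max_attempts):
        try:
            return attempt(new_deadline())
        except BudgetExceeded as err:
            last_error = err
    if last_error is None:
        raise ValueError("max_attempts must be at least 1")
    raise last_error
```

Passing the clock in as a parameter keeps the budget logic testable without real sleeps. In production the default `time.monotonic` avoids jumps from wall-clock adjustments.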

[ GREENFIELD_PERSPECTIVE ]

01.
Start with durable tests (invariants, contracts, properties) and golden datasets as first-class assets; wire oracle evals into CI from day one.
02.
If adopting OpenSpec Stores beta, isolate its usage behind a thin adapter to absorb breaking changes while the model stabilizes.
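
As a rough illustration of what a durable test looks like, the sketch below checks invariants of a function's output rather than fixed example values, so it keeps passing when the implementation is refactored. It uses only the standard library; `dedupe_keep_order` and `check_invariants` are hypothetical examples, not AQE-generated output.

```python
import random
from typing import Callable, List


def dedupe_keep_order(items: List[int]) -> List[int]:
    """Remove duplicates, keeping the first occurrence of each value."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def check_invariants(fn: Callable[[List[int]], List[int]],
                     trials: int = 200, seed: int = 0) -> int:
    """Assert properties that any correct implementation must satisfy."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    for _ in range(trials):
        data = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        result = fn(list(data))
        assert len(result) == len(set(result)), "output contains duplicates"
        assert set(result) == set(data), "output changed the set of elements"
        assert result == sorted(set(data), key=data.index), "first-seen order lost"
        assert fn(list(result)) == result, "function is not idempotent"
    return trials


check_invariants(dedupe_keep_order)
# A refactor with a different internal approach satisfies the same invariants.
check_invariants(lambda xs: list(dict.fromkeys(xs)))
```

A property-based testing library would add input shrinking and richer generators; the hand-rolled loop here only keeps the example self-contained.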
