
OPENAI PUB_DATE: 2026.03.26

CODING AGENTS IN PRODUCTION: ARCHITECTURE CHOICES, RELIABILITY BUDGETS, AND HITTING THE BRAKES

A wave of practitioner write-ups agrees: shipping coding agents is about reliability budgets and the right architecture, not flashy demos.

A practitioner panel at the AAAI 2026 workshop, captured by Kiro, shows that production success depends on orchestration, evaluation, cost controls, latency budgets, and trust surfaces, not just model capability ("From copilots to coworkers").

Atal Upadhyay’s end-to-end guide pushes classic engineering discipline for agents: data-first design, simple structures, hardening for the five production pain points, and readiness audits ("Agentic Engineering"). Nate’s taxonomy cuts through hype with four distinct agent architectures and a one-question diagnostic so you don’t pick the wrong tool for the job ("Four kinds of agents").

Simon Willison amplifies a caution from the Pi/OpenClaw world: unconstrained agents compound small mistakes fast, so enforce limits, keep humans in the loop, and protect architecture by hand ("Slowing the fuck down").

[ WHY_IT_MATTERS ]

01.

Reliability, cost, latency, and trust—not benchmark scores—decide whether coding agents stick in production.

02.

Picking the wrong agent architecture wastes money and floods codebases with low-quality changes.

[ WHAT_TO_TEST ]

Run a head-to-head on one repo: single-agent harness vs orchestration framework, measuring latency, pass rate, cost, and human interrupts.
Add write caps and required checks, then compare defect rate and rework between runs where agents exceed the limits and runs where output stays constrained.
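A write cap can be as small as a pre-merge check on the size of an agent's change. The sketch below is one minimal way to express it, not a prescribed implementation: the `WriteCaps` class, the `check_write_caps` function, and the default limits of 10 files and 400 lines are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteCaps:
    """Hypothetical per-change limits; the defaults are placeholders, not recommendations."""
    max_files: int = 10
    max_changed_lines: int = 400


def check_write_caps(changed_lines_by_file: dict[str, int], caps: WriteCaps) -> list[str]:
    """Return one human-readable violation for every cap the change exceeds."""
    violations: list[str] = []
    file_count = len(changed_lines_by_file)
    total_lines = sum(changed_lines_by_file.values())
    if file_count > caps.max_files:
        violations.append(f"touches {file_count} files (cap {caps.max_files})")
    if total_lines > caps.max_changed_lines:
        violations.append(f"changes {total_lines} lines (cap {caps.max_changed_lines})")
    return violations


if __name__ == "__main__":
    # Assumed input shape: file path -> number of lines the agent changed.
    change = {"src/app.py": 120, "src/util.py": 350}
    for problem in check_write_caps(change, WriteCaps()):
        print("blocked:", problem)
```

Logging every violation alongside the later defect and rework counts gives the constrained-versus-unconstrained comparison something concrete to measure.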

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

01.
Start with narrow-scope agents behind PRs, with tool scopes and code ownership; wire full audit logs and replayable traces.
02.
Introduce an eval harness and canary services before expanding permissions; gate by SLAs on latency and fix rate.
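Gating permissions on SLAs can be sketched as a single predicate over recent eval runs. Everything named here is an assumption made for illustration: the `EvalRun` record, the nearest-rank p95 calculation, and the thresholds passed in by the caller. Real limits would come from your own latency and fix-rate targets.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical record of one agent task from the eval harness."""
    latency_s: float
    fixed: bool  # True if the agent's change resolved the task


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def may_expand_permissions(
    runs: list[EvalRun], max_p95_latency_s: float, min_fix_rate: float
) -> bool:
    """Allow wider permissions only if both SLAs hold on the recorded runs."""
    if not runs:
        return False  # no evidence, no expansion
    fix_rate = sum(run.fixed for run in runs) / len(runs)
    latency_ok = p95([run.latency_s for run in runs]) <= max_p95_latency_s
    return latency_ok and fix_rate >= min_fix_rate


if __name__ == "__main__":
    runs = [EvalRun(2.0, True)] * 9 + [EvalRun(3.0, False)]
    print(may_expand_permissions(runs, max_p95_latency_s=5.0, min_fix_rate=0.8))
```

Returning `False` on an empty run list is a deliberate choice in this sketch: an agent with no eval history stays at its current scope.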

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

01.
Choose the agent type up front (harness, dark factory, auto research, orchestration) and encode specs and metrics as code.
02.
Design the data layer first: durable memory, task queues, idempotency, and telemetry that answers "is it helping?"
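To make the idempotency point concrete, here is a minimal in-memory sketch of a task queue that refuses duplicate submissions. The `IdempotentTaskQueue` class and its method names are hypothetical, and the in-memory set and deque stand in for whatever durable store a real deployment would use; this version loses its state on restart.

```python
from collections import deque
from typing import Optional


class IdempotentTaskQueue:
    """Toy FIFO queue that accepts each idempotency key at most once."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self._pending: deque[tuple[str, dict]] = deque()

    def submit(self, idempotency_key: str, payload: dict) -> bool:
        """Enqueue the task unless this key was already submitted."""
        if idempotency_key in self._seen:
            return False  # duplicate, e.g. a retried request
        self._seen.add(idempotency_key)
        self._pending.append((idempotency_key, payload))
        return True

    def next_task(self) -> Optional[tuple[str, dict]]:
        """Pop the oldest pending task, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None


if __name__ == "__main__":
    queue = IdempotentTaskQueue()
    print(queue.submit("fix-issue-1", {"repo": "example"}))  # accepted
    print(queue.submit("fix-issue-1", {"repo": "example"}))  # rejected as duplicate
```

Keys stay in the seen-set after a task is popped, so a retry that arrives after the work has started is still rejected.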
