ANTHROPIC PUB_DATE: 2026.03.26

ANTHROPIC’S THREE-AGENT HARNESS KEEPS LONG-RUNNING CODING AGENTS ON TRACK

Anthropic details a three-agent harness that keeps Claude coherent on multi-hour autonomous coding tasks by decomposing work and grading outputs.

Anthropic’s engineering team shares how a planner–generator–evaluator setup improved Claude’s ability to ship full-stack apps in multi-hour runs by chunking work and scoring outputs against clear rubrics. Their post covers common failure modes like context drift and “context anxiety,” and how structured artifacts help carry state across sessions.

Inspired by GAN-style roles, a generator produces changes while an evaluator grades them against concrete criteria. A planner feeds steady, bite-sized tasks. The result: fewer derailments, more coherent progress, and better handoffs over long sessions.
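The loop above can be sketched in a few lines. This is a minimal illustration, not Anthropic's actual harness: every class, function, threshold, and stubbed behavior here is an assumption made for the example.

```python
# Minimal sketch of a planner–generator–evaluator loop.
# All names and stubbed behaviors are illustrative, not Anthropic's API.
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    done: bool = False


@dataclass
class Review:
    score: float    # 0.0–1.0 against the rubric
    feedback: str


def plan(spec: str) -> list[Task]:
    """Planner: break the spec into small, ordered tasks (stubbed)."""
    return [Task(d.strip()) for d in spec.split(";")]


def generate(task: Task, feedback: str = "") -> str:
    """Generator: produce a change for one task (stubbed)."""
    suffix = f" (revised: {feedback})" if feedback else ""
    return f"patch for: {task.description}{suffix}"


def evaluate(patch: str) -> Review:
    """Evaluator: grade the patch against concrete criteria (stubbed)."""
    ok = patch.startswith("patch for:")
    return Review(score=1.0 if ok else 0.0,
                  feedback="" if ok else "missing patch body")


def run(spec: str, max_retries: int = 2) -> list[str]:
    """Drive each planned task through generate/evaluate until accepted."""
    accepted = []
    for task in plan(spec):
        feedback = ""
        for _ in range(max_retries + 1):
            patch = generate(task, feedback)
            review = evaluate(patch)
            if review.score >= 0.8:      # acceptance threshold (assumed)
                accepted.append(patch)
                task.done = True
                break
            feedback = review.feedback   # retry with evaluator feedback


    return accepted


patches = run("add login endpoint; write tests; update docs")
```

The key design point is that the evaluator's feedback flows back into the generator's next attempt, so rejected work is revised rather than silently dropped.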

[ WHY_IT_MATTERS ]
01.

Agent performance on long jobs depends as much on harness design as on model quality.

02.

A planner–generator–evaluator loop could stabilize multi-hour ETL, migrations, and infra updates where drift and partial context often burn teams.

[ WHAT_TO_TEST ]
  • terminal

    Prototype a planner–generator–evaluator loop on a non-critical repo; measure task completion rate, diff size, revert rate, and test pass rate over 2–4 hour runs.

  • terminal

    Compare structured artifacts plus resumable state versus a single long prompt on tasks exceeding context windows; track coherence loss and recovery time.
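The second experiment above hinges on resumable state. A minimal sketch of carrying progress across sessions via a structured artifact, assuming a simple JSON schema of our own invention (the file name and fields are not from the post):

```python
# Sketch: persist agent progress to a structured artifact so a fresh
# session can resume instead of replaying a single long prompt.
# File location and schema are assumptions for illustration.
import json
import os
import tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), "agent_state.json")


def save_state(completed: list[str], pending: list[str]) -> None:
    """Write progress so the next session starts from known state."""
    with open(STATE_FILE, "w") as f:
        json.dump({"completed": completed, "pending": pending}, f)


def load_state() -> dict:
    """Load prior progress; an empty state means a cold start."""
    if not os.path.exists(STATE_FILE):
        return {"completed": [], "pending": []}
    with open(STATE_FILE) as f:
        return json.load(f)


# Session 1 finishes two tasks, then hits the context limit.
save_state(completed=["scaffold app", "add models"], pending=["write tests"])

# Session 2 resumes from the artifact rather than re-reading the full history.
state = load_state()
```

Measuring "recovery time" then reduces to timing how long the resumed session takes to produce its first accepted change.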

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Wrap the agent behind your existing CI, linters, and service mocks; only allow PRs with evaluator-approved rubrics and tests to merge.

  • 02.

    Seed the evaluator with your legacy coding standards and non-functional checks to prevent regressions in critical modules.
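The merge gate in point 01 can be expressed as a small CI check. The rubric field names below are assumptions; substitute the signals your existing CI already produces.

```python
# Sketch of a merge gate: block a PR unless the evaluator's rubric passed.
# Rubric field names are illustrative; wire in your own CI signals.

def merge_allowed(report: dict) -> bool:
    """Require every rubric item plus the legacy test suite to pass."""
    required = ("tests_pass", "lint_clean", "no_critical_module_regression")
    return all(report.get(key) is True for key in required)


# Example evaluator report attached to an agent-authored PR.
report = {
    "tests_pass": True,
    "lint_clean": True,
    "no_critical_module_regression": True,
}
```

Using `report.get(key) is True` means a missing or ambiguous rubric item fails closed, which is the safer default for agent-authored changes to legacy code.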

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design specs as decomposable tasks with machine-readable artifacts from day one so agents can resume safely.

  • 02.

    Codify evaluator rubrics for correctness, latency budgets, and data quality to enforce guardrails automatically.
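The two points above can be combined into one machine-readable task spec that the evaluator checks directly. A minimal sketch, assuming invented field names and budgets (nothing here comes from the post):

```python
# Sketch: a decomposable task spec with rubric thresholds the evaluator
# can enforce automatically. All names and budgets are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    max_p95_latency_ms: int   # latency budget
    min_test_coverage: float  # correctness proxy, 0.0–1.0


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    description: str
    rubric: Rubric


def meets_rubric(spec: TaskSpec, p95_ms: int, coverage: float) -> bool:
    """Evaluator guardrail: accept only results within the declared budgets."""
    return (p95_ms <= spec.rubric.max_p95_latency_ms
            and coverage >= spec.rubric.min_test_coverage)


spec = TaskSpec(
    task_id="T1",
    description="add login endpoint",
    rubric=Rubric(max_p95_latency_ms=200, min_test_coverage=0.8),
)
```

Because the spec is frozen data rather than prose, an agent can load it in a later session and re-derive exactly which guardrails apply.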
