OPENAI PUB_DATE: 2026.04.19

SANDBOXED CODING AGENTS: OPENAI UPDATES ITS AGENTS SDK, AND THERE’S A CLEAR WAY TO EVALUATE THEM

OpenAI’s Agents SDK now includes sandboxing and a model harness, and there’s a practical way to benchmark agentic coding and SRE bots.

OpenAI shipped an Agents SDK update that adds sandboxing and a new model harness, tightening control over what agent runs can touch and how they run (DevOps.com). That reduces blast radius for codebase reads and shell actions.

If you’re trying these agents, treat evaluation as a system problem. Promptfoo outlines capability tiers, default safety postures, and CI-friendly comparisons across plain LLMs, SDK agents, and rich app servers, with concrete guidance on cost/latency/tool-call tradeoffs (Evaluate Coding Agents). For ops inspiration, see an SRE-focused agent framework that automates incident investigation (OpenSRE talk).

[ WHY_IT_MATTERS ]
01.

Sandboxing and harness controls curb accidental file, shell, or network access while you pilot agent workflows in real repos and environments.

02.

A consistent eval method lets you compare plain LLMs versus agents on accuracy, cost, latency, and side effects before rolling into CI or on-call.

[ WHAT_TO_TEST ]
  • terminal

    Run the same repository task across tiers (plain LLM vs SDK agent vs app-server) with promptfoo; record tool-call counts, latency, and dollar cost.

  • terminal

    In a staging repo, exercise the Agents SDK sandbox; verify file and network allow/deny rules, working directory boundaries, and safe fallbacks.
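The tier comparison above can be expressed as a promptfoo config that runs one repository task against each tier and asserts on latency and cost. This is a sketch: the provider IDs, wrapper scripts, and thresholds are illustrative assumptions, not a fixed recipe.

```yaml
# promptfooconfig.yaml -- sketch; wrapper scripts and thresholds are placeholders
description: "Same repo task across tiers: plain LLM vs SDK agent vs app server"

prompts:
  - "Fix the failing test in {{repo_path}} and explain the change."

providers:
  - id: openai:gpt-4.1            # tier 1: plain LLM, no tools
  - id: exec:./run-sdk-agent.sh   # tier 2: wraps an Agents SDK run (hypothetical script)
  - id: exec:./run-app-server.sh  # tier 3: wraps a rich app server (hypothetical script)

tests:
  - vars:
      repo_path: ./fixtures/staging-repo
    assert:
      - type: latency
        threshold: 60000   # ms; fail this row if slower
      - type: cost
        threshold: 0.25    # USD per run; fail if pricier
```

Record tool-call counts from the agent wrappers' own logs; promptfoo's output then gives you a side-by-side table per tier per run.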
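Before exercising the real SDK sandbox, it can help to write down your allow/deny expectations as a tiny policy model and rehearse them. The sketch below is a generic stand-in, assuming a default-deny posture for paths and hosts; it is not the Agents SDK's actual configuration API.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

@dataclass
class SandboxPolicy:
    """Toy allow/deny model for rehearsing sandbox expectations (not the SDK API)."""
    workdir: str
    allow_paths: list = field(default_factory=list)
    allow_hosts: list = field(default_factory=list)

    def allows_path(self, path: str) -> bool:
        # A path is allowed only under the working directory or an explicit root.
        p = PurePosixPath(path)
        roots = [self.workdir, *self.allow_paths]
        return any(p.is_relative_to(root) for root in roots)

    def allows_host(self, host: str) -> bool:
        # Network is default-deny: only explicitly listed hosts pass.
        return host in self.allow_hosts

policy = SandboxPolicy(workdir="/repo", allow_paths=["/tmp/agent"], allow_hosts=["pypi.org"])

assert policy.allows_path("/repo/src/main.py")      # inside working directory
assert not policy.allows_path("/etc/passwd")        # outside all allowed roots
assert policy.allows_host("pypi.org")
assert not policy.allows_host("internal.example")   # default-deny network
```

Each assertion corresponds to a check you would then repeat against the real sandbox in the staging repo, including the safe-fallback case where a denied action should fail closed rather than escalate.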

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Introduce agents behind feature flags with scoped credentials and read-only clones; log every tool call for audit and rollback.

  • 02.

    Add an agent-eval job in CI that fails builds if cost, latency, or tool-call volume regress beyond thresholds.
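The "log every tool call" step above can be sketched as a decorator that records each invocation, its arguments, and its outcome to an append-only trail. The tool and function names here are hypothetical; in production the list would feed a real log sink rather than an in-memory list.

```python
import functools
import json
import time

AUDIT_LOG = []  # stand-in for an append-only file or log sink

def audited(tool_name):
    """Wrap a tool function so every call is recorded for audit and rollback."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            entry = {"tool": tool_name, "args": args, "kwargs": kwargs,
                     "ts": time.time()}
            try:
                entry["result"] = fn(*args, **kwargs)
                return entry["result"]
            except Exception as exc:
                entry["error"] = repr(exc)
                raise
            finally:
                # Round-trip through JSON so the record is serializable as logged.
                AUDIT_LOG.append(json.loads(json.dumps(entry, default=str)))
        return wrapper
    return decorator

@audited("read_file")
def read_file(path):
    return f"<contents of {path}>"  # stand-in for a real read-only tool

read_file("README.md")
assert AUDIT_LOG[0]["tool"] == "read_file"
assert AUDIT_LOG[0]["args"] == ["README.md"]
```

Because failures are logged in the `finally` block, denied or crashing tool calls still leave an audit record, which is what makes rollback and post-incident review possible.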
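The CI gate above reduces to comparing the eval run's metrics against fixed thresholds and exiting nonzero on a regression. The metric names and limits below are illustrative assumptions, not the output schema of any particular eval tool.

```python
# Sketch of a CI gate: fail the build if agent-eval metrics regress past thresholds.
import sys

THRESHOLDS = {"cost_usd": 0.50, "latency_ms": 90_000, "tool_calls": 40}

def check(metrics: dict) -> list:
    """Return human-readable violations; an empty list means the gate passes."""
    return [
        f"{name}={metrics[name]} exceeds threshold {limit}"
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    ]

if __name__ == "__main__":
    # In CI, this dict would be parsed from the eval run's JSON output.
    current = {"cost_usd": 0.31, "latency_ms": 42_000, "tool_calls": 18}
    violations = check(current)
    for v in violations:
        print("REGRESSION:", v)
    sys.exit(1 if violations else 0)
```

Wiring this as the last step of the agent-eval job makes the build status itself the enforcement mechanism, so no one has to eyeball dashboards to catch a cost or tool-call blowup.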

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Start with the minimal tool set in the sandbox and add capabilities only when evals show measurable ROI.

  • 02.

    Prototype SRE agents that produce investigative runbooks first; keep human-in-the-loop approvals for any actioning phase.
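The human-in-the-loop split above can be sketched as a gate where investigative (read-only) actions run freely and anything that mutates the system blocks until a named human approves. All action names here are illustrative, not from any specific SRE framework.

```python
# Minimal human-in-the-loop gate: investigation is auto-approved,
# actioning requires an explicit human approver.
READ_ONLY_ACTIONS = {"fetch_logs", "query_metrics", "describe_pod"}

def run_step(action, approved_by=None):
    if action in READ_ONLY_ACTIONS:
        return f"ran {action} (investigation, auto-approved)"
    if approved_by is None:
        return f"BLOCKED {action}: requires human approval"
    return f"ran {action} (approved by {approved_by})"

assert "auto-approved" in run_step("fetch_logs")
assert run_step("restart_service").startswith("BLOCKED")
assert "approved by alice" in run_step("restart_service", approved_by="alice")
```

The point of the split is that the agent can assemble a full investigative runbook on its own, while the blast-radius-bearing "actioning" phase stays behind an approval that is attributable to a person.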
