Ship safer LLM agents with multi-turn, r…

OPENAI PUB_DATE: 2026.03.20

SHIP SAFER LLM AGENTS WITH MULTI-TURN, REGULATION-AWARE EVALS

DeepEval brings multi-turn, policy-aware testing for LLM chats into reach, while practitioners converge on structured prompts over tone tweaks. A new walkthrou...

DeepEval brings multi-turn, policy-aware testing for LLM chats into reach, while practitioners converge on structured prompts over tone tweaks.

A new walkthrough shows how to use DeepEval’s ConversationalTestCase and ConversationalGEval to write multi-turn tests that enforce policies like “no investment advice,” with a simple setup, self-hosting, and a UI for turn-by-turn inspection blog.

Parallel threads from the OpenAI community stress separating structure from tone in prompts for real-world workflows, offer a structured prompt framework to reduce cognitive load, and discuss “humanized” style without sacrificing reliability (structure vs tone, structured framework, humanized content).

Taken together: move beyond single-turn checks, gate releases on conversation-level tests, and treat tone as a layer on top of a strict prompt contract.

[ WHY_IT_MATTERS ]

01.

Single-turn evals miss regressions that only appear across conversation state and policy boundaries.

02.

Regulation-aware tests reduce compliance risk and give clearer CI gates for agent updates.

[ WHAT_TO_TEST ]

terminal
Stand up a DeepEval suite with a multi-turn compliance metric and run it on recent prod transcripts to baseline failure rates.
terminal
A/B structured prompts vs. “humanized” prompts; compare task success, latency, and token cost under identical evals.

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

01.
Wrap existing chat endpoints with an eval harness and block deploys on multi-turn test regressions.
02.
Log conversations with prompt version, model, and seed to reproduce failing turns in the eval UI.

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

01.
Define a prompt contract and evaluation metrics up front, separating structure from tone and compliance rules.
02.
Instrument traces and store turn-level artifacts to power conversation-aware regression tests from day one.

arrow_back

PREVIOUS_DATA_LOG

Codex Agents: Early Bugs, Cost Spikes, and a File Deletion Scare

Initialize_Return_to_Core

LINK_STATUS: 127.0.0.1 (SECURE)

NEXT_DATA_LOG

Claude Sonnet 4.6 targets deeper reasoning and structured outputs for repo-scale coding work

arrow_forward