Engineering

Production Replay Testing in 2026: How to Simulate Real Sessions, Traces, and Calls

Synthetic test cases can't reproduce the bug a real user hit. Production replay reruns the exact session, trace, or voice call against your fixed agent.

May 29, 2026

7 min read

agent-simulation production-replay regression-testing voice-agents observability 2026


Originally published May 29, 2026.

A user had a bad conversation with your support agent on Tuesday. On turn four it called issue_refund instead of check_status, and you only found out from the trace. You change the system prompt to fix it. Now the real question: did you fix it? You cannot re-run Tuesday’s conversation, because it happened in production and all you kept was a transcript in a logs table. So you write a synthetic test case that approximates it, watch that pass, and ship on faith.

That faith is the gap production replay testing closes. This post covers what replay is, why synthetic cases cannot reproduce real failures, and how to rerun the exact production session, trace, or voice call against your fixed agent, then keep it as a regression test.

What Is Production Replay Testing?

Production replay testing is the practice of rerunning a real production session, trace, or voice call against your changed agent, instead of testing on synthetic cases. You select the exact conversation that failed in production from your observability data, regenerate it as a simulation scenario, and run it end-to-end against your dev agent. Because the input is the real interaction a user actually had, it reproduces the failure synthetic test cases miss.

The closing move is that the replay is reusable. Save the replayed scenario into your regular runs and a one-off production bug becomes a permanent regression test, so the failure you just fixed cannot quietly come back.

Why Can’t Synthetic Test Cases Reproduce Production Failures?

Synthetic test cases are generated from a description of what you expect users to do. That makes them good at breadth and blind to specifics. The failure that actually broke production usually came from something you did not describe: an unusual phrasing, a rare context, a tool result you did not anticipate, a multi-turn history that set up the mistake.

The useful split gives each approach its due. Generated scenarios cover the space you can imagine, and that coverage is real value. Replay covers the space you could not imagine, because production already found it for you. One gives you breadth; the other gives you the specific bug. You want both, and you reach for replay the moment a real interaction goes wrong and you need to reproduce exactly that interaction, not something like it.

How Do You Replay a Production Session or Trace?

Replay builds on data you already have. With Observe capturing production sessions and traces, you select what to replay and the platform regenerates it as a scenario. The first choice is the unit:

Replay type	What reruns	Use when
Session	All traces sharing a `session_id`, ordered by start time, as one multi-turn conversation	Reproducing failures that depend on conversation history
Trace	Each selected trace as a single-turn conversation, input to output	Reproducing individual calls or single-turn interactions in bulk
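
To make the two units concrete, here is a minimal sketch of how the same captured traces turn into different conversations under each replay type. The `Trace` record and both helper functions are hypothetical and only illustrate the grouping rule in the table; they are not the platform's data model or API.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import attrgetter


@dataclass
class Trace:
    # Hypothetical minimal trace record; real captured traces carry more fields.
    trace_id: str
    session_id: str
    start_time: float  # epoch seconds
    user_input: str
    agent_output: str


def as_session_conversations(traces):
    """Session replay: one multi-turn conversation per session_id, ordered by start time."""
    ordered = sorted(traces, key=attrgetter("session_id", "start_time"))
    return {
        session_id: [(t.user_input, t.agent_output) for t in group]
        for session_id, group in groupby(ordered, key=attrgetter("session_id"))
    }


def as_trace_conversations(traces):
    """Trace replay: every trace is its own single-turn conversation."""
    return {t.trace_id: [(t.user_input, t.agent_output)] for t in traces}


traces = [
    Trace("t2", "s1", 20.0, "Where is my order?", "Let me check."),
    Trace("t1", "s1", 10.0, "Hi", "Hello, how can I help?"),
    Trace("t3", "s2", 5.0, "Cancel my plan", "Done."),
]
sessions = as_session_conversations(traces)  # two conversations, s1 has two turns
singles = as_trace_conversations(traces)     # three single-turn conversations
```

The session view keeps the history that set up a mistake, while the trace view drops it, which is why the choice of unit matters for history-dependent failures.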

The flow is five steps, none of which needs new instrumentation:

1. Select production data and create a replay session with the project_id, the replay_type (session or trace), and either a list of ids or select_all.
2. Generate the scenario from the transcripts. You supply an agent_name, a scenario_name, an agent_type (text for chat), and a no_of_rows count (default 20). The platform creates or updates an agent definition and builds a graph scenario sourced from the production conversations.
3. Create a run test that uses the replay session’s agent definition and scenario, passing the replay_session_id to link them.
4. Run the simulation from the UI or the chat simulation SDK, exactly like any other run.
5. View results and iterate: change the agent, replay again, compare.
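
The first three steps can be sketched as plain request payloads. The field names (project_id, replay_type, ids, select_all, agent_name, scenario_name, agent_type, no_of_rows, replay_session_id) come from the steps above; the dictionary shapes, the helper functions, and the `agent_definition_id` and `scenario_id` keys are assumptions made for illustration. No endpoint or SDK call is shown, because the post does not specify one.

```python
def replay_session_payload(project_id, replay_type, ids=None, select_all=False):
    """Step 1: choose production data. The payload shape is an assumption."""
    if replay_type not in ("session", "trace"):
        raise ValueError("replay_type must be 'session' or 'trace'")
    if bool(ids) == bool(select_all):
        raise ValueError("pass exactly one of: a list of ids, or select_all=True")
    payload = {"project_id": project_id, "replay_type": replay_type}
    if select_all:
        payload["select_all"] = True
    else:
        payload["ids"] = list(ids)
    return payload


def scenario_payload(agent_name, scenario_name, agent_type="text", no_of_rows=20):
    """Step 2: generate the scenario. no_of_rows defaults to 20, as stated above."""
    return {
        "agent_name": agent_name,
        "scenario_name": scenario_name,
        "agent_type": agent_type,
        "no_of_rows": no_of_rows,
    }


def run_test_payload(replay_session_id, agent_definition_id, scenario_id):
    """Step 3: link the run test to the replay session (the two id keys are hypothetical)."""
    return {
        "replay_session_id": replay_session_id,
        "agent_definition_id": agent_definition_id,
        "scenario_id": scenario_id,
    }


session_req = replay_session_payload("proj_1", "session", ids=["sess_tuesday"])
scenario_req = scenario_payload("support-agent", "tuesday-refund-bug")
```

Validating that exactly one of ids or select_all is present catches the most likely mistake before any request is sent.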

What you get back is the same result surface as any simulation: chat completion stats, system metrics (avg output tokens, latency, turn count, CSAT), aggregated eval scores, and a turn-by-turn transcript with a diff against the original production conversation. Because the scores attach to the run, you can layer trace-native evaluation on top to grade each replayed turn automatically.
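
As a rough picture of what a turn-by-turn diff against the original conversation involves, the sketch below compares two transcripts with Python's standard `difflib`. The transcript format (a list of "speaker: text" strings) and the sample turns are invented for illustration; the platform's own diff view may work differently.

```python
import difflib


def diff_transcripts(original, replayed):
    """Return unified-diff lines between two transcripts (lists of 'speaker: text' strings)."""
    return list(
        difflib.unified_diff(
            original, replayed, fromfile="production", tofile="replay", lineterm=""
        )
    )


# Illustrative transcripts: the production run called the wrong tool, the replay does not.
original = [
    "user: Where is my refund?",
    "agent: [tool] issue_refund",
]
replayed = [
    "user: Where is my refund?",
    "agent: [tool] check_status",
]

for line in diff_transcripts(original, replayed):
    print(line)
```

Lines prefixed with `-` appear only in the production transcript and lines prefixed with `+` only in the replay, so a changed tool call shows up as one removed and one added line.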

How Does Voice Replay Differ?

Voice replay reruns real production voice calls, and it has to reconstruct more than a transcript. The platform extracts the original voice configuration (the system prompt, assistant settings, and provider config) from the production trace’s call log, then builds a voice agent definition with a snapshot matching the original call. From there it generates a scenario and reruns it through voice simulation.

The results are richer because voice has more to compare. You get a side-by-side transcript comparison, a performance-metrics comparison, and audio playback of both the baseline production call and the replayed one, so a fix for a misheard order or a bad tone is something you can hear, not just read. Voice replay supports Vapi as the primary provider, with Retell supported for transcript comparison.
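
A performance-metrics comparison reduces to a per-metric delta between the baseline call and the replayed one. The sketch below assumes each call's metrics arrive as a flat dictionary; the metric names and all numbers are made up for illustration and are not real results.

```python
def compare_metrics(baseline, replay):
    """Per-metric delta (replay minus baseline) for metrics present in both calls."""
    shared = sorted(baseline.keys() & replay.keys())
    return {name: replay[name] - baseline[name] for name in shared}


# Illustrative values only, not measurements from any real call.
baseline_call = {"agent_latency_ms": 3000, "agent_stop_latency_ms": 300, "turn_count": 8}
replayed_call = {"agent_latency_ms": 2500, "agent_stop_latency_ms": 250}

deltas = compare_metrics(baseline_call, replayed_call)
```

A negative delta on a latency metric means the replayed call was faster than the baseline; metrics missing from either side are skipped rather than guessed.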

How Do You Turn a Replay Into a Regression Test?

The first replay reproduces the bug and proves the fix. The second value, the one that compounds, is keeping it. Because the replay produced a scenario generated from the real transcript, you save that scenario into your regular simulation runs and it becomes a permanent test case.

From then on, every prompt, model, or tool change runs against it. The Tuesday refund bug is no longer a thing you fixed and hope stays fixed; it is a scenario that fails loudly the moment a change reintroduces it. This is the difference between firefighting and a growing safety net: each production incident you replay adds one more real-world case to the suite, built from traffic instead of imagination.
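
Here is a minimal sketch of what such a saved check could assert, using the refund example from the introduction. The turn structure (a list of dictionaries with an optional `tool_call` key) and both helper functions are hypothetical; in practice the assertion would run against whatever transcript your simulation run returns.

```python
def tool_on_turn(turns, turn_number):
    """Return the tool the agent called on a 1-indexed turn, or None if it called none."""
    if turn_number < 1 or turn_number > len(turns):
        return None
    return turns[turn_number - 1].get("tool_call")


def passes_refund_regression(turns):
    """The Tuesday bug: turn four must call check_status, not issue_refund."""
    return tool_on_turn(turns, 4) == "check_status"


# Illustrative replay outputs before and after the prompt fix.
broken_run = [{}, {}, {}, {"tool_call": "issue_refund"}]
fixed_run = [{}, {}, {}, {"tool_call": "check_status"}]

assert passes_refund_regression(fixed_run)
assert not passes_refund_regression(broken_run)
```

Pinning the check to the specific turn and tool keeps the test tied to the real incident rather than to a general impression that the conversation went well.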

Figure: Future AGI Replay execution results showing Performance Metrics for a completed 10-call replay session: 10 Total Calls, 10 Connected, 100% connection rate, Agent Latency 2867ms, Agent WPM 229.1, Agent Stop Latency 246ms, and Evaluation Metrics with Avg Toxicity at 100%, plus a full call details list with timestamps, end reasons, and per-call scores.

How Does Production Replay Compare to Synthetic Scenarios?

Dimension	Synthetic scenarios	Production replay
Source	Generated from a description	A real captured session or trace
Reproduces the exact failure	No, an approximation	Yes, the real interaction
Before/after comparison	None	Diff and metrics vs the original
Coverage	Breadth you can imagine	The specific case production found
Best for	Pre-ship breadth and edge generation	Debugging and regressing real incidents

The two are complements, not rivals. Generate scenarios to cover the map; replay production to fix the spots the map missed.

Where It Falls Short

It needs Observe in place first. Replay reruns data that observability captured, so production sessions and traces have to be flowing into the platform before there is anything to replay. The flip side is there is no new integration to add for replay itself.
Voice replay is Vapi-first. Vapi is the primary supported provider for config extraction; Retell is supported for transcript comparison, with config extraction optimized for Vapi’s data structure.
It reproduces the conversation; it is not the original runtime. Replay reruns the captured conversation against your dev agent, so it is only as faithful as the trace you captured. Instrument well, and the replay is close; instrument thinly, and you replay less than what happened.

Why Replay Belongs in Your Testing Loop

The bugs that matter most are the ones you did not predict, and by definition your synthetic tests did not cover them either. Production already ran the experiment that found them; replay is how you get the result back into your dev loop instead of leaving it in a logs table. Rerun the real session against the fix, compare it turn by turn to what shipped, and keep the scenario so it guards the fix forever. The conversation that broke on Tuesday becomes the test that protects you on every Tuesday after.

Want to rerun the exact session that broke in production? Connect Future AGI Observe and use Replay to turn a real session, trace, or voice call into a simulation you can fix against and keep.


Frequently asked questions

What is production replay testing?

Production replay testing reruns a real production conversation against your changed agent, rather than testing on synthetic inputs. You pick the exact session, trace, or voice call from your observability data, regenerate it as a simulation scenario, and run it end-to-end against your dev agent. Because the input is the interaction a real user actually had, it reproduces failures that synthetic test cases never recreate. In Future AGI it is the Replay feature: it builds on Observe (which captured the production data) and chat or voice simulation (which reruns it), with no new integration to add.

How is replaying production different from synthetic test data?

Synthetic test data is invented; replay uses what actually happened. Synthetic cases are generated from a description of what you think users will do, which is useful for breadth but blind to the specific phrasing, edge case, or context that broke production. Replay takes the real session or trace, with its real turns and tool calls, and reruns it. The practical split: generate synthetic scenarios to cover the space you can imagine, and replay production to fix the failures you could not. Replay also gives you a before-and-after comparison against the original conversation, which synthetic cases cannot.

What is the difference between replaying a session and a trace?

A session replay reruns a full multi-turn conversation; a trace replay reruns a single call. In Future AGI, a session groups all traces sharing a session_id, ordered by start time, so replaying it reproduces the entire back-and-forth as one multi-turn chat scenario. A trace replay treats each selected trace as its own single-turn conversation, input to output. Use session replay to reproduce conversational failures that depend on history, and trace replay to reproduce individual calls or single-turn interactions in bulk.

Can you replay production voice calls?

Yes, voice replay reruns real production voice calls in a dev environment. Future AGI extracts the original voice configuration (system prompt, assistant settings, provider config) from the production trace's call log, builds a voice agent definition with a snapshot matching the original call, generates a scenario from the conversation, and reruns it through voice simulation. Results include a side-by-side transcript comparison, a performance-metrics comparison, and audio playback of both the baseline and the replayed call. Voice replay supports Vapi as the primary provider, with Retell supported for transcript comparison.

How do you turn a production failure into a regression test?

Replay it once, confirm the fix, then save the replayed scenario into your regular simulation runs. The replay produces a scenario generated from the real transcript; once you have used it to reproduce and fix the bug, it becomes a permanent test case that runs alongside your other scenarios. The next time a prompt, model, or tool change risks reintroducing that failure, the saved replay catches it. This is how a one-off production incident becomes a guardrail instead of a recurring surprise.

Do you need a new integration to replay production data?

No. Replay builds on two things you already have: Observe, which captures production sessions and traces, and chat or voice simulation, which reruns them. If Observe is integrated and sending production data to the platform, you can create a replay session from that data without adding any new instrumentation. You need your FI_API_KEY and FI_SECRET_KEY for the replay and simulation APIs, and for SDK-driven runs, a chat agent callback and any LLM provider keys it uses.
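
A small pre-flight check can confirm both keys are present before a replay run starts. This sketch assumes the keys are supplied as environment variables named FI_API_KEY and FI_SECRET_KEY; the helper function is hypothetical and does not call any Future AGI API.

```python
import os


def load_replay_credentials(env=None):
    """Return (api_key, secret_key), or raise if either variable is missing or empty."""
    env = os.environ if env is None else env
    required = ("FI_API_KEY", "FI_SECRET_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["FI_API_KEY"], env["FI_SECRET_KEY"]


# Example with an explicit mapping instead of the real environment.
api_key, secret_key = load_replay_credentials(
    {"FI_API_KEY": "example-key", "FI_SECRET_KEY": "example-secret"}
)
```

Failing early with the names of the missing variables is easier to debug than an authentication error partway through a run.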
