2026 W2

Chat Simulation via Observe, Pre-Built Evaluation Groups, and Fix My Agent for Chat

Launch chat simulations directly from real production conversations, pick from 10 ready-to-use evaluation groups with no configuration, and get Fix My Agent diagnostics for chat agents.

Simulate, Evaluate, Agents, Platform, API
10 Pre-built evaluation groups
1-click Simulate from Observe

Chat Simulation via Observe

The most requested feature since Chat Simulation V1 launched: “let me simulate starting from a real conversation.”

What’s new

  • Browse, click, simulate. Find a customer interaction in Observe (a complex support case, a successful upsell, a confusing onboarding flow) and launch a simulation from it with a single click.
  • Auto-generated persona from the real user. The system extracts the customer’s communication style, intent progression, and behavioral patterns and generates a simulation persona that mirrors them.
  • Scenario preserves structure. The simulated scenario keeps the structure of the original conversation, while the agent is free to respond differently based on its current configuration.

Why it matters

The path from “I noticed something interesting in production” to “I have a repeatable test for it” collapses to one click. No manual scenario authoring, no persona configuration, no transcript formatting.

Who it’s for

Quality assurance (QA) and product teams who want their test suite to continuously reflect production conversations rather than a curated ideal set.

Pre-Built Evaluation Groups

Not every team wants to configure evaluations from scratch. Ten pre-built groups give you a baseline quality assessment out of the box.

What’s new: the 10 groups

  • Accuracy and Faithfulness. Factual correctness, hallucination detection, source attribution.
  • Conversational Quality. Coherence, relevance, helpfulness, tone consistency.
  • Task Completion. Goal achievement, step completion, error recovery.
  • Safety and Compliance. Harmful content detection, policy adherence, personally identifiable information (PII) handling.
  • Voice Quality. Pronunciation accuracy, latency, naturalness, interruption handling.
  • Retrieval Quality. Context relevance, retrieval precision, chunk attribution.
  • Multi-turn Consistency. Context retention, contradiction detection, topic tracking.
  • Escalation Handling. Detection accuracy, handoff quality, context preservation.
  • Multilingual Quality. Translation accuracy, cultural appropriateness, code-switching.
  • Efficiency. Response latency, token usage, cost per conversation.

Attach a group to a simulation and run it. Use it as-is for a fast baseline, or adjust it later as you develop a more nuanced quality bar.

Who it’s for

Teams new to evaluation setup who want a sensible starting point, and teams that want a standard baseline they can layer custom evaluations on top of.

Fix My Agent for Chat

Fix My Agent now works on chat-based agents with text-specific failure-mode detection.

What’s new

  • Chat-specific diagnostics. The diagnostic engine understands chat-native failure modes: context window overflow, system prompt drift in long conversations, formatting inconsistencies across response types.
  • Same structured output. Root cause analysis, impact ranking, and specific fix suggestions, adapted for text interactions.
  • Naming unification. The “Optimize My Agent” surface was renamed to “Fix My Agent” so the diagnostic and the optimiser live under one product name.

Who it’s for

Teams shipping text-based chat agents who want the same diagnostic workflow voice-agent teams have been using.

Chat Simulation V1 Launch Polish

The remaining launch pieces for Chat Simulation V1 land together: the persona section, chat logs with inline traces and attributes, the evaluation mapping flow, and the status and UI hygiene work that ties them into one product. Everything needed to run Chat Simulation end to end is now in place.

Agent Prompt Optimiser on Platform

The Agent Prompt Optimiser is now accessible directly from the platform UI. Pick a strategy, choose target calls, run optimisation. No API code required.

Additional Improvements

Domain-level metrics and human comparison summary. Simulation runs roll up metrics by domain (support, sales, onboarding, whatever you’ve tagged) and show a side-by-side summary against human-handled conversations, so you can see where the agent is on par and where it lags.
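
To make the roll-up concrete, here is a minimal sketch of grouping scores by domain and comparing agent-handled against human-handled conversations. The record shape (`domain`, `handler`, `score`) and the function name `rollup_by_domain` are assumptions for illustration, not the platform's actual schema or API.

```python
from collections import defaultdict
from statistics import mean


def rollup_by_domain(records):
    """Average scores per domain for agent and human handlers.

    `records` is a hypothetical list of dicts with keys
    "domain", "handler" ("agent" or "human"), and "score".
    """
    buckets = defaultdict(lambda: {"agent": [], "human": []})
    for record in records:
        buckets[record["domain"]][record["handler"]].append(record["score"])

    summary = {}
    for domain, scores in buckets.items():
        agent = mean(scores["agent"]) if scores["agent"] else None
        human = mean(scores["human"]) if scores["human"] else None
        # A positive gap means the agent outscored humans in this domain.
        gap = agent - human if agent is not None and human is not None else None
        summary[domain] = {"agent": agent, "human": human, "gap": gap}
    return summary


# Illustrative data only; these scores are made up.
records = [
    {"domain": "support", "handler": "agent", "score": 0.8},
    {"domain": "support", "handler": "human", "score": 0.9},
    {"domain": "sales", "handler": "agent", "score": 0.7},
    {"domain": "sales", "handler": "human", "score": 0.6},
]
summary = rollup_by_domain(records)
```

With this made-up data, the agent lags humans on support and leads on sales, which is the kind of side-by-side reading the summary is meant to support.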

Agent prompt conformance evaluation. New evaluation metric that measures how closely an agent follows its prompt instructions.

Additional LLM models. Latest releases from each major provider are now available in evaluation and prompt surfaces, so new models become testable without waiting on a separate Future AGI release cycle.

Unified chat message roles. Message roles (user, assistant, system, tool) and dashboard labels match across every chat surface, so a conversation captured in Observe reads the same in Simulate and the Workbench.
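
One way to picture a unified role set is a single enum that every surface maps onto. The sketch below uses the four roles named above; the alias table and the `normalize_role` helper are hypothetical and only illustrate how legacy labels could be folded into one vocabulary.

```python
from enum import Enum


class Role(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"
    SYSTEM = "system"
    TOOL = "tool"


# Hypothetical legacy labels; the real surfaces may have used others.
_ALIASES = {
    "human": Role.USER,
    "customer": Role.USER,
    "ai": Role.ASSISTANT,
    "bot": Role.ASSISTANT,
    "function": Role.TOOL,
}


def normalize_role(label: str) -> Role:
    """Map a raw role label to the unified Role enum.

    Raises ValueError for labels that are neither a unified role
    nor a known alias, so unknown roles fail loudly.
    """
    key = label.strip().lower()
    if key in _ALIASES:
        return _ALIASES[key]
    return Role(key)
```

Failing on unknown labels rather than defaulting to `user` is a deliberate choice in this sketch: a silent default would hide exactly the kind of mismatch that unified roles are meant to remove.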

Call analytics drawer: transcripts + manual graph creation. The call analytics drawer now exposes a transcripts view and a manual graph-creation flow alongside the existing call-level metrics.

Dynamic model parameter updates. Model parameters refresh automatically based on provider API capabilities, so no manual updates are needed when a provider adds a model variant.

Audio content validation. Audio inputs are validated for format and quality before submission to audio models.
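
As a rough illustration of pre-submission validation, the sketch below checks a WAV payload with Python's standard `wave` module. The accepted sample rates, the 16-bit requirement, and the minimum duration are assumptions chosen for the example; the platform's actual checks are not described here.

```python
import io
import wave

# Assumed thresholds for illustration only.
ALLOWED_SAMPLE_RATES = {8000, 16000, 24000, 44100, 48000}
MIN_DURATION_S = 0.1


def validate_wav(data: bytes) -> list[str]:
    """Return a list of problems found in a WAV payload (empty if none)."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
            width = wav.getsampwidth()
    except (wave.Error, EOFError) as exc:
        # Not parseable as WAV at all: a format failure.
        return [f"not a readable WAV file: {exc}"]

    problems = []
    if rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"unsupported sample rate: {rate}")
    if width != 2:
        problems.append(f"expected 16-bit samples, got {8 * width}-bit")
    duration = frames / rate if rate else 0.0
    if duration < MIN_DURATION_S:
        problems.append(f"audio too short: {duration:.3f}s")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue with an upload in a single pass.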

Manage replay sessions through the API. Programmatic control of replay sessions, useful for wiring replay regression checks into CI/CD pipelines.
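
A CI regression check built on replay sessions could look roughly like the sketch below. It deliberately does not call the replay API, whose endpoints are not covered here. It assumes the per-session scores have already been fetched into plain dicts, and the names `find_regressions`, `exit_code`, and the tolerance value are all hypothetical.

```python
def find_regressions(baseline, replay, tolerance=0.02):
    """Compare replay scores to baseline scores, keyed by session ID.

    A session regresses if its replay score is missing or falls more
    than `tolerance` below the baseline. Returns a dict mapping
    session ID to (baseline_score, replay_score).
    """
    regressions = {}
    for session_id, base_score in baseline.items():
        new_score = replay.get(session_id)
        if new_score is None or new_score < base_score - tolerance:
            regressions[session_id] = (base_score, new_score)
    return regressions


def exit_code(baseline, replay, tolerance=0.02):
    """Return 0 if no session regressed, 1 otherwise (for use with sys.exit)."""
    regressions = find_regressions(baseline, replay, tolerance)
    for session_id, (base, new) in sorted(regressions.items()):
        print(f"regression in {session_id}: baseline={base} replay={new}")
    return 1 if regressions else 0
```

In a pipeline step, the script would end with `sys.exit(exit_code(baseline, replay))` so that any regression fails the build.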

Streamlined persona management in scenarios. Assign and swap personas with a simplified inline interface. Edit persona info directly from scenario columns.

Simulation status visibility. Stage-level progress indicators (scenario generation, conversation execution, evaluation scoring) update in real time.

API key management. Delete API keys from the dashboard.

Edit Run Experiment. Edit experiment configuration mid-run (swap a model, change a threshold) without restarting. Configuration history is tracked per data point so the audit trail stays clean.

Deploy branch on dev. Engineers can deploy a branch to dev from the UI.