Guides

12 Metrics for AI Conversation Monitoring in 2026: The Five-Axis Catalog

Q: Why twelve metrics across five axes instead of one CSAT score?

CSAT is a lagging, low-signal aggregate. It tells you the customer was unhappy; it does not tell you whether the failure was a hallucinated policy, a missed escalation, a tone slip, a memory loss in turn six, or a tool-call that doubled cost. The five-axis frame (coherence, resolution, safety, cost, adaptation) keeps those failure modes orthogonal. When the dashboard shows ConversationCoherence sliding while ConversationResolution holds, you know context is leaking between turns even though the user still got their answer. When PII leak rate climbs while Groundedness is steady, you know retrieval is fine but redaction broke. Aggregate the twelve into a single 'agent health' score and you lose the diagnostic. The metrics are useful because they decompose, not because they roll up.

Q: Which of the twelve gate a release and which trend weekly?

Three are CI-gate metrics that block a deploy on regression: TaskCompletion (FAGI eval_id 99), Hallucination rate (Groundedness inverse, eval_id 47), and PII leak rate (DataPrivacyCompliance fails, eval_id 22). Four are SLO metrics that page on breach in production: Latency p95, Refusal accuracy on the negative test set, Escalation accuracy on the sensitive-intent slice, and ConversationResolution (eval_id 2). Five are weekly KPIs reviewed in the product meeting: ConversationCoherence (eval_id 1), ContextRetention, Tone consistency, Cost per resolved conversation, and CSAT correlation. The split is deliberate: pages cover outages and harm, gates cover release safety, weekly trends cover product drift. Mixing tiers turns the dashboard into noise.

Q: What is the difference between ConversationResolution and TaskCompletion?

They score the same conversation from two angles. TaskCompletion (eval_id 99) is the agent-side judgment: the system prompt declared a goal, the rubric scores whether the transcript executed it. ConversationResolution (eval_id 2) is the user-side judgment: did the human leave with their actual problem closed. They diverge when the agent finished its declared task but the user still walked away frustrated, or when the agent never closed its checklist but the user got what they came for from a partial answer. Run both. Disagreements between them are the highest-signal cluster in the Error Feed because they surface either a misaligned goal definition or a flawed retrieval path the agent then papered over.

Q: How do I measure ContextRetention without ground truth?

ContextRetention is the turn-to-turn memory question: in turn six, did the agent still know what the user said in turn one. FAGI ships CustomerAgentContextRetention as a dedicated cloud rubric that scores the transcript holistically. The cheap proxy is to bake a synthetic checkpoint into your test conversations (a name, an order id, a preference) and assert the agent uses it correctly downstream. In production, the rubric runs end-of-session over the full transcript and emits a 1-5 score plus the offending turn. Pair it with ConversationCoherence (eval_id 1), which catches the broader 'did the conversation make sense across turns' question. ContextRetention is the narrower probe; coherence is the wider one.

Q: How do I track Cost per resolved conversation if the agent uses three different models?

Tag every span with the resolution outcome (resolved, escalated, abandoned) and the model id. The denominator is the count of resolved sessions; the numerator is total token spend across every model and every span in that session, including retries, retrieval, guardrail evaluations, and the LLM-judge rubric scoring itself. The right unit is cost-per-resolution, not cost-per-call: a cheap call that fails and re-routes is more expensive than an expensive call that closes the ticket. With traceAI the per-span token attributes feed a SQL view; with Agent Command Center the gateway tracks token totals per virtual key, so a single dashboard panel reads the dollar figure directly. Reset baselines after every model swap.

Q: Why is CSAT only a proxy in this framework?

CSAT correlates with whether the user was helped, but it is a survey response with a 5-15% reply rate, a self-selection bias toward extremes, and a 24-72 hour lag. Treat it as one signal, not the truth. The eleven other metrics are higher-fidelity, faster, and decomposable; CSAT is the slow business validator that confirms the other eleven are pointing the right way. When CSAT drifts opposite to TaskCompletion + ConversationResolution, something in the metric definitions has decoupled from the customer's actual experience, and that is the alarm worth investigating. The proxy framing is what stops teams from optimizing for CSAT directly, which always backfires because the agent learns to be liked rather than to be useful.

Q: How does FAGI wire these twelve into the same trace?

Every conversation is a span tree in traceAI. The Latency p95 metric reads root-span duration. Cost per resolved conversation reads the sum of token-cost attributes on every child span joined to the resolution outcome tag. The nine semantic metrics run as ai-evaluation rubrics scoring the captured transcript, either in CI against a fixed dataset or against a sampled production canary via EvalTag. PII leak rate is the inverse of DataPrivacyCompliance pass-rate (eval_id 22) running both inline as a Protect guardrail (block) and end-of-session as an audit rubric (catch). Error Feed clusters the failing traces into named issues with auto-written root cause, evidence quotes, and an immediate_fix. One span tree, three measurement paths, twelve dashboard panels.

Twelve metrics across five axes for monitoring conversational agents in production: coherence, resolution, safety, cost, adaptation. Wiring included.

April 9, 2026

Updated May 20, 2026

15 min read

conversation-monitoring agent-evaluation observability llm-as-judge production 2026

Most production conversation monitoring is one CSAT score and a latency dashboard. That tells you the agent is alive and that some users are unhappy. It does not tell you whether yesterday’s regression came from a memory leak in turn six, a tone slip on refunds, a hallucinated policy, a missed escalation, or a tool retry that doubled cost. Aggregate the answer into a single number and you lose the diagnostic.

The opinion this post earns: conversation monitoring decomposes into five orthogonal axes, and twelve metrics is the minimum vocabulary to debug what is breaking. Coherence (did the conversation make sense across turns), Resolution (did the user leave with their problem closed), Safety (did the agent refuse, redact, and escalate where it should), Cost (did the resolution land within budget), Adaptation (do the numbers move with the customer, not the engineer). Drop an axis and a class of failure goes dark. Roll the axes into one composite score and the dashboard starts lying.

This guide is the catalog. Twelve named rubrics, the FAGI eval template id for each, when to gate-page-or-trend on it, threshold guidance, and the wiring through traceAI and the ai-evaluation SDK that lets one span tree feed every panel.

The five-axis frame

Five axes, twelve metrics. Each axis covers one class of failure the others cannot see.

Axis	What it answers	Metrics on this axis
Coherence	Did the conversation make sense across turns?	ConversationCoherence, ContextRetention, Tone consistency
Resolution	Did the user leave with their goal met?	ConversationResolution, TaskCompletion, Escalation accuracy
Safety	Did the agent refuse, redact, and stay grounded?	Hallucination rate, PII leak rate, Refusal accuracy
Cost	Did the resolution land inside the unit budget?	Latency p95, Cost per resolved conversation
Adaptation	Are the numbers tracking the user, not the engineer?	CSAT correlation (proxy)

The grid is asymmetric on purpose. Coherence, Resolution, and Safety each carry three metrics because each has multiple distinct failure modes a single rubric cannot cover. Cost carries two because latency and dollars are the only operational unit budgets that matter. Adaptation carries one because CSAT correlation is the only durable proxy for whether the other eleven are still measuring the right thing.

Three rules govern the catalog:

Three measurement paths per metric. Span attribute (latency, cost), inline guardrail (PII, prompt-injection at request time), or end-of-session rubric (everything else). Most metrics land on the rubric path.
Three duty cycles. CI gates block a deploy on regression. SLOs page the on-call on breach. KPIs trend in the weekly product review. Every metric is exactly one of the three.
One span tree, every metric. No second datastore. traceAI captures the conversation; rubric scores attach as span attributes; the dashboard reads everything off the same tree.

Duty	Metrics
CI gate (block deploy on regression)	TaskCompletion, Hallucination rate, PII leak rate
SLO (page on breach)	Latency p95, Refusal accuracy, Escalation accuracy, ConversationResolution
Weekly KPI (trend in product review)	ConversationCoherence, ContextRetention, Tone consistency, Cost per resolved conversation, CSAT correlation

That split is the contract the rest of this post enforces.

Axis 1: Coherence

Coherence asks the question that single-turn evals cannot: did the conversation hold together across the full session. Three metrics, all FAGI rubric scores, all weekly KPIs because none of them carry the kind of hard breach signal that justifies a page.

1. ConversationCoherence (eval_id 1)

What it measures. Whether the agent’s turns flow logically from each other and from the user’s turns. Catches contradictions across turns, abrupt topic shifts the user did not signal, and the “agent forgot what it just said” failure that breaks trust faster than any single bad answer.

Duty. Weekly KPI. End-of-session rubric over the full transcript. Set the threshold floor; review when it drops more than three points week-over-week.

How to wire it.

from fi.evals import Evaluator, ConversationCoherence

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
result = evaluator.evaluate(
    eval_templates=[ConversationCoherence()],
    inputs=[{"conversation": session_transcript}],
)
coherence = result.eval_results[0].metrics[0].value

The same template runs as an EvalTag on every captured trace via traceAI, so the production canary scores match the CI baseline.

Threshold. Mean coherence score above 4.0 on the 1-5 scale. Below 3.5 means turn-to-turn drift is regressing user experience even when individual answers look right.

2. ContextRetention

What it measures. Did the agent still know in turn six what the user said in turn one. The long-horizon memory probe. FAGI ships CustomerAgentContextRetention as the dedicated cloud rubric; it scores the full transcript and flags the turn where context was dropped. For agents backed by a persistent memory store, evaluating agent memory systems covers the deeper failure modes.

Duty. Weekly KPI. Pair with synthetic checkpoint conversations in CI (bake a name, an order id, and a preference into early turns, assert the agent uses them correctly in later turns) for the gated regression view.

Threshold. Score above 4.0 on the rubric. Median agent-position-of-failure should sit past turn seven; if the agent loses context before turn five, the prompt or the context window is the bottleneck, not the rubric.

3. Tone consistency

What it measures. Whether the agent’s voice stays steady across the session. Friendly turn one, terse turn three, formal turn five reads as a different agent every time and erodes trust. FAGI’s Tone (eval_id 16) scores conformance to a target tone descriptor; pair with CustomerAgentLanguageHandling for multilingual conversations where the failure is code-switching mid-conversation.

Duty. Weekly KPI. Sample 5-10% of production sessions; the score is too stable per session to need 100% coverage.

Threshold. Variance across turns inside one session matters more than the mean. A session with a tone score of 4.8 average but a 1.5 floor on one turn is the dashboard alarm; a steady 4.0 everywhere is fine.

Axis 2: Resolution

Resolution is whether the agent did its job. Two metrics scored from opposite sides of the conversation plus one that scores the handoff.

4. ConversationResolution (eval_id 2)

What it measures. From the user’s side: did the human leave with their problem closed. The user-side close metric.

Duty. SLO. Pages on a 1-hour rolling window below threshold. The cleanest outcome signal on the grid because it scores from the perspective of the buyer, not the agent.

How to wire it.

from fi.evals import Evaluator, ConversationResolution

result = evaluator.evaluate(
    eval_templates=[ConversationResolution()],
    inputs=[{"conversation": session_transcript}],
)
resolution = result.eval_results[0].metrics[0].value

Threshold. Mean score above 4.0 (or pass-rate above 80% if you binarize at 4). Page on a 1-hour rolling window dropping more than 5 points from baseline.

5. TaskCompletion (eval_id 99)

What it measures. From the agent’s side: the system prompt declared a goal; did the transcript execute it. The agent-side goal metric.

Duty. CI gate. Block the deploy if the score on the held-out test set drops more than two points from the prior release. Also runs on every production session as an SLI; pair the gate view with the production trend so a model swap that passes CI but tanks production gets caught fast.

Why both ConversationResolution and TaskCompletion. They disagree on the most interesting failures. The agent finished its checklist but the user walked away unhappy (resolution low, completion high) means the goal definition is wrong. The agent never closed its checklist but the user got what they came for (resolution high, completion low) means the agent improvised past a step the spec assumed mattered. Both directions get clustered into the Error Feed automatically.

Threshold. Above 85% pass-rate on the gold dataset for CI. Above 75% on the production trend for the SLI view.

6. Escalation accuracy

What it measures. Did the agent route the sensitive turn to a human, refuse to answer when it should have refused, and not auto-resolve a billing-sensitive ticket. The confusion-matrix metric. The right framing is a 5x5 matrix on a taxonomy that the support lead signed off on (in-scope answer, in-scope escalate, out-of-scope refuse, ambiguous clarify, billing-sensitive route-to-human); accuracy on the sensitive slice is the page-worthy slice. See How to build and evaluate a customer support chatbot for the full taxonomy.

Duty. SLO. The metric is the per-tier accuracy floor; pages fire when the billing-sensitive accuracy drops below the floor in a rolling window.

How to wire it. CustomerAgentHumanEscalation is the FAGI cloud rubric; deterministic tier-emission checks (did the agent output the right structured tier label) gate the CI suite alongside.

Threshold. 99% accuracy on billing-sensitive and PCI-adjacent tiers. 95% on out-of-scope refuse. The other tiers above 90% as a release floor. False negatives on the sensitive tiers are the harm metric; weight them 5x in the audit.

Axis 3: Safety

Safety covers the three classes of harm that production conversation agents emit and need to suppress: groundedness failures, privacy violations, and helpful-when-it-should-refuse failures. Each carries a hard threshold because each is a regulator-facing risk.

7. Hallucination rate (Groundedness inverse, eval_id 47)

What it measures. Fraction of agent turns making claims not supported by retrieved context, declared knowledge, or tool output. The inverse of Groundedness because most teams want to alarm on the failure tail, not on the success rate. The methods that actually detect hallucinations cover the scoring side in depth.

Duty. CI gate. Block deploy if hallucination rate on the gold set climbs more than one point. Also runs end-of-session in production as a quality SLI.

How to wire it.

from fi.evals import Evaluator, Groundedness

result = evaluator.evaluate(
    eval_templates=[Groundedness()],
    inputs=[{
        "input": user_query,
        "output": agent_answer,
        "context": retrieved_chunks,
    }],
)
groundedness = result.eval_results[0].metrics[0].value
hallucination_rate = 1 - groundedness

For RAG-heavy conversations, pair Groundedness with ContextAdherence (eval_id 11) and ChunkAttribution (eval_id 12); the three together separate “wrong because retrieval surfaced the wrong chunk” from “wrong because the agent ignored the right chunk” from “wrong because the agent invented something not in the chunks at all.”

Threshold. Hallucination rate under 3% on the gold dataset for CI. Under 7% on the production trend for the SLI. Higher than 7% means the retriever or the generator is broken and the dashboard is no longer the right artifact; reach for the trace.

8. PII leak rate (DataPrivacyCompliance fails, eval_id 22)

What it measures. Fraction of agent turns that emit personally identifiable information that should have been redacted or never spoken aloud. Names, emails, account numbers, payment data, government identifiers. The metric is the inverse of DataPrivacyCompliance pass-rate.

Duty. CI gate and inline guardrail. The CI gate runs the rubric end-of-session on a fixed dataset; the inline guardrail runs Protect at request time and blocks before the user sees the leak.

How to wire it. Two paths. Inline: Future AGI Protect runs four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) at 65 ms text and 107 ms image median time-to-label per arXiv 2510.13351; a fail blocks the response. End-of-session: DataPrivacyCompliance runs as an audit rubric and the fail count feeds the dashboard.

Threshold. Inline block rate of zero false negatives on the sensitive entity set (account ids, payment data, ssn-class identifiers). End-of-session audit rate of less than 0.1% on the production sample. Anything higher and the guardrail config or the redaction layer is leaking.

9. Refusal accuracy (AnswerRefusal, eval_id 88)

What it measures. When the agent should refuse (out-of-scope question, jailbreak attempt, request beyond authority), did it refuse. When it should answer, did it not over-refuse and frustrate the user. Two-sided: false-negative refusal (answered when should have refused) is a safety failure, false-positive refusal (refused when should have answered) is a UX failure. Both go on the dashboard.

Duty. SLO on the refusal-required slice (red-team prompts, jailbreaks, scope-violation requests). KPI on the over-refusal slice.

How to wire it. AnswerRefusal (eval_id 88) is the cloud rubric. Run it on a fixed negative test set in CI (50-200 prompts the agent must refuse), then on a production sample for drift.

Threshold. 98% accuracy on the must-refuse set. Over-refusal rate under 3% on the broad production sample. A 99% refusal rate looks safe but is usually the dashboard alarm: the agent is refusing too much and burning trust.

Axis 4: Cost

Cost is the unit-economics axis. Two metrics: time and money. Both are operational, neither is a quality judgment, but they bound whether the agent can ship at all.

10. Latency p95 (end-to-end turn)

What it measures. Wall-clock time from user-end-of-turn to first agent response token. The only universally trusted SLO in conversation.

Duty. SLO. Page on a 5-minute rolling window above threshold. Always p95, never the mean; the mean hides the 1-in-20 user who waits four seconds and hangs up.

How to wire it. traceAI captures span timings natively. The root conversation span has duration_ms; child spans break down LLM (llm.duration_ms), retrieval (retrieval.duration_ms), tool (tool.duration_ms). The dashboard panel reads p95(turn.duration_ms) over a 5-minute rolling window.

Threshold.

Workload	p95 turn target
Text chat, no tool calls	< 1.2 s
Single tool call	< 2.0 s
RAG + single tool	< 3.0 s
Voice (cascaded STT/LLM/TTS)	< 800 ms
Voice (speech-to-speech)	< 500 ms

11. Cost per resolved conversation

What it measures. Total token spend per closed session, denominated against resolution outcome. The right unit is dollars-per-resolution, not dollars-per-call. A cheap call that fails and re-routes is more expensive than a costlier call that closed the ticket.

Duty. Weekly KPI. Reset baseline after every model swap because the cost curve shifts.

How to wire it. Tag every span with the resolution outcome (resolved, escalated, abandoned) and the model id. With traceAI the per-span token attributes feed a SQL view. With Agent Command Center, the gateway tracks token totals per virtual key, so the dashboard panel reads the dollar figure off the gateway’s Prometheus surface directly: agentcc_cost_total divided by the resolved-session count.

Threshold. Workload-specific. The right floor is your previous model’s cost-per-resolution minus your committed efficiency gain; the alarm is the week the figure drifts above that floor by more than 15%.

Axis 5: Adaptation

The adaptation axis is the smallest because it does only one job: confirm the other eleven metrics still track the customer. Goodhart’s law catches every metrics framework eventually; CSAT correlation is the canary.

12. CSAT correlation (proxy)

What it measures. The Pearson or Spearman correlation between your aggregate eval score (TaskCompletion plus ConversationResolution, equally weighted, is a defensible default) and the survey-collected CSAT score on the same session. Not CSAT itself; the correlation between the dashboard’s verdict and the customer’s verdict.

Duty. Weekly KPI. The number that says whether the other eleven metrics are still measuring what the customer cares about.

How to wire it. Capture the CSAT response as a span attribute on the session root when the survey reply lands (usually 24-72 hours later via webhook). The dashboard panel reads the rolling 4-week correlation. When the correlation drifts toward zero, the metrics have decoupled from the user; review the rubric definitions before the next deploy.

Threshold. Correlation above 0.45 on the 4-week window is a working dashboard. Below 0.30 means the eval suite and the customer have started disagreeing; that is the alarm worth investigating, not a CSAT dip in isolation.

The CSAT framing is deliberately a proxy, not a target. Optimize for CSAT directly and the agent learns to be liked rather than to be useful. Optimize for the eleven decomposed metrics and audit the correlation, and the optimization stays honest.

Wiring the twelve into traceAI and the Error Feed

One span tree feeds every metric. Three measurement paths produce the values.

Measurement path	Metrics	FAGI surface
Span attribute (transport)	Latency p95, Cost per resolved conversation	traceAI native
Inline guardrail (request-time block)	PII leak rate (block path), prompt-injection adjacency on Escalation	Protect
End-of-session rubric (semantic)	The other nine	ai-evaluation `EvalTag` on the trace

from fi_instrumentation import FITracer, TracerProvider, BatchSpanProcessor, HTTPSpanExporter
from fi_instrumentation.fi_types import EvalTag, EvalTagType, EvalSpanKind, EvalName, ModelChoices

trace_provider = TracerProvider(
    project_name="support-agent-prod",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.GROUNDEDNESS,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value", "context": "context.value"},
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.AGENT,
            eval_name=EvalName.CONVERSATION_RESOLUTION,
            model=ModelChoices.TURING_LARGE,
            mapping={"conversation": "input.value"},
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.AGENT,
            eval_name=EvalName.TASK_COMPLETION,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
    ],
)
trace_provider.add_span_processor(BatchSpanProcessor(HTTPSpanExporter()))

Sample 5-10% of production traffic for the LLM-judge rubrics; deterministic checks (PII regex, refusal-set pass, escalation-tier emission) run on 100%. The expensive rubrics (the three on the Coherence axis, the two on Resolution beyond TaskCompletion, the CSAT correlation join) run async via evaluator.submit() so they do not sit on the request path. The cheap deterministic checks run inline.

The Error Feed clusters the failing traces. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing conversations into named issues. A Claude Sonnet 4.5 Judge agent investigates each cluster (30-turn budget, 8 span-tools, a Haiku Chauffeur sub-agent for large spans, prompt-cache hit ratio near 90%) and writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). When PII leak rate breaches at the same hour Hallucination rate climbs, the cluster reads “redaction layer dropped after deploy 2026.05.18” rather than firing two separate pages.

How Future AGI ships the twelve

The eval stack ships as a package. The opinion is that conversation monitoring is the use case that justifies the package because every metric needs every layer.

ai-evaluation (Apache 2.0): 60+ EvalTemplate classes including ConversationCoherence, ConversationResolution, TaskCompletion, Groundedness, DataPrivacyCompliance, AnswerRefusal, Tone, plus the 11 CustomerAgent* templates (CustomerAgentContextRetention, CustomerAgentHumanEscalation, CustomerAgentLanguageHandling, CustomerAgentConversationQuality, CustomerAgentLoopDetection, CustomerAgentTerminationHandling, CustomerAgentQueryHandling, CustomerAgentClarificationSeeking, CustomerAgentObjectionHandling, CustomerAgentInterruptionHandling, CustomerAgentPromptConformance). Run sync, async via evaluator.submit(), or on the trace via EvalTag.
Future AGI Platform: self-improving evaluators tuned by support-lead thumbs feedback; in-product authoring agent writes domain-specific rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
traceAI (Apache 2.0): OpenTelemetry SDK, 50+ AI surfaces across Python, TypeScript, Java, C#; auto-instrumentation for OpenAI, LangChain, LangGraph, Groq, Portkey, Gemini; 14 span kinds with first-class RETRIEVER, TOOL, and AGENT; PII redaction on span attributes.
Protect: four Gemma 3n LoRA adapters plus Protect Flash; 65 ms text / 107 ms image median time-to-label per the Protect paper; inline guardrail at the gateway.
Error Feed: HDBSCAN soft-clustering plus the Sonnet 4.5 Judge writes immediate_fix; clusters span the five axes so cross-axis failures (PII climbs while hallucination rises) surface as one issue, not two pages.
Agent Command Center: OpenAI-compatible gateway in a single Go binary, 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters; SOC 2 Type II, HIPAA, GDPR, CCPA certified per futureagi.com/trust (ISO/IEC 27001 in active audit); the gateway is where Cost per resolved conversation reads from and where the PII inline block fires.

Three honest tradeoffs. The full twelve-metric rubric stack on 100% of traffic is expensive; sample the LLM-judge rubrics at 5-10% and gate the deterministic ones at 100%. The Protect inline hop is not free; the 65 ms text latency is acceptable for chat but tight for voice speech-to-speech, where the Flash binary classifier is the right path. CSAT correlation is a 4-week lagging signal; do not expect a survey response in the same dashboard refresh as the trace.

Ready to wire the twelve? Start with ConversationResolution, TaskCompletion, and Groundedness in a pytest fixture against the ai-evaluation SDK, then add EvalTag instrumentation on production traces through traceAI once the CI baseline is set. Layer Protect inline and Error Feed on the cluster view as the volume comes in.

Frequently asked questions

Why twelve metrics across five axes instead of one CSAT score?

CSAT is a lagging, low-signal aggregate. It tells you the customer was unhappy; it does not tell you whether the failure was a hallucinated policy, a missed escalation, a tone slip, a memory loss in turn six, or a tool-call that doubled cost. The five-axis frame (coherence, resolution, safety, cost, adaptation) keeps those failure modes orthogonal. When the dashboard shows ConversationCoherence sliding while ConversationResolution holds, you know context is leaking between turns even though the user still got their answer. When PII leak rate climbs while Groundedness is steady, you know retrieval is fine but redaction broke. Aggregate the twelve into a single 'agent health' score and you lose the diagnostic. The metrics are useful because they decompose, not because they roll up.

Which of the twelve gate a release and which trend weekly?

Three are CI-gate metrics that block a deploy on regression: TaskCompletion (FAGI eval_id 99), Hallucination rate (Groundedness inverse, eval_id 47), and PII leak rate (DataPrivacyCompliance fails, eval_id 22). Four are SLO metrics that page on breach in production: Latency p95, Refusal accuracy on the negative test set, Escalation accuracy on the sensitive-intent slice, and ConversationResolution (eval_id 2). Five are weekly KPIs reviewed in the product meeting: ConversationCoherence (eval_id 1), ContextRetention, Tone consistency, Cost per resolved conversation, and CSAT correlation. The split is deliberate: pages cover outages and harm, gates cover release safety, weekly trends cover product drift. Mixing tiers turns the dashboard into noise.

What is the difference between ConversationResolution and TaskCompletion?

They score the same conversation from two angles. TaskCompletion (eval_id 99) is the agent-side judgment: the system prompt declared a goal, the rubric scores whether the transcript executed it. ConversationResolution (eval_id 2) is the user-side judgment: did the human leave with their actual problem closed. They diverge when the agent finished its declared task but the user still walked away frustrated, or when the agent never closed its checklist but the user got what they came for from a partial answer. Run both. Disagreements between them are the highest-signal cluster in the Error Feed because they surface either a misaligned goal definition or a flawed retrieval path the agent then papered over.

How do I measure ContextRetention without ground truth?

ContextRetention is the turn-to-turn memory question: in turn six, did the agent still know what the user said in turn one. FAGI ships CustomerAgentContextRetention as a dedicated cloud rubric that scores the transcript holistically. The cheap proxy is to bake a synthetic checkpoint into your test conversations (a name, an order id, a preference) and assert the agent uses it correctly downstream. In production, the rubric runs end-of-session over the full transcript and emits a 1-5 score plus the offending turn. Pair it with ConversationCoherence (eval_id 1), which catches the broader 'did the conversation make sense across turns' question. ContextRetention is the narrower probe; coherence is the wider one.

How do I track Cost per resolved conversation if the agent uses three different models?

Tag every span with the resolution outcome (resolved, escalated, abandoned) and the model id. The denominator is the count of resolved sessions; the numerator is total token spend across every model and every span in that session, including retries, retrieval, guardrail evaluations, and the LLM-judge rubric scoring itself. The right unit is cost-per-resolution, not cost-per-call: a cheap call that fails and re-routes is more expensive than an expensive call that closes the ticket. With traceAI the per-span token attributes feed a SQL view; with Agent Command Center the gateway tracks token totals per virtual key, so a single dashboard panel reads the dollar figure directly. Reset baselines after every model swap.

Why is CSAT only a proxy in this framework?

CSAT correlates with whether the user was helped, but it is a survey response with a 5-15% reply rate, a self-selection bias toward extremes, and a 24-72 hour lag. Treat it as one signal, not the truth. The eleven other metrics are higher-fidelity, faster, and decomposable; CSAT is the slow business validator that confirms the other eleven are pointing the right way. When CSAT drifts opposite to TaskCompletion + ConversationResolution, something in the metric definitions has decoupled from the customer's actual experience, and that is the alarm worth investigating. The proxy framing is what stops teams from optimizing for CSAT directly, which always backfires because the agent learns to be liked rather than to be useful.

How does FAGI wire these twelve into the same trace?

Every conversation is a span tree in traceAI. The Latency p95 metric reads root-span duration. Cost per resolved conversation reads the sum of token-cost attributes on every child span joined to the resolution outcome tag. The nine semantic metrics run as ai-evaluation rubrics scoring the captured transcript, either in CI against a fixed dataset or against a sampled production canary via EvalTag. PII leak rate is the inverse of DataPrivacyCompliance pass-rate (eval_id 22) running both inline as a Protect guardrail (block) and end-of-session as an audit rubric (catch). Error Feed clusters the failing traces into named issues with auto-written root cause, evidence quotes, and an immediate_fix. One span tree, three measurement paths, twelve dashboard panels.

View all

Guides

LLM Evaluation Metrics: Everything You Need in 2026

There aren't 50 LLM eval metrics. Three primitive families and eight rubrics matter in production. 2026 reference with CI gate and per-trace eval cascade.

NVJK Kartik · May 5, 2026

12 min

Guides

Best 5 Parea AI Alternatives in 2026

Five Parea AI alternatives scored on eval-catalog depth, logs-capped pricing, optimizer loops, guardrails, and team scale, and what each fixes.

NVJK Kartik · May 21, 2026

17 min

Guides

Best 5 RagaAI Alternatives in 2026

Five RagaAI alternatives scored on eval-judge depth, optimizer loops, gateway and guardrails, self-host ops burden, vendor maturity, and what each fixes.

NVJK Kartik · May 21, 2026

19 min

The five-axis frame

Axis 1: Coherence

1. ConversationCoherence (eval_id 1)

2. ContextRetention

3. Tone consistency

Axis 2: Resolution

4. ConversationResolution (eval_id 2)

5. TaskCompletion (eval_id 99)

6. Escalation accuracy

Axis 3: Safety

7. Hallucination rate (Groundedness inverse, eval_id 47)

8. PII leak rate (DataPrivacyCompliance fails, eval_id 22)

9. Refusal accuracy (AnswerRefusal, eval_id 88)

Axis 4: Cost

10. Latency p95 (end-to-end turn)

11. Cost per resolved conversation

Axis 5: Adaptation

12. CSAT correlation (proxy)

Wiring the twelve into traceAI and the Error Feed

How Future AGI ships the twelve

Related reading

Frequently asked questions