
LLM Eval vs Product Analytics: Two Layers, One Loop (2026)

Product analytics measures user behavior. LLM eval measures system behavior. The 2026 PM and ML engineer guide to keeping them separate and joining them on a shared identifier.

A product analytics dashboard tells you that 12 percent of users abandoned the AI chat after their second message. It will not tell you why. It cannot. Mixpanel, Amplitude, PostHog, and Heap count events; they do not score the quality of an LLM output against a rubric. PMs and ML engineers keep treating this as a tooling gap. It is not. Product analytics measures user behavior. LLM eval measures system behavior. They answer different questions, on different telemetry, with different ground truth. Confuse them and you optimize CTR while shipping a hallucination factory, or chase a faithfulness score on a feature no one finishes. This post is the conceptual map plus the one integration pattern that joins the two without collapsing either.

TL;DR

| Question | Right layer |
| --- | --- |
| Did the user click, finish, retry, retain? | Product analytics |
| Was the answer grounded? Did the agent refuse correctly? | LLM eval |
| Which conversations had bad outcomes? | Product analytics |
| Why did those conversations fail? | LLM eval |
| What did each call cost? | Gateway telemetry |
| Did a rubric drop correlate with a retention drop? | Both, joined on session.id |

One sentence: product analytics is descriptive of users, LLM eval is prescriptive of the system, and the bridge is a shared identifier plus a warehouse join.

The two layers measure different things

The temptation when an AI feature ships is to treat it like a checkout funnel and extend the existing analytics tool. That reflex covers maybe 20 percent of the eval discipline. The other 80 percent (rubric design, golden-set construction, judge calibration, CI gate thresholds, span-attached scores) has no analog in event-based analytics. Naming the mismatches is how you stop trying to force one layer to do the other layer’s job.

Telemetry shape: event vs span tree

Product analytics is an event stream per session. Each event is a flat key-value record (event_name, timestamp, user_id, session_id, a handful of properties). The data model is wide and shallow.

LLM telemetry is a span tree per turn. Each turn has a parent span for the LLM call, often with child spans for retrieval, tool invocations, guardrails, and downstream model calls. Each span carries dozens of attributes: prompt tokens, completion tokens, retrieval context, tool arguments, latency, cost, model name, judge score. A single chat session that produces 12 user-visible events in Mixpanel produces hundreds of OTel spans in traceAI. Flatten the LLM layer into events and you lose the parent-child structure that explains why an answer failed. The span vs trace post covers the data model.
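To make the shape difference concrete, here is a minimal sketch in plain Python. The Span class, the span names, and the attribute values are all illustrative assumptions, not traceAI or OpenTelemetry types. It shows that flattening keeps the attributes but drops the nesting, while a walk over the tree can still report where in the turn a problem occurred.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Illustrative stand-in for a trace span (not an OTel or traceAI class)."""
    name: str
    attributes: dict
    children: list["Span"] = field(default_factory=list)


def flatten(span):
    """Turn a span tree into event-style rows; the parent-child links are lost."""
    rows = [{"name": span.name, **span.attributes}]
    for child in span.children:
        rows.extend(flatten(child))
    return rows


def paths_matching(span, predicate, trail=()):
    """Return the ancestor path of every span for which predicate(span) is true."""
    trail = trail + (span.name,)
    hits = [trail] if predicate(span) else []
    for child in span.children:
        hits.extend(paths_matching(child, predicate, trail))
    return hits


# One chat turn with a retrieval step that returned nothing (made-up values).
turn = Span("chat-turn", {"session.id": "s-1"}, [
    Span("retrieval", {"documents": 0}),
    Span("llm-call", {"prompt_tokens": 812, "completion_tokens": 96}),
])

flat_rows = flatten(turn)
empty_retrievals = paths_matching(turn, lambda s: s.attributes.get("documents") == 0)
```

The flat rows can still be counted and filtered, but only the tree walk can say that the empty retrieval sat under this particular chat turn.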

Ground truth: implicit vs explicit

Product analytics ground truth is implicit. If a user clicks “purchase,” they completed a purchase. The event itself is the truth.

LLM eval ground truth is explicit. You need a labeled golden set (input, expected behavior, retrieval context, expected tool calls) versioned alongside prompts and grown weekly from production traces. Without it, “the answer was good” means whatever the person looking at the answer wanted it to mean. The golden set is the single most under-built primitive in product teams shipping their first AI feature. The LLM evaluation playbook covers dataset construction in depth.
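One way to picture a golden-set record is as a small, versionable data structure. The sketch below is an assumption about shape only: the field names, the JSONL storage choice, and the example content are hypothetical and do not come from any SDK.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class GoldenExample:
    """One labeled example; field names are illustrative, not an SDK schema."""
    example_id: str
    input: str
    expected_behavior: str
    retrieval_context: list = field(default_factory=list)
    expected_tool_calls: list = field(default_factory=list)
    prompt_version: str = "v1"             # version of the prompt this label targets
    source_trace_id: Optional[str] = None  # set when promoted from a production trace


def to_jsonl(examples):
    """Serialize to JSON Lines so the set diffs cleanly in version control."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


def from_jsonl(text):
    """Load examples back from JSON Lines text."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Made-up example content for illustration.
golden_set = [
    GoldenExample(
        example_id="pricing-001",
        input="How much is the Pro plan?",
        expected_behavior="Quotes the price from the retrieved page; invents no discounts.",
        retrieval_context=["Pro plan: $49 per seat per month."],
        prompt_version="support-prompt@v12",
    )
]
round_tripped = from_jsonl(to_jsonl(golden_set))
```

Keeping the file next to the prompt it targets means a prompt change and its label changes can land in the same review.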

Metric definition: math vs rubric

Product metrics are math-derived. Funnel conversion is “events of type B over type A within window W.” DAU is “distinct user_id on day D.” Definition lives in SQL or the metric builder. Minutes of work per metric.
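The two definitions above fit in a few lines of Python. This is a sketch on toy events, assuming a user-level reading of the funnel (a user converts if they fire step B within the window after their first step A); the event names and timestamps are made up.

```python
from datetime import datetime, timedelta


def dau(events, day):
    """Distinct user_id values with at least one event on the given date."""
    return len({e["user_id"] for e in events if e["timestamp"].date() == day})


def funnel_conversion(events, step_a, step_b, window):
    """Share of users with a step_a event who fire step_b within `window` after it."""
    first_a = {}
    for e in sorted(events, key=lambda e: e["timestamp"]):
        if e["event_name"] == step_a:
            first_a.setdefault(e["user_id"], e["timestamp"])
    converted = {
        e["user_id"]
        for e in events
        if e["event_name"] == step_b
        and e["user_id"] in first_a
        and timedelta(0) <= e["timestamp"] - first_a[e["user_id"]] <= window
    }
    return len(converted) / len(first_a) if first_a else 0.0


day = datetime(2026, 1, 5).date()
events = [
    {"user_id": "u1", "event_name": "chat_opened", "timestamp": datetime(2026, 1, 5, 9, 0)},
    {"user_id": "u1", "event_name": "message_sent", "timestamp": datetime(2026, 1, 5, 9, 5)},
    {"user_id": "u2", "event_name": "chat_opened", "timestamp": datetime(2026, 1, 5, 9, 10)},
    {"user_id": "u3", "event_name": "chat_opened", "timestamp": datetime(2026, 1, 5, 10, 0)},
    {"user_id": "u3", "event_name": "message_sent", "timestamp": datetime(2026, 1, 6, 11, 0)},
]

daily_active = dau(events, day)
conversion = funnel_conversion(events, "chat_opened", "message_sent", timedelta(hours=1))
```

Three users are active on the day, and one of the three who opened the chat sent a message within the hour. No labels, no judge, no calibration: the events are the truth.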

LLM metrics are rubric-encoded. Faithfulness is “the response asserts nothing the retrieval context does not support.” Task completion is “the agent fulfilled the user’s intent end-to-end.” Definition lives in a rubric prompt or a classifier, takes hours to days, and requires calibration against human labels. Future AGI’s ai-evaluation SDK ships 60+ EvalTemplate classes (Groundedness, ContextAdherence, Completeness, ChunkAttribution, TaskCompletion, AnswerRefusal, EvaluateFunctionCalling, PromptInjection) that absorb the bulk of the rubric authoring you would otherwise hand-roll.
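Calibration against human labels is the step with no analytics equivalent. One common way to quantify it is Cohen's kappa, which corrects raw agreement for chance. The sketch below is generic Python, not part of the ai-evaluation SDK, and the pass/fail labels are made up.

```python
def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between a judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    labels = set(judge_labels) | set(human_labels)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# 1 = passes the rubric, 0 = fails (illustrative labels for eight responses).
human = [1, 1, 1, 0, 0, 1, 0, 1]
judge = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohens_kappa(judge, human)
```

Raw agreement here is 75 percent, but kappa is noticeably lower because both raters pass most responses. What counts as an acceptable kappa is a team decision; the point is that a rubric metric needs this check and a funnel metric does not.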

Cohort: user property vs query class

In product analytics, a cohort is a user property: country, plan tier, signup week, A/B variant. Stable across the user’s lifetime or changing slowly.

In LLM eval, a cohort is query class plus persona plus intent. The same user can produce traces in five cohorts in one session: factual question about pricing, complaint about billing, request to escalate, follow-up clarification, small talk. Each cohort gets evaluated against a different rubric set. Average them and you get a useless aggregate score. The custom eval metrics playbook covers cohort design.
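A short sketch shows how a blended average hides a failing cohort. The cohort names and scores are made up, and a single score per trace is used for simplicity even though each cohort would normally have its own rubric set.

```python
from collections import defaultdict
from statistics import mean


def scores_by_cohort(rows):
    """Mean score per query-class cohort."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["cohort"]].append(row["score"])
    return {cohort: mean(scores) for cohort, scores in buckets.items()}


# Traces from one session, each tagged with a query-class cohort (toy data).
rows = [
    {"cohort": "pricing_question", "score": 0.90},
    {"cohort": "pricing_question", "score": 0.95},
    {"cohort": "billing_complaint", "score": 0.40},
    {"cohort": "billing_complaint", "score": 0.30},
    {"cohort": "small_talk", "score": 0.95},
]

blended = mean(row["score"] for row in rows)
per_cohort = scores_by_cohort(rows)
```

The blended score of 0.70 looks tolerable, while the per-cohort view shows billing complaints averaging 0.35.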

Action loop: UI vs prompt

When product analytics surfaces a problem, the PM iterates on UI: change the copy, reorder the funnel, change a default, run an A/B test. The artifact is markup.

When LLM eval surfaces a problem, the team iterates on prompts, rubrics, retrieval, and classifiers. The artifact is a prompt version, a rubric version, a chunking strategy, a tool description. The optimization tooling is different too: Future AGI’s agent-opt ships six optimizers, including BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, and PromptWizardOptimizer. The two loops run in parallel; one team should never own both.

Where the two layers connect

The categories stay separate. The seams are where the diagnostic value compounds. Three connections that turn two dashboards into one feedback loop.

Eval signal predicts CSAT

A faithfulness drop today is a thumbs-down tomorrow and a churn signal next month. The lag varies (hours for power users, days for the long tail), but the direction is consistent across the customer agents and RAG deployments we have instrumented: rubric regressions on Groundedness and TaskCompletion lead retention regressions by a measurable window. That window is the actionable one. By the time the product analytics dashboard shows a retention dip, the rubric has been amber for days.

This is why eval is not a research artifact. It is an early-warning surface for the product KPI you actually care about. Wire the rubric drop into the same alerting channel as the retention dip and you act in the lead window.

Conversation ID is the join key

Product analytics knows which sessions ended badly. LLM eval knows why each turn was bad. Neither half tells the whole story alone. The connection is a shared identifier on both sides:

  • Product code emits session_id into Mixpanel, Amplitude, PostHog, or Heap on every event.
  • traceAI sets the same value as the standard session.id OTel attribute on every span.
  • Both streams land in the warehouse. A join on session_id produces a row per conversation with quality, cost, and outcome.
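from fi_instrumentation import register, ProjectType
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace

# Register the project once at startup, then auto-instrument OpenAI client calls.
register(project_type=ProjectType.OBSERVE, project_name="shopping-assistant-prod")
OpenAIInstrumentor().instrument()
tracer = trace.get_tracer(__name__)

def handle_turn(user_id: str, session_id: str, message: str):
    # One parent span per chat turn; instrumented LLM calls nest under it.
    with tracer.start_as_current_span("chat-turn") as span:
        # The same session_id the product code sends to analytics: this is the join key.
        span.set_attribute("session.id", session_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("input.value", message)
        # call_llm is the application's own model-call function (not shown here).
        return call_llm(message)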

The same session_id flows into the message_sent Mixpanel event. One key, two streams, joinable in SQL. Skip this step and every cross-layer analysis becomes a manual cross-reference, which is to say it never happens.
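On the product side, the safest pattern is to derive both sets of keys from one place so they cannot drift. The sketch below is vendor-neutral: conversation_keys and build_event are hypothetical helpers, and the payload shape is an assumption. Pass the resulting properties to whatever track call your analytics SDK provides.

```python
def conversation_keys(user_id, session_id):
    """Build span attributes and event properties from the same identifiers."""
    span_attributes = {"session.id": session_id, "user.id": user_id}
    event_properties = {"session_id": session_id, "user_id": user_id}
    return span_attributes, event_properties


def build_event(event_name, event_properties, **extra):
    """Assemble an analytics event payload (hypothetical shape, not a vendor schema)."""
    return {"event": event_name, "properties": {**event_properties, **extra}}


span_attrs, event_props = conversation_keys(user_id="u-17", session_id="s-2026-0001")
message_sent = build_event("message_sent", event_props, turn_index=2)
```

Because both dictionaries come from one function call, a renamed or reformatted session ID changes on both sides at once.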

Error Feed clusters become a product-analytics dimension

Eval scores by themselves are scalars. Error Feed pulls failing production traces, clusters them with HDBSCAN soft-clustering, and runs a Sonnet 4.5 Judge that writes an immediate_fix string per cluster. The cluster ID becomes a new column on your joined table. Conversations in cluster retrieval-stale-pricing showed a 23 percent retention drop versus baseline last week; cluster tool-arg-format-mismatch correlated with 41 percent third-turn abandonment. Those are joins the product team cannot make without the eval-side clustering, and they are the joins that turn a rubric regression into a prioritized backlog item.
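Once the cluster ID is a column on the joined table, the comparison against baseline is a group-by. The sketch below uses toy rows and assumes unclustered sessions serve as the baseline; the column names are hypothetical, and the numbers are illustrative rather than the figures quoted above.

```python
from collections import defaultdict


def abandonment_by_cluster(rows):
    """Abandonment rate per cluster; sessions without a cluster form the baseline."""
    totals = defaultdict(lambda: [0, 0])  # cluster -> [abandoned sessions, all sessions]
    for row in rows:
        key = row.get("cluster_id") or "baseline"
        totals[key][0] += int(row["abandoned"])
        totals[key][1] += 1
    return {key: abandoned / sessions for key, (abandoned, sessions) in totals.items()}


# Joined rows: one per conversation (toy data).
rows = [
    {"session_id": "s1", "cluster_id": "retrieval-stale-pricing", "abandoned": True},
    {"session_id": "s2", "cluster_id": "retrieval-stale-pricing", "abandoned": True},
    {"session_id": "s3", "cluster_id": "retrieval-stale-pricing", "abandoned": False},
    {"session_id": "s4", "cluster_id": None, "abandoned": False},
    {"session_id": "s5", "cluster_id": None, "abandoned": True},
    {"session_id": "s6", "cluster_id": None, "abandoned": False},
    {"session_id": "s7", "cluster_id": None, "abandoned": False},
]

rates = abandonment_by_cluster(rows)
lift = rates["retrieval-stale-pricing"] - rates["baseline"]
```

With real traffic, the same calculation needs enough sessions per cluster for the difference to mean anything.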

Today the only Error Feed integration is Linear (auto-ticket per cluster with the failing traces, the immediate fix, and affected user count). Slack, Jira, and PagerDuty are on the roadmap. The clustering signal itself ships now.

Per-system pick guide

Every shipping AI feature needs both layers. The weighting differs by system shape. Lean the wrong way and you over-invest in the surface that already has signal while the silent failure mode goes uninstrumented.

| System | Heavier on | Why | Minimum from the other side |
| --- | --- | --- | --- |
| Consumer chat, support bots | Product analytics | User behavior is the dominant outcome signal; abandonment and thumbs-down move daily. | Groundedness, AnswerRefusal, Toxicity on a 5-10% production sample. |
| RAG search and Q&A | Balanced | Retention measures whether users found anything; eval measures whether the answer was right. | Both: per-turn Groundedness and ContextAdherence, plus session-level abandonment and click-through. |
| Internal copilots, dev tools | LLM eval | Funnels are tiny; verifiable correctness is the entire product. | A retention or DAU read on whether anyone is using the feature at all. |
| Agentic workflows (multi-step, tool calls) | LLM eval | Trajectory failures are invisible to event counters; a polite wrong trajectory looks clean in analytics. | Conversion-to-outcome at the task level (did the agent’s plan resolve the user’s goal). |
| Voice and real-time | Both heavy | User patience is short and turns are expensive; both signals matter immediately. | Plus a third layer: latency p95 and barge-in rate from the gateway. |

The pattern: the more verifiable the system, the heavier the eval weight. The more behavioral the user surface, the heavier the analytics weight. Neither replaces the other; the table answers which one to instrument first when you only have one quarter.
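The production sample in the table's first row can be drawn deterministically by hashing the session ID, so every turn of a sampled session is scored and the later per-session join has complete rows. This is a generic sketch, not how any particular SDK samples; in_eval_sample is a hypothetical helper.

```python
import hashlib


def in_eval_sample(session_id, rate):
    """Return True for a stable pseudo-random share of sessions (rate between 0 and 1)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# The same session always gets the same decision, on any machine.
first_decision = in_eval_sample("s-2026-0001", 0.05)
second_decision = in_eval_sample("s-2026-0001", 0.05)

# Across many sessions the sampled share lands close to the requested rate.
sampled_share = sum(in_eval_sample(f"session-{i}", 0.05) for i in range(10000)) / 10000
```

Hashing rather than calling a random number generator means the offline batch job and the online scorer agree on which sessions are in the sample without sharing state.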

The dashboards that need both

Joint dashboards are the deliverable that separates teams that read this post from teams that ship from it. Three tiles, side by side, one view.

Tile 1: quality

Average per-rubric score over the rolling window. Groundedness, ContextAdherence, TaskCompletion, plus the two or three custom rubrics specific to your domain. Trend line, threshold band, alert on sustained drop below threshold. The source is ai-evaluation SDK output piped to the warehouse.

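from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, TaskCompletion
from fi.testcases import TestCase

# Replace the "..." placeholders with your own credentials.
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
# user_message, agent_response, retrieval_context, session_id, and user_id
# come from the turn being scored.
results = evaluator.evaluate(
    eval_templates=[Groundedness(), ContextAdherence(), TaskCompletion()],
    inputs=[TestCase(
        query=user_message,
        response=agent_response,
        context=retrieval_context,
        metadata={"session_id": session_id, "user_id": user_id},
    )],
)
# One warehouse row per rubric, keyed by session_id.
# write_to_warehouse is your own loader (not shown here).
for r in results.eval_results:
    write_to_warehouse(session_id, r.eval_name, r.metrics[0].value)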
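The "alert on sustained drop below threshold" rule for this tile can be sketched as a run-length check over daily averages. The three-day default and the threshold value are assumptions for illustration; pick values that match your own traffic volume and noise.

```python
def sustained_breach(daily_scores, threshold, min_days=3):
    """True if the score stays below threshold for min_days consecutive days."""
    run = 0
    for score in daily_scores:
        run = run + 1 if score < threshold else 0
        if run >= min_days:
            return True
    return False


# Daily average Groundedness (toy values).
noisy_week = [0.90, 0.70, 0.90, 0.70, 0.70]    # dips, but never three days in a row
regressed_week = [0.90, 0.70, 0.70, 0.70]      # three consecutive days under the band

noisy_alert = sustained_breach(noisy_week, threshold=0.80)
regressed_alert = sustained_breach(regressed_week, threshold=0.80)
```

Requiring consecutive days keeps a single noisy sample from paging anyone while still catching a real regression within the run length.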

Tile 2: cost

Per-call USD from the gateway, broken down by route, model, and fallback rate. Most teams underestimate per-call LLM cost by 5 to 10x because they only see the aggregate provider invoice. The Agent Command Center gateway exposes canonical headers on every response (base URL https://gateway.futureagi.com/v1):

  • x-prism-cost: canonical USD for this call
  • x-prism-latency-ms: end-to-end latency
  • x-prism-model-used: model after routing
  • x-prism-fallback-used: true if a fallback fired
  • x-prism-routing-strategy: which routing rule applied
  • x-prism-guardrail-triggered: true if a guardrail blocked

Capture them in the client and emit as event properties on message_received. Now per-call cost lives as a dimension in every Mixpanel, Amplitude, PostHog, or Heap dashboard, and the cost question stops being “what is the bill” and starts being “which user segment is most expensive per resolved conversation.” The agent cost optimization post covers the discipline.
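A minimal sketch of that capture step follows. The header names come from the list above; the event property names are hypothetical, and the value formats (a decimal string for cost, a number for latency, "true" or "false" for the flags) are assumptions to confirm against real gateway responses.

```python
def gateway_event_properties(headers):
    """Map gateway response headers to analytics event properties."""
    h = {key.lower(): value for key, value in headers.items()}
    props = {}
    if "x-prism-cost" in h:
        props["llm_cost_usd"] = float(h["x-prism-cost"])
    if "x-prism-latency-ms" in h:
        props["llm_latency_ms"] = float(h["x-prism-latency-ms"])
    if "x-prism-model-used" in h:
        props["llm_model"] = h["x-prism-model-used"]
    if "x-prism-routing-strategy" in h:
        props["llm_routing_strategy"] = h["x-prism-routing-strategy"]
    for header, name in (
        ("x-prism-fallback-used", "llm_fallback_used"),
        ("x-prism-guardrail-triggered", "llm_guardrail_triggered"),
    ):
        if header in h:
            props[name] = h[header].strip().lower() == "true"
    return props


# Example response headers (made-up values).
response_headers = {
    "X-Prism-Cost": "0.0042",
    "x-prism-latency-ms": "812",
    "x-prism-model-used": "example-model",
    "x-prism-fallback-used": "false",
}
message_received_props = gateway_event_properties(response_headers)
```

Missing headers are skipped rather than defaulted, so an absent cost never shows up in a dashboard as a free call.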

Tile 3: outcome

Conversation resolution rate, abandonment rate, retry rate, thumbs-down rate from your product events. The source is whatever you already ship to Mixpanel, Amplitude, PostHog, or Heap. Nothing changes here except that this tile sits next to the other two instead of in a separate tab.

The join

All three tiles read from one warehouse view:

-- One row per conversation: rubric scores, cost, and outcome flags.
WITH eval_scores AS (
  SELECT
    session_id,
    AVG(score) FILTER (WHERE rubric_name = 'Groundedness') AS groundedness,
    AVG(score) FILTER (WHERE rubric_name = 'TaskCompletion') AS task_completion,
    -- Assumes cost_usd is stored once per call; if it repeats on every rubric row, this sum overcounts.
    SUM(cost_usd) AS total_cost_usd
  FROM llm_eval_scores
  WHERE event_time >= CURRENT_DATE - 7
  GROUP BY session_id
),
outcomes AS (
  SELECT
    session_id,
    user_id,
    -- Cast inside MAX: PostgreSQL has no MAX() aggregate over booleans.
    MAX((event_name = 'conversation_resolved')::int) AS resolved,
    MAX((event_name = 'conversation_abandoned')::int) AS abandoned,
    MAX((event_name = 'feedback_thumbs_down')::int) AS thumbs_down
  FROM product_events
  WHERE event_time >= CURRENT_DATE - 7
  GROUP BY session_id, user_id
)
SELECT *
FROM eval_scores e
-- INNER JOIN keeps only sessions that have eval scores, i.e. the sampled ones.
INNER JOIN outcomes o USING (session_id);

Joint alerts read from the same view. A rubric drop alone is noise. A retention drop alone is a mystery. A rubric drop plus a retention drop is a priority alert. The three classes worth wiring:

  • Quality drop with outcome drop. Groundedness falls 5 points and abandonment rises 10 percent. Page the on-call.
  • Cost spike without quality gain. Per-session cost rises 30 percent without a measurable rubric improvement. Open a ticket.
  • Cluster-tagged outcome regression. Conversations in an Error Feed cluster show a retention drop versus baseline.
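The first two alert classes can be sketched as a small routing function over week-over-week deltas read from the joined view. The default thresholds mirror the numbers in the list above; min_gain_pts, the cutoff for a "measurable" rubric improvement, is a hypothetical parameter, and the cluster-tagged class is left out for brevity.

```python
def classify_alert(
    groundedness_delta_pts,
    abandonment_delta_pct,
    cost_delta_pct,
    quality_drop_pts=5.0,
    outcome_rise_pct=10.0,
    cost_rise_pct=30.0,
    min_gain_pts=1.0,
):
    """Route a set of week-over-week deltas to 'page', 'ticket', or 'none'."""
    quality_dropped = groundedness_delta_pts <= -quality_drop_pts
    outcome_dropped = abandonment_delta_pct >= outcome_rise_pct
    if quality_dropped and outcome_dropped:
        return "page"  # quality drop confirmed by an outcome drop
    if cost_delta_pct >= cost_rise_pct and groundedness_delta_pts < min_gain_pts:
        return "ticket"  # paying more without a measurable quality gain
    return "none"  # a lone rubric drop is treated as noise


joint_regression = classify_alert(-6.0, 12.0, 0.0)
rubric_only = classify_alert(-6.0, 2.0, 0.0)
cost_spike = classify_alert(0.2, 0.0, 35.0)
cost_with_gain = classify_alert(4.0, 0.0, 35.0)
```

A rubric drop without an outcome drop returns "none" on purpose: it stays visible on the dashboard without paging anyone.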

Anti-patterns to avoid

Five failure modes, all common, all fixable.

Replacing product analytics with LLM eval. “We have rubric scores, we do not need Mixpanel.” You just lost the user-outcome signal. Eval tells you whether each answer was faithful; it cannot tell you whether the user accomplished what they came for.

Replacing LLM eval with product analytics. “Retention dropped, we do not need rubrics.” You lost the per-rubric diagnosis. Retention drops have a hundred causes; without rubric scores you are guessing.

No shared identifier. Mixpanel events have session_id; traceAI spans do not. You can see both streams but cannot join them.

Cost telemetry separate from analytics. Aggregate provider spend is tracked; per-user and per-feature cost is not. Five lines that wire gateway headers into event properties recover weeks of lost visibility.

Eval alerts with no product context. Rubric-drop pages without a user-impact attachment get ignored, and then a real outcome regression hits the same channel and also gets ignored. Joint alerts fix this.

How Future AGI implements the bridge

Future AGI is built around exactly this split. traceAI and ai-evaluation live on the same self-hostable runtime, and both are Apache 2.0; the Agent Command Center gateway emits the cost telemetry that completes the picture.

  • traceAI is OTel-native and auto-instruments 50+ AI surfaces across Python, TypeScript, Java, and C# (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel included). session.id and user.id are standard span attributes that align with your product analytics keys, so the join key flows through both layers without translation. 14 span kinds (TOOL, RETRIEVER, AGENT, EVALUATOR, GUARDRAIL, VECTOR_DB, A2A_CLIENT, A2A_SERVER, plus the OTel base set).
  • ai-evaluation ships 60+ EvalTemplate classes as both pytest CI scorers and span-attached online scorers, so the same rubric definition runs offline (CI gate) and live (production sampling). Four distributed runners (Celery, Ray, Temporal, Kubernetes) hit a product-analytics-friendly cadence: nightly batch over yesterday’s traffic, per-PR gate against the regression set, 1-10% production sampling, all on the same rubric definitions.
  • Agent Command Center is the gateway layer. Canonical cost, latency, model, fallback, and routing headers on every response, OpenAI-compatible across 20+ providers, six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets. Capture the headers as event properties and the cost tile populates itself.
  • Error Feed lives inside the eval stack. HDBSCAN soft-clusters failing production traces; a Sonnet 4.5 Judge writes a 4-D trace score, a 5-category 30-subtype taxonomy classification, and an immediate_fix per cluster. Linear auto-tickets ship today; Slack, Jira, and PagerDuty are on the roadmap. The cluster IDs are the new dimension your product analytics joins on.
  • Self-improving evaluators on the Platform layer retune rubrics from production thumbs-up and thumbs-down feedback. Feedback comes in through your existing product UI, routes through the eval layer, improves the rubric automatically. The connector from your analytics tool is the small adapter you write once.

Most teams running both layers end up running three tools (analytics, tracing, eval) and stitching warehouse joins. Future AGI collapses two of those three onto one plane and gives the third (product analytics) the keys it needs to join cleanly. The observability vs evaluation vs benchmarking primer covers the adjacent boundaries.

Conclusion

Product analytics and LLM evaluation are two layers, not one. The first measures user behavior; the second measures system behavior. Compose them via a shared identifier, join in the warehouse, and build joint dashboards plus joint alerts. Teams that extend the analytics tool to cover quality lose 80 percent of the eval discipline; teams that swap analytics for eval lose all the user-outcome signal. The bridge is shippable this week with traceAI, ai-evaluation, and the Agent Command Center gateway. Pick the shared identifier first. Everything else compounds from there.

Frequently asked questions

What is the difference between LLM evaluation and product analytics?
Product analytics measures user behavior: did the user click, finish the funnel, retry, churn, retain. LLM evaluation measures system behavior: was the answer grounded, did the agent refuse correctly, did the tool call succeed. They answer different questions on different telemetry. Product analytics is an event stream with implicit ground truth from user actions. LLM eval is a span tree with explicit ground truth from a rubric and a labeled golden set. Optimizing one without the other ships a CTR-tuned hallucination factory, or a perfectly grounded answer to a question no user finished asking.
Can I just build an LLM quality dashboard in Mixpanel, Amplitude, PostHog, or Heap?
You can build the bottom 20 percent of one. Event-based analytics counts named user events and divides them. LLM eval scores rubric correctness on a span tree using an explicit golden set, a calibrated judge, and a CI gate. The shape of the data (flat event versus span tree), the source of truth (user action versus rubric prompt), and the action loop (UI change versus prompt or rubric change) do not match. Treat product analytics and LLM eval as two layers that join on a shared identifier, not one tool that does both badly.
How do I connect LLM eval scores to product outcomes like retention and CSAT?
Pick one shared identifier (session.id, user.id, or conversation_id) and emit it from both sides. Your product code emits the identifier into Mixpanel, Amplitude, PostHog, or Heap on every event. Your traceAI spans set the same value as a standard OTel attribute on every LLM call. Both streams land in the warehouse. A single join produces a row per conversation with rubric scores, cost, and outcome side-by-side. That row is the artifact the joint dashboards and joint alerts read from.
Which system needs LLM eval and which one needs product analytics?
Every shipping AI feature needs both, but the weighting differs. Internal tools and copilots where ground truth is verifiable lean heavier on LLM eval because the user-funnel signal is small. Consumer chat, support bots, and search lean heavier on product analytics because user behavior carries most of the outcome signal. Agentic workflows with tool calls and multi-step plans need the deepest LLM eval coverage because trajectory failures are invisible to event counters. RAG systems sit in the middle: groundedness from eval, abandonment from analytics, joined on session ID.
What does a joint dashboard for an AI feature actually contain?
Three tiles in one view, not three tabs. Quality: average per-rubric score (Groundedness, ContextAdherence, TaskCompletion) over the rolling window with a threshold band. Cost: per-call USD from the gateway, broken down by route, model, and fallback rate. Outcome: conversation resolution rate, abandonment, retry, thumbs-down from your product events. The three together answer whether you are shipping a quality, affordable, useful AI feature, or one of the three with the other two silently broken. Alert on rubric drop plus outcome drop together, not separately.
How does Future AGI bridge LLM eval and product analytics?
traceAI follows OpenTelemetry GenAI semantic conventions, so session.id and user.id on a span match the keys your product code already emits into Mixpanel, Amplitude, PostHog, or Heap. The ai-evaluation SDK writes per-rubric scores against the same identifiers. The Agent Command Center gateway emits canonical cost and routing headers (x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used) that you capture as event properties. Error Feed clusters failing traces with HDBSCAN and runs a Sonnet 4.5 Judge that writes an immediate_fix per cluster; cluster IDs become a new dimension you join to your product outcomes.