What Is Data Granularity?
The level of detail at which data is captured, stored, and analyzed — from per-event traces to coarse per-tenant aggregates.
Data granularity is the level of detail at which data is captured, stored, and analyzed. High granularity means many fine-grained rows — per-event traces, per-span attributes, per-token counts, per-row evaluator scores. Low granularity means coarse aggregates — per-day, per-tenant, per-cohort. Granularity is a design choice that determines which questions a system can answer, how much storage and compute it costs, and which privacy obligations apply. In LLM observability, FutureAGI captures span-level granularity by default so engineers can drill into evaluator failures rather than reach for an aggregate that hides them.
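As a minimal, illustrative sketch (field names here are hypothetical, not FutureAGI's schema), the difference shows up in what each shape of record can answer:

# Illustrative records only; field names are hypothetical, not a real schema.
span_row = {
    "trace_id": "t-123", "span_id": "s-7", "tenant": "acme",
    "agent.trajectory.step": 3, "llm.token_count.prompt": 412,
    "evaluator": "groundedness", "score": 0.31,
}  # can answer: which step failed, for which tenant, at what cost
daily_aggregate = {"date": "2026-02-10", "eval_fail_rate": 0.03}
# can answer only: how bad was the whole day, on average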
Why Data Granularity Matters in Production LLM and Agent Systems
When something breaks, granularity determines what you can investigate. A daily aggregate showing a 3% eval-fail rate tells you nothing about whether one tenant is at 30% and the rest are at 0%. A token-cost dashboard with only weekly totals masks a 10x cost spike that happened on Tuesday afternoon. A trace recorded at workflow level instead of span level cannot tell you which retrieval call returned the wrong policy.
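A short sketch with hypothetical row-level eval results makes the arithmetic concrete: the same rows that produce a reassuring 3% overall also contain a 30% tenant-level failure.

# Hypothetical row-level eval results: (tenant, passed)
rows = [("acme", False)] * 3 + [("acme", True)] * 7 + [("other", True)] * 90

overall = sum(not ok for _, ok in rows) / len(rows)
print(f"overall fail rate: {overall:.0%}")  # 3% -- looks fine

by_tenant: dict[str, tuple[int, int]] = {}
for tenant, ok in rows:
    fails, total = by_tenant.get(tenant, (0, 0))
    by_tenant[tenant] = (fails + (not ok), total + 1)
for tenant, (fails, total) in by_tenant.items():
    print(f"{tenant}: {fails / total:.0%} fail rate")  # acme: 30%, other: 0%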
The pain is concrete. ML engineers cannot reproduce a regression because the aggregated logs lost the offending row. SREs see latency aggregates that smooth over a cohort outage. Product teams launch a feature, see flat metrics, and miss that the new flow is silently degrading the same 2% of users every day. Compliance teams need request-level evidence under audit; aggregate logs cannot satisfy “show me the policy version applied to this specific user.”
In 2026 agent stacks, granularity matters more because agent trajectories produce 5–20 spans per request. Aggregating to “request” level loses the step where the failure happened. The opposite extreme — capturing every byte of every embedding lookup — is wasteful and creates privacy exposure. Telltale symptoms of wrong granularity: regressions that surface only when you drill into traces, dashboards where cohort filters reveal patterns invisible in totals, and audit requests that the data store can’t answer.
How FutureAGI Handles Data Granularity
FutureAGI’s approach is “capture high, aggregate up.” Traces from traceAI-langchain, traceAI-openai-agents, and traceAI-mcp record span-level data with attributes including agent.trajectory.step, llm.token_count.prompt, llm.token_count.completion, span duration, evaluator name, and decision. Each Dataset row stores its source id, ingestion timestamp, reviewer, evaluator output, and bin labels.
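In plain OpenTelemetry terms (the traceAI-* packages instrument this automatically; the eval.* keys below are hypothetical stand-ins for the evaluator name and decision attributes), a span-level record looks roughly like this:

from opentelemetry import trace  # assumes the opentelemetry-api package

tracer = trace.get_tracer("granularity-example")

# The traceAI-* instrumentations set attributes like these automatically;
# writing them by hand here only shows the shape of a span-level record.
with tracer.start_as_current_span("agent.step") as span:
    span.set_attribute("agent.trajectory.step", 3)
    span.set_attribute("llm.token_count.prompt", 412)
    span.set_attribute("llm.token_count.completion", 96)
    span.set_attribute("eval.name", "groundedness")  # hypothetical key
    span.set_attribute("eval.decision", "fail")      # hypothetical key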
A practical workflow: an SRE sees an aggregate eval-fail-rate climb from 2% to 4% on a Groundedness metric. They drill from the dashboard’s daily aggregate into per-route, then per-prompt-version, then per-trace, then per-span. The granular trace shows the failing step had retrieved a particular vendor source. The fix is targeted: quarantine the source, add a regression eval, and re-run only the affected cohort. Without span-level granularity, the same investigation would have stopped at “the metric got worse.”
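The same drill-down can be reproduced on exported span rows. A sketch assuming the rows land in a pandas DataFrame (column names are illustrative):

import pandas as pd

# Hypothetical export of span-level rows; column names are illustrative.
spans = pd.DataFrame([
    {"route": "/chat", "prompt_version": "v12", "trace_id": "t1", "span_id": "s3",
     "passed": False, "retrieved_source": "vendor-docs-v3"},
    {"route": "/chat", "prompt_version": "v12", "trace_id": "t2", "span_id": "s3",
     "passed": True, "retrieved_source": "internal-kb"},
])

# Each level narrows the aggregate toward evidence.
print(spans.groupby("route")["passed"].mean())                      # per route
print(spans.groupby(["route", "prompt_version"])["passed"].mean())  # per prompt version
failing = spans[~spans["passed"]]                                   # per trace, per span
print(failing[["trace_id", "span_id", "retrieved_source"]])         # the offending step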
Agent Command Center routing policies use granular data too: cost-optimized routing decides per request, not per tenant, by reading the prompt-token bin, and model fallback triggers on per-trace evaluator scores, not on hourly averages. Unlike Prometheus recording rules configured only as daily rollups, FutureAGI’s design preserves span-level evidence with retention controls so privacy obligations are still met. The engineer’s next move is concrete: drill into the trace, run a regression eval against the affected slice, and tighten the route or rubric.
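Agent Command Center's actual policy engine isn't reproduced here; the sketch below only shows the decision shape the paragraph describes: route per request from the token bin, fall back from this trace's scores rather than an hourly average.

def route_model(prompt_tokens: int) -> str:
    """Hypothetical cost-optimized routing: decided per request, not per tenant."""
    return "small-model" if prompt_tokens < 1_000 else "large-model"

def should_fall_back(trace_eval_scores: list[float], threshold: float = 0.5) -> bool:
    """Hypothetical fallback: fires on this trace's scores, not an hourly average."""
    return any(score < threshold for score in trace_eval_scores)

print(route_model(412))                    # small-model
print(should_fall_back([0.9, 0.31, 0.8]))  # True: one failing step is enough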
How to Measure or Detect Data Granularity
Granularity itself is observable as a property of your data store:
- Span-attribute coverage — share of spans carrying agent.trajectory.step, llm.token_count.*, evaluator score, route, and prompt version.
- Drill-down depth — number of dashboard levels (route → prompt-version → trace → span) supported without re-aggregating.
- AggregatedMetric outputs — bins computed on top of granular rows; if an aggregator cannot reproduce a daily total from the raw rows, granularity is broken.
- Audit-readiness — time to answer a “show me request X” query end-to-end.
- Storage cost per million spans — granularity has a price; track it so the retention policy stays defensible.
from fi.evals import AggregatedMetric, GroundTruthMatch

# Bins are computed on top of row-level GroundTruthMatch scores;
# aggregate up from granular rows, and keep those rows queryable.
agg = AggregatedMetric(metrics=[GroundTruthMatch()])
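A sketch of the first metric above, span-attribute coverage, computed over hypothetical span dicts:

# Hypothetical span dicts; the required-attribute set is illustrative.
required = {"agent.trajectory.step", "llm.token_count.prompt", "eval.name"}
spans = [
    {"agent.trajectory.step": 1, "llm.token_count.prompt": 412, "eval.name": "groundedness"},
    {"llm.token_count.prompt": 97},  # half a record: no step, no evaluator
]
coverage = sum(required.issubset(row) for row in spans) / len(spans)
print(f"span-attribute coverage: {coverage:.0%}")  # 50%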
Common Mistakes
- Capturing only aggregates. Daily totals are cheap and useless when a regression hits.
- Capturing too much without retention. Span-level granularity needs a retention policy or the bill gets ugly fast; one common rollup pattern is sketched after this list.
- Skipping span attributes. A span without agent.trajectory.step or an evaluator name is half a record.
- Confusing granularity with privacy. Coarse data can still expose individuals via re-identification; granularity choice does not replace minimization.
- Aggregating at write time. Pre-aggregating destroys evidence that auditors and engineers later need.
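For the retention point above, one generic pattern (a sketch, not a FutureAGI feature) is to keep raw spans inside a window and reduce expired ones to an aggregate before deleting them:

from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # illustrative window

def roll_up(spans: list[dict], now: datetime) -> tuple[list[dict], dict]:
    """Keep recent spans raw; reduce expired ones to an aggregate before deletion."""
    fresh = [s for s in spans if now - s["ts"] < RETENTION]
    expired = [s for s in spans if now - s["ts"] >= RETENTION]
    fail_rate = (sum(not s["passed"] for s in expired) / len(expired)) if expired else None
    return fresh, {"expired_spans": len(expired), "fail_rate": fail_rate}

now = datetime.now(timezone.utc)
fresh, summary = roll_up([{"ts": now - timedelta(days=45), "passed": False}], now)
print(summary)  # {'expired_spans': 1, 'fail_rate': 1.0}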
Frequently Asked Questions
What is data granularity?
Data granularity is the level of detail at which data is captured, stored, and analyzed. High granularity means fine-grained rows like per-token or per-span; low granularity means coarse aggregates like per-day or per-tenant.
Why does granularity matter for LLM observability?
Without span-level granularity you cannot drill into a failure to find which step or which evaluator failed. Coarse aggregates hide cohort-specific regressions and make root-cause analysis nearly impossible.
How does FutureAGI handle granularity?
FutureAGI traces capture span-level granularity — agent.trajectory.step, llm.token_count.*, evaluator scores per row — and let engineers aggregate up. The default is high granularity with controlled retention.