What Is an Evaluation Store?
An evaluation store records eval datasets, evaluator configs, scores, thresholds, trace links, labels, and run history for reproducible LLM quality checks.
An evaluation store is the system of record for LLM and agent evaluation data: datasets, evaluator configs, run results, thresholds, traces, labels, and score history. It is an evaluation-infrastructure concept used in offline eval pipelines and production quality monitoring. In FutureAGI, it maps to the sdk:Dataset workflow for replayable scoring. Instead of treating scores as disposable CI output, an evaluation store preserves row-level evidence so teams can compare releases, reproduce failures, audit decisions, and decide whether a prompt, model, retriever, or tool change is safe to ship.
Why Evaluation Stores Matter in Production LLM and Agent Systems
Without an evaluation store, quality data fragments across CI logs, spreadsheets, dashboards, trace tools, annotation queues, and one-off notebooks. A release can pass because the latest average score looks fine, while the row-level failures, evaluator version, threshold, and source traces are already gone. The next incident then starts with archaeology: which prompt ran, which model answered, which retriever index was active, and which evaluator produced the score?
The pain spreads quickly. Developers cannot reproduce failed eval rows. SREs see eval-fail-rate-by-cohort rise but cannot connect failures to the release that changed the prompt. Product teams lose confidence in A/B decisions because old score distributions were overwritten. Compliance teams cannot show why a safety gate passed on 2026-05-07.
Agentic systems make this harder than single-turn LLM calls. One user request may include planning, retrieval, tool selection, schema validation, model fallback, and a final answer. If the evaluation store does not preserve source_trace_id, dataset_version, evaluator_class, metric_threshold, and agent.trajectory.step, the team sees only the final failure. Common symptoms include green CI with rising thumbs-down rate, reruns that produce different pass rates, evaluator thresholds edited after the fact, and production traces that cannot be promoted into regression coverage.
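As a concrete illustration, the lineage fields named above can be modeled as a single row record. The sketch below uses the field names from this section; it is not a FutureAGI schema, and the shape is only illustrative.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalStoreRow:
    # Illustrative row shape; field names follow this section, not a fixed schema.
    input: str                       # the user request being evaluated
    output: str                      # the final answer that was scored
    source_trace_id: str             # lineage back to the production trace
    dataset_version: str             # which dataset snapshot the row belongs to
    evaluator_class: str             # e.g. "Groundedness"
    metric_threshold: float          # pass/fail cut applied at scoring time
    trajectory_step: Optional[int] = None  # agent.trajectory.step for multi-step agents
    score: Optional[float] = None    # filled in by the evaluator run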
How FutureAGI Handles Evaluation Stores
FutureAGI’s approach is to treat the evaluation store as a replayable reliability ledger, not a folder of score exports. The specific FAGI anchor is sdk:Dataset, implemented through fi.datasets.Dataset. Engineers create or import rows, add columns such as input, expected_response, context, rubric, cohort, source_trace_id, prompt_version, and dataset_version, then attach evaluator runs with Dataset.add_evaluation. The store keeps eval stats, run prompts, optimizations, and row-level scores together.
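In code, the workflow looks roughly like the sketch below. Dataset.get and Dataset.add_evaluation come from this section; the add_rows call and its payload shape are assumptions for illustration, so check the SDK reference for exact signatures.

from fi.datasets import Dataset
from fi.evals import Groundedness

store = Dataset.get("support-eval-store", version="2026-05-07")  # open the system-of-record dataset
# add_rows and its payload shape are assumptions, not the exact SDK signature
store.add_rows([{
    "input": "How do I get a refund?",
    "expected_response": "Refunds are issued within 5 business days.",
    "context": "Refund policy: ...",
    "cohort": "billing",
    "source_trace_id": "trace-7f3a",
    "prompt_version": "v42",
    "dataset_version": "2026-05-07",
}])
store.add_evaluation(Groundedness())  # attach a replayable evaluator run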
A real workflow: a support-agent team imports 12,000 production-derived cases into a Dataset. Each row keeps the retrieved context, final answer, tool call, trace id, and human-reviewed label. The team runs Groundedness for RAG answers, ContextRelevance for retrieved context, and ToolSelectionAccuracy for agent tool calls. A new prompt raises the global pass rate from 0.89 to 0.92, but the billing cohort drops below its metric_threshold of 0.86. The engineer opens the failed rows, sees that refund-tool traces changed after a tool schema update, and blocks the release until the regression eval passes.
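The cohort gate in that story is plain arithmetic over stored row results. A minimal sketch, with hypothetical rows, of why a rising global pass rate can still block a release:

from collections import defaultdict

# Hypothetical row-level results pulled from the store; shapes are illustrative.
rows = [
    {"cohort": "billing", "passed": False},
    {"cohort": "billing", "passed": True},
    {"cohort": "shipping", "passed": True},
]
thresholds = {"billing": 0.86, "shipping": 0.80}  # per-cohort metric_threshold values

totals, passes = defaultdict(int), defaultdict(int)
for row in rows:
    totals[row["cohort"]] += 1
    passes[row["cohort"]] += row["passed"]

# A release ships only if every cohort clears its own threshold,
# even when the global pass rate improved.
for cohort, total in totals.items():
    rate = passes[cohort] / total
    if rate < thresholds[cohort]:
        print(f"BLOCK release: {cohort} pass rate {rate:.2f} < {thresholds[cohort]}")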
Unlike an MLflow-only run table, an evaluation store must keep row evidence, evaluator configuration, thresholds, and production trace lineage in the same workflow. FutureAGI also connects traces from integrations such as traceAI-langchain, including fields like llm.token_count.prompt and agent.trajectory.step, so failed production traces can become reviewed dataset rows instead of screenshots in an incident doc. In our 2026 evals, the strongest evaluation stores answer three questions fast: what failed, why did it pass before, and which release changed the evidence.
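Promotion itself can be a small transform from a trace payload to a dataset row. The helper below is hypothetical: the attribute names match the traceAI fields cited above, but the trace dict shape is illustrative.

# Hypothetical promotion helper; the trace payload shape is illustrative.
def promote_trace_to_row(trace: dict) -> dict:
    return {
        "input": trace["input"],
        "output": trace["output"],
        "source_trace_id": trace["trace_id"],          # lineage back to production
        "trajectory_step": trace.get("agent.trajectory.step"),
        "prompt_tokens": trace.get("llm.token_count.prompt"),
        "label": None,  # set by a human reviewer before the row joins regression runs
    }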
How to Measure or Detect Evaluation Store Quality
Measure the store by whether it makes eval evidence reproducible and actionable:
- Replay success rate: percentage of prior eval runs that can be rerun with the same dataset version, evaluator class, threshold, and prompt version.
- Trace-to-row coverage: share of failed production traces with a linked source_trace_id that can be promoted into a reviewed Dataset row.
- Score reproducibility: difference between original and replayed Groundedness, ContextRelevance, or ToolSelectionAccuracy scores on the same rows.
- Eval-fail-rate-by-cohort: dashboard signal showing which dataset cohort, prompt version, retriever index, or tool route regressed.
- Decision lineage: every deploy decision should link to run_id, dataset_version, metric_threshold, and reviewer status.
- User-feedback proxy: rising thumbs-down or escalation rate with stable offline scores means the store may be missing live failure modes.
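The replay workflow itself is short: reopen a pinned dataset version and reattach the evaluators from the original run.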
from fi.datasets import Dataset
from fi.evals import Groundedness, ContextRelevance

# Pin the dataset version so the rerun scores exactly the same rows.
store = Dataset.get("support-eval-store", version="2026-05-07")

# Reattach the original evaluators for a like-for-like replay.
store.add_evaluation(Groundedness())
store.add_evaluation(ContextRelevance())
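Score reproducibility then reduces to comparing stored and replayed row scores. A sketch with hypothetical score records:

# Hypothetical per-row scores from the store; shapes are illustrative.
original = {"row-1": 0.91, "row-2": 0.78, "row-3": 0.85}  # original run
replayed = {"row-1": 0.91, "row-2": 0.74, "row-3": 0.85}  # same rows, replayed

# Mean absolute difference; nonzero drift means the replay is not like-for-like.
drift = sum(abs(original[r] - replayed[r]) for r in original) / len(original)
print(f"mean score drift: {drift:.3f}")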
Common Mistakes
- Saving only aggregate scores. A 0.91 pass rate is not enough; keep row inputs, outputs, references, evaluator configs, and thresholds.
- Overwriting evaluator definitions. If the judge prompt or evaluator version changes, old scores need their original config for fair comparison.
- Treating traces and evals separately. Production failures should carry trace ids into dataset rows, or regression coverage will lag incidents.
- Changing thresholds without history. Threshold edits should be versioned with reviewer notes and release context, not patched into dashboards; see the sketch after this list.
- Confusing storage with governance. A warehouse table becomes an evaluation store only when it preserves lineage, replay, thresholds, and decisions.
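A versioned threshold edit can be as small as an append-only record. The field names below are hypothetical:

# Hypothetical append-only threshold history entry; field names are illustrative.
threshold_history = [{
    "metric": "Groundedness",
    "cohort": "billing",
    "old_threshold": 0.84,
    "new_threshold": 0.86,
    "reviewer": "jsmith",
    "reason": "refund-tool schema change raised regression risk",
    "release": "support-agent v2.4",
    "changed_at": "2026-05-07",
}]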
Frequently Asked Questions
What is an evaluation store?
An evaluation store is the system of record for LLM and agent eval data, including datasets, evaluator configs, scores, thresholds, traces, labels, and score history. It lets teams reproduce failures and compare releases from row-level evidence.
How is an evaluation store different from a dataset?
A dataset holds rows and fields. An evaluation store includes datasets plus evaluator definitions, run metadata, score distributions, thresholds, production trace links, human labels, and release decisions.
How do you measure an evaluation store?
In FutureAGI, use the sdk:Dataset surface through fi.datasets.Dataset and attach evaluators such as Groundedness, ContextRelevance, and ToolSelectionAccuracy. Track replay success, eval-fail-rate-by-cohort, score reproducibility, and trace-to-row coverage.