What Is Exploratory Data Analysis (EDA)?
A first-pass investigation of dataset structure, distributions, gaps, outliers, labels, and cohort coverage before modeling or evaluation.
Exploratory data analysis (EDA) is the first-pass investigation of a dataset’s structure, distributions, missing values, outliers, labels, and cohort coverage before AI training, evaluation, or monitoring. It is a data reliability practice that shows up in eval pipelines, RAG corpus review, and production trace analysis. In FutureAGI, EDA usually starts in sdk:Dataset, where engineers inspect row quality and cohort coverage before trusting evaluator scores.
Why EDA Matters in Production LLM and Agent Systems
Skipping or rushing EDA turns dataset problems into model problems. A support agent may appear to fail billing questions because the eval set over-samples one obsolete policy. A RAG pipeline may look accurate because the dataset never includes low-recall queries, long-tail document types, or questions whose answer changed after a policy update. A classifier may pass aggregate accuracy while a protected or high-value cohort has a much higher false-refusal rate. The failure mode is not only dirty data; it is hidden data shape.
The pain is shared. Developers waste time tuning prompts against mislabeled rows. SREs see a rising thumbs-down rate without a matching latency or 5xx incident. Compliance teams cannot explain whether failures cluster around region, language, user type, or source policy. Product teams ship a model swap and later learn that the validation set was missing the new workflow the release was meant to improve.
Agentic systems make EDA more important because each row can represent a multi-step trace, not a single prompt. One dataset record may include user intent, retrieved context, tool calls, intermediate messages, final answer, cost, latency, and human feedback. Useful symptoms include high null rate in expected_response, skewed intent distribution, repeated near-duplicate traces, source documents with old policy_version, sudden long-context outliers, and evaluator failures concentrated in one cohort. Unlike a single-turn benchmark, an agent eval set needs EDA across both input data and trajectory evidence.
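A minimal sketch of that trace-level profiling, assuming the dataset rows have been exported to a pandas DataFrame. The `traces.csv` filename, the `prompt_tokens` column (standing in for a token-count trace attribute), and the `ACTIVE_POLICY` value are illustrative assumptions, not part of the FutureAGI SDK:

```python
import pandas as pd

# Assumed export of dataset/trace rows; column names mirror the fields described above.
df = pd.read_csv("traces.csv")

# Null rate in expected_response: rows that cannot support high-trust evals.
null_rate = df["expected_response"].isna().mean()

# Intent skew: a few intents dominating the set hides long-tail behavior.
intent_share = df["intent"].value_counts(normalize=True)

# Near-duplicate traces: exact-duplicate inputs are a cheap first proxy.
dup_rate = df.duplicated(subset=["input"]).mean()

# Stale sources: rows whose policy_version no longer matches the active policy.
ACTIVE_POLICY = "2024-06"  # illustrative value
stale_rate = (df["policy_version"] != ACTIVE_POLICY).mean()

# Long-context outliers: prompt token counts far above the typical row.
p99 = df["prompt_tokens"].quantile(0.99)
long_context = df[df["prompt_tokens"] > p99]

print(f"null expected_response: {null_rate:.1%} | duplicates: {dup_rate:.1%} | "
      f"stale policy: {stale_rate:.1%} | long-context rows: {len(long_context)}")
print(intent_share.head())
```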
How FutureAGI Handles Exploratory Data Analysis
FutureAGI’s approach is to connect EDA to the evaluation unit rather than treating it as a notebook that disappears after setup. The specific surface is sdk:Dataset, exposed in the SDK as fi.datasets.Dataset. A dataset can hold rows, columns, imported files, run prompts, evaluations, eval stats, and optimization history, so EDA findings can stay attached to the same rows later used for regression checks.
A real workflow: a health-support RAG team imports 20,000 anonymized questions into a FutureAGI Dataset with columns for input, expected_response, reference_context, source_url, policy_version, locale, intent, and cohort. During EDA, the engineer finds that Spanish-language rows are 4% of the dataset but 18% of production traffic, and that 11% of high-risk medication rows use an expired policy version. They do not tune the prompt first. They rebalance the eval cohort, mark stale rows for review, and attach ContextRelevance, Groundedness, and JSONValidation to the refreshed dataset.
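A sketch of the two checks that drove that decision, assuming the eval dataset and a sample of recent production traces are available as CSV exports with `locale`, `cohort`, and `policy_version` columns. The filenames, the `high_risk_medication` cohort label, and the `ACTIVE_POLICY` value are assumptions for illustration:

```python
import pandas as pd

eval_df = pd.read_csv("eval_dataset.csv")        # assumed export of the FutureAGI Dataset rows
prod_df = pd.read_csv("prod_traces_sample.csv")  # assumed sample of recent production traffic

# Locale coverage: where the eval set under-represents production traffic.
coverage = pd.concat(
    {
        "eval": eval_df["locale"].value_counts(normalize=True),
        "prod": prod_df["locale"].value_counts(normalize=True),
    },
    axis=1,
).fillna(0)
coverage["gap"] = coverage["prod"] - coverage["eval"]
print(coverage.sort_values("gap", ascending=False))  # e.g. es: 4% of eval rows vs 18% of traffic

# High-risk rows still pointing at an expired policy version.
ACTIVE_POLICY = "2024-06"  # illustrative value
stale_high_risk = eval_df[
    (eval_df["cohort"] == "high_risk_medication")
    & (eval_df["policy_version"] != ACTIVE_POLICY)
]
print(f"{len(stale_high_risk)} high-risk rows marked for review")
```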
When the refreshed dataset runs, the engineer tracks eval-fail-rate-by-cohort and inspects traces from traceAI-langchain with fields such as llm.token_count.prompt to separate retrieval bloat from answer-quality regression. If one cohort fails, the next action is targeted: update source documents, split the cohort into a golden dataset, set a release threshold, or route risky traffic through a fallback. Unlike Great Expectations, which mainly validates table contracts, this keeps distribution evidence tied to LLM and agent behavior.
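A sketch of the eval-fail-rate-by-cohort rollup, assuming per-row evaluator outcomes have been joined back onto the rows as a boolean `eval_passed` column; the filename, column names, and the 50-row floor are assumptions:

```python
import pandas as pd

# Assumed export of per-row evaluator outcomes joined back onto the dataset rows.
results = pd.read_csv("eval_results.csv")

# Fail rate per cohort; guard small cohorts with a minimum-row floor before gating a release.
by_cohort = (
    results.groupby("cohort")["eval_passed"]
    .agg(rows="count", fail_rate=lambda s: 1.0 - s.mean())
    .sort_values("fail_rate", ascending=False)
)
print(by_cohort[by_cohort["rows"] >= 50])  # the 50-row floor is an illustrative threshold
```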
How to Measure or Detect EDA Findings
EDA produces signals, not one universal score. Track them at row, cohort, and release-gate level:
- Schema and field health: missing `input`, `expected_response`, `reference_context`, `source_url`, or `policy_version` fields should block high-trust eval use.
- Distribution drift: compare current rows with a baseline distribution; use the population stability index when cohort mix changes between dataset versions (a PSI sketch follows the evaluator example below).
- Outlier rate: flag extremely long prompts, high token counts, unusually large retrieved contexts, and rare tool paths before scoring aggregates.
- Evaluator concentration: `ContextRelevance` returns context-quality evidence for retrieved passages; cohort clusters of low scores point to corpus or query coverage gaps (a single-row check is shown after this list).
- Dashboard signals: monitor duplicate-row rate, stale-source rate, reviewer-disagreement rate, missing-trace-attribute rate, and eval-fail-rate-by-cohort.
- User-feedback proxy: sample thumbs-down, escalation, refund, and correction traces back into the dataset, then rerun EDA on the newly added rows.
For example, a single dataset row can be checked during EDA like this:

```python
from fi.evals import ContextRelevance

evaluator = ContextRelevance()

# `row` is one dataset record with `input` and `reference_context` fields.
result = evaluator.evaluate(
    input=row["input"],
    context=row["reference_context"],
)
print(result.score, result.reason)
```
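The population stability index mentioned above can be computed in a few lines. This is a minimal sketch, assuming baseline and current cohort counts are available as pandas Series; the locale counts below are illustrative values:

```python
import numpy as np
import pandas as pd

def psi(baseline_counts: pd.Series, current_counts: pd.Series, eps: float = 1e-6) -> float:
    """Population stability index between two categorical (cohort) distributions."""
    cats = baseline_counts.index.union(current_counts.index)
    b = baseline_counts.reindex(cats, fill_value=0).astype(float) + eps
    c = current_counts.reindex(cats, fill_value=0).astype(float) + eps
    b, c = b / b.sum(), c / c.sum()
    return float(((c - b) * np.log(c / b)).sum())

# Illustrative counts: locale mix in the previous vs refreshed dataset version.
baseline = pd.Series({"en": 9200, "es": 400, "fr": 400})
current = pd.Series({"en": 8000, "es": 1500, "fr": 500})
print(f"PSI = {psi(baseline, current):.3f}")  # > 0.2 is a common rule of thumb for significant shift
```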
Common Mistakes
- Starting with charts instead of questions. Profile cohorts tied to release risk: locale, intent, tool path, policy version, account type, and model version.
- Trusting aggregate averages. A clean overall score can hide a failing high-risk cohort that represents only 3% of the dataset.
- Cleaning before profiling. Removing outliers first can erase exactly the traces that explain runaway cost, context overflow, or refusal spikes.
- Skipping label review after EDA. Distribution checks do not prove `expected_response` is correct; sample labels where failures or outliers cluster.
- Letting notebook findings drift away. Store EDA-derived cohort tags on the dataset so regression evals can keep using them (a sketch follows this list).
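A sketch of that last point: derive cohort tags during EDA and persist them as dataset columns so later regression runs can slice on them. The filenames, the tag logic, and the active policy version are assumptions:

```python
import pandas as pd

df = pd.read_csv("eval_dataset.csv")  # assumed export of the dataset rows

# Persist EDA findings as explicit columns instead of leaving them in a notebook.
df["cohort"] = df["locale"].fillna("unknown") + "_" + df["intent"].fillna("unknown")
df["stale_source"] = df["policy_version"] != "2024-06"  # illustrative active policy version

# Re-import the tagged rows so regression evals keep grouping by the same cohorts.
df.to_csv("eval_dataset_with_cohorts.csv", index=False)
```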
Frequently Asked Questions
What is exploratory data analysis (EDA) in AI?
Exploratory data analysis is the first-pass investigation of a dataset's structure, distributions, missing values, labels, outliers, and cohort coverage. In AI reliability, it helps teams catch biased cohorts, stale labels, retrieval gaps, and drift before eval scores become misleading.
How is EDA different from data cleaning?
EDA discovers what the data looks like and where risk sits; data cleaning fixes selected issues. A good workflow profiles distributions and cohorts first, then cleans only the records or fields that affect evaluation, training, or monitoring decisions.
How do you measure EDA findings with FutureAGI?
Use `sdk:Dataset` to version rows and attach cohort metadata, then track missing-field rate, outlier rate, eval-fail-rate-by-cohort, and evaluator signals such as `ContextRelevance` or `JSONValidation` after EDA finds risky slices.