What Is EDA (Exploratory Data Analysis)?

EDA — Exploratory Data Analysis — is the discipline of summarizing, visualizing, and stress-testing a dataset before modeling, to understand its distributions, missing values, outliers, label imbalances, and structural quirks. In an LLM context, the data being explored includes prompts, responses, retrieved chunks, evaluator scores, and trace metadata. FutureAGI does not ship a notebook environment, but exposes the underlying data EDA depends on: every fi.datasets.Dataset, evaluator score, and trace span is queryable, exportable, and pinned to a version so any analysis is reproducible across runs.

Why EDA Matters in Production LLM and Agent Systems

Skipping EDA is the single most common cause of misleading offline evaluation. A team builds a “golden dataset,” runs evaluators against it, declares a release ready — and three weeks later production complaints surface a cohort the dataset never included. Or worse: an evaluator reports a 0.92 average score on a dataset that turns out to be 80% one user segment, and the score collapses on real traffic.

ML engineers feel this when their offline benchmarks disagree with production telemetry. SREs see it as long-tail user-cohort failures that look random but are actually concentrated in segments under-represented in the training and eval sets. Product managers see it as the “we shipped this and got blindsided” pattern. Compliance teams care because audit-ready evaluation requires evidence that the eval set is representative — and EDA produces that evidence.

In 2026 LLM stacks, EDA also covers retrieval and trace data. EDA on prompt-token-length distributions reveals when context-window overflow risk is concentrated in one user segment. EDA on retrieval-recall scores reveals when the top-K threshold is wrong for a specific query class. EDA on Toxicity and PromptInjection evaluator distributions reveals shifts in attack patterns weeks before a guardrail breach. None of these signals come from “looking at the model”; they come from examining the data the model touches.
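
As a sketch of the first of those checks, and assuming the traced rows have been exported to a DataFrame with prompt_tokens and user_segment columns (illustrative names, not a fixed FutureAGI schema), a few lines of pandas show whether overflow risk is concentrated in one segment:

import pandas as pd

# Hypothetical CSV export of traced rows; column names are assumptions
df = pd.read_csv("support_traces_q1.csv")

CONTEXT_BUDGET = 8_000  # illustrative token budget for the target model

# Fraction of prompts over budget, per user segment
overflow_rate = (
    (df["prompt_tokens"] > CONTEXT_BUDGET)
    .groupby(df["user_segment"])
    .mean()
    .sort_values(ascending=False)
)
print(overflow_rate)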

How FutureAGI Handles EDA Inputs

Because EDA is an analysis practice rather than a runtime surface, FutureAGI does not run EDA itself. FutureAGI’s approach is to provide the substrate — versioned datasets, evaluator outputs, and trace metadata — that makes EDA reliable. Every fi.datasets.Dataset has a row schema, a checksum, and a version; rows can be exported to CSV or pulled directly into pandas, Polars, or dplyr for analysis. Evaluator scores are stored alongside the row, so distributions across Groundedness, AnswerRelevancy, PromptInjection, and any custom evaluator are first-class data.

A real workflow: an LLM team building a customer-support assistant exports a 10K-row Dataset covering one quarter of production traces sampled through traceAI-langchain. They run EDA in a notebook — distribution of prompt lengths, missing-context rate, evaluator-score histograms by user segment, outlier detection on response latency. They find that one segment has 4x the empty-context rate of the others; that becomes a data-quality ticket against the retriever and a cohort filter for the next eval run.
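
A minimal sketch of that empty-context check, assuming the export carries context and user_segment columns (hypothetical names):

import pandas as pd

# Hypothetical CSV export of the 10K-row Dataset described above
df = pd.read_csv("customer_support_q1_traces.csv")

# Treat null or whitespace-only context as empty
df["empty_context"] = (
    df["context"].isna() | df["context"].fillna("").str.strip().eq("")
)

# Empty-context rate per user segment; the 4x outlier surfaces at the top
print(df.groupby("user_segment")["empty_context"].mean().sort_values(ascending=False))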

The pattern is the same for evaluators. FutureAGI surfaces a per-evaluator score distribution per Dataset version; comparing the distribution across versions exposes drift, threshold mismatches, and label-rule changes that would otherwise hide inside an aggregated mean. We’ve found that requiring a documented EDA pass before promoting a Dataset version to “evaluation canon” eliminates a class of release-decision errors — the ones where the team thought a number was meaningful but it was hiding a 30% empty-context rate.
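
One way to make the cross-version comparison concrete is a two-sample test on an evaluator's score distribution; the file and column names below are assumptions, not FutureAGI API surface:

import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical exports of two pinned Dataset versions
v2_scores = pd.read_csv("support_dataset_v2.csv")["evaluator_groundedness"].dropna()
v3_scores = pd.read_csv("support_dataset_v3.csv")["evaluator_groundedness"].dropna()

# Kolmogorov-Smirnov test: a small p-value flags a distribution shift
# between versions even when the two means happen to match
result = ks_2samp(v2_scores, v3_scores)
print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")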

How to Measure or Detect EDA-Surfaced Issues

EDA itself is qualitative, but the issues it surfaces become measurable signals you can monitor:

  • Schema validation — every Dataset exports a column schema; mismatch with the consumer’s expected schema is an explicit failure.
  • Per-cohort row-count balance — confirms the Dataset is representative of production traffic; flag any cohort under 5% of expected.
  • Evaluator-score histograms — distribution of Groundedness, AnswerRelevancy, PromptInjection scores per Dataset version; bimodal distributions usually mean two underlying populations are being mixed.
  • Outlier detection on prompt and response lengths — long-tail prompts often correlate with context-overflow risk.
  • Missing-field rate — empty context, missing expected_response, null evaluator output; each is a data-quality ticket.
  • Cross-Dataset drift — comparing distributions between Dataset v1 and v2 surfaces silent data changes.
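
The snippet below is a minimal starting point for the first few checks: it pulls a pinned Dataset version into pandas and prints a length summary and a per-segment evaluator mean (column names are illustrative).
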
from fi.datasets import Dataset

# Pull a dataset version into pandas for EDA
dataset = Dataset.from_id(dataset_id="customer_support_q1_2026_v3")
df = dataset.to_dataframe()
print(df["prompt_length"].describe())
print(df.groupby("user_segment")["evaluator_groundedness"].mean())

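A hedged follow-up to the checklist: cohort balance and missing-field rate on the same export (the expected shares and column names are illustrative assumptions):

from fi.datasets import Dataset

# Re-pull the same pinned version so this sketch stands alone
df = Dataset.from_id(dataset_id="customer_support_q1_2026_v3").to_dataframe()

# Expected production traffic share per cohort (illustrative assumptions)
expected_share = {"enterprise": 0.40, "smb": 0.35, "free_tier": 0.25}

observed_share = df["user_segment"].value_counts(normalize=True)
for cohort, expected in expected_share.items():
    observed = observed_share.get(cohort, 0.0)
    if observed < 0.05 * expected:  # "under 5% of expected" from the checklist
        print(f"UNDER-REPRESENTED: {cohort} observed={observed:.3f}, expected={expected:.2f}")

# Missing-field rate per column; each non-zero entry is a data-quality ticket
print(df[["context", "expected_response"]].isna().mean())
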
Common Mistakes

  • Treating EDA as a one-time activity. Production data shifts; EDA on the eval Dataset must be redone every refresh, not just at v1.
  • Eyeballing means without distributions. A mean of 0.85 hides whether scores are tightly clustered or bimodal; always plot the histogram.
  • Skipping per-cohort splits. A balanced overall dataset can be unbalanced inside the cohorts that matter.
  • Reporting numbers without a Dataset version pinned. Without the version, the same number means different things on different days.
  • No documented exit criteria. “We did EDA” is not a release gate; document what was checked and what passed.

Frequently Asked Questions

What is EDA?

EDA — Exploratory Data Analysis — is the practice of summarizing, visualizing, and stress-testing a dataset before modeling. It uncovers distributions, missing values, outliers, label imbalances, and structural quirks that would otherwise corrupt downstream training or evaluation.

How is EDA different from feature engineering?

EDA is descriptive — it surfaces what the data looks like. Feature engineering is prescriptive — it transforms or constructs new columns based on what EDA revealed. EDA usually precedes feature engineering and informs every choice made in it.

How does EDA fit into an LLM evaluation workflow?

EDA on prompts, responses, retrieved chunks, evaluator scores, and trace metadata exposes systemic gaps — empty contexts, oversized prompts, biased cohorts. FutureAGI exposes Datasets, evaluator outputs, and trace spans as queryable artifacts so EDA stays reproducible.