What Is Data Cleaning (ML)?
The process of fixing or removing bad dataset rows before ML training, LLM evaluation, retrieval, or production trace analysis.
Data cleaning in ML is the process of finding, fixing, or removing inaccurate, duplicated, missing, mislabeled, or malformed records before data is used in an eval pipeline, retrieval index, training set, or production trace review. It is a data reliability practice: the goal is to keep model and agent measurements from being distorted by bad rows. In FutureAGI, cleaned data is usually stored as a versioned fi.datasets.Dataset so engineers can attach evaluators and compare changes safely.
Why Data Cleaning Matters in Production LLM and Agent Systems
Dirty data makes the measurement layer noisy before the model is involved. A duplicated support transcript can overweight one failure class in a regression eval. A missing expected_tool field can make a tool-selection score look better than it is because the row is silently skipped. A mislabeled refusal case can teach a judge rubric that unsafe answers are acceptable. The result is not one obvious crash; it is false confidence.
The pain spreads across roles. Developers chase prompt regressions that are really label defects. SREs see eval-fail-rate-by-cohort move without any matching latency, token-cost, or error-rate change. Compliance teams cannot audit why one row was excluded or corrected. Product teams ship a retriever update because aggregate faithfulness looks stable, then learn that cleaned versus uncleaned rows tell different stories.
In 2026 agent systems, data cleaning is harder because the unit of data is no longer a single input/output pair. A trace may contain user intent, retrieved chunks, intermediate tool calls, memory reads, final answer, human feedback, and escalation outcome. If any link is malformed, downstream evaluators can reward the wrong behavior. Common symptoms include duplicate trace IDs, inconsistent enum values, empty reference contexts, impossible timestamps, and sudden score jumps after an ETL job.
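The symptoms above can be caught with a lightweight audit pass before traces enter a dataset. This is a minimal sketch over plain dict rows; the field names (`trace_id`, `intent`, `reference_context`, `started_at`, `ended_at`) and the intent enum are illustrative, not a fixed FutureAGI trace schema.

```python
from collections import Counter

# Illustrative enum of allowed intent labels; in practice this would come
# from the dataset's schema definition.
VALID_INTENTS = {"billing_dispute", "refund", "technical_support"}

def audit_traces(traces):
    """Return a dict of symptom -> offending trace_ids for common defects."""
    issues = {"duplicate_id": [], "bad_enum": [], "empty_context": [], "bad_timestamps": []}
    counts = Counter(t["trace_id"] for t in traces)
    issues["duplicate_id"] = [tid for tid, n in counts.items() if n > 1]
    for t in traces:
        if t.get("intent") not in VALID_INTENTS:
            issues["bad_enum"].append(t["trace_id"])
        if not t.get("reference_context"):
            issues["empty_context"].append(t["trace_id"])
        # Impossible timestamps: the trace ends before it starts.
        if t.get("ended_at", 0) < t.get("started_at", 0):
            issues["bad_timestamps"].append(t["trace_id"])
    return issues

traces = [
    {"trace_id": "a1", "intent": "refund", "reference_context": "doc#7",
     "started_at": 100, "ended_at": 130},
    {"trace_id": "a1", "intent": "Refund!", "reference_context": "",
     "started_at": 200, "ended_at": 150},
]
print(audit_traces(traces))
```

Quarantining the flagged trace IDs, rather than deleting them, keeps the evidence for later review.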
How FutureAGI Handles Data Cleaning
FutureAGI’s approach is to treat cleaning as a dataset workflow, not a one-off notebook. The anchor surface for this entry is the Dataset object, exposed in the SDK as fi.datasets.Dataset. A team imports CSV, JSON, Hugging Face data, or sampled traces into a Dataset, adds columns such as prompt, expected_response, reference_context, expected_tool, source, row_status, and data_version, then attaches evaluations and eval stats to the same object.
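As a sketch of what one such row carries, the columns listed above can be modeled as a plain dataclass. This is an illustrative layout only; the actual fi.datasets.Dataset column API may differ.

```python
from dataclasses import dataclass

@dataclass
class DatasetRow:
    """One cleaned-dataset row, mirroring the columns named above."""
    prompt: str
    expected_response: str
    reference_context: str = ""
    expected_tool: str = ""
    source: str = "sampled_trace"
    row_status: str = "active"      # e.g. active | needs_review | quarantined
    data_version: str = "v1"        # bump on every relabeling pass

row = DatasetRow(
    prompt="Where is my refund?",
    expected_response="Your refund was issued on the 12th.",
    expected_tool="lookup_order",
)
print(row.row_status, row.data_version)
```

Keeping `row_status` and `data_version` on every row is what lets later evals compare cleaned and uncleaned cohorts instead of silently mixing them.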
A concrete workflow: a support agent team samples failed traceAI-langchain traces into a FutureAGI Dataset before a model migration. Their cleaning pass removes duplicate trace IDs, normalizes intent labels, fills missing reference_context, and marks ambiguous rows as needs_review instead of letting them bias the eval. They then run FieldCompleteness to catch missing required fields, TypeCompliance to catch schema drift, and GroundTruthMatch on cleaned rows with approved references. The release gate is simple: no migration if completeness drops below 0.99 or if the cleaned billing_dispute cohort regresses more than 2 percentage points.
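The release gate at the end of that workflow is easy to express in code. A minimal sketch, using the thresholds stated above; in practice the completeness score and cohort scores would come from the eval stats attached to the cleaned Dataset, not from hard-coded numbers.

```python
def migration_allowed(completeness, cohort_scores_before, cohort_scores_after,
                      min_completeness=0.99, max_regression_pp=2.0):
    """Block the migration if completeness drops or any cleaned cohort regresses."""
    if completeness < min_completeness:
        return False
    for cohort, before in cohort_scores_before.items():
        after = cohort_scores_after.get(cohort, 0.0)
        # Regression measured in percentage points, per the gate above.
        if (before - after) * 100 > max_regression_pp:
            return False
    return True

ok = migration_allowed(
    completeness=0.995,
    cohort_scores_before={"billing_dispute": 0.90},
    cohort_scores_after={"billing_dispute": 0.89},  # about 1 pp, within the gate
)
print(ok)
```

The point of encoding the gate is that it fails loudly: a dropped completeness score or a regressed cohort blocks the migration instead of being averaged away.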
Unlike Great Expectations or Pandera, which are strong for tabular validation, FutureAGI keeps the cleaned dataset connected to LLM outputs, traces, evaluator results, and regression history. The engineer’s next action is not “the dataframe passed”; it is alert, send suspect rows to annotation, rerun the regression eval, or block the prompt/model change.
How to Measure or Detect Data Cleaning Problems
Measure data cleaning by tracking the defects it removes and the eval variance it reduces:
- Null and missing-field rate: share of rows missing required columns such as expected_response, reference_context, expected_tool, or cohort metadata.
- Duplicate-rate by key: duplicate trace_id, prompt hash, document ID, or user-feedback event. Duplicates skew regression results.
- FieldCompleteness score: checks whether required structured fields are present before a row enters an eval run.
- TypeCompliance score: catches schema drift, such as strings in numeric score columns or invalid enum values.
- Eval-fail-rate-by-cohort after cleaning: compare scores before and after cleaning; large movement means prior evals were polluted.
- User-feedback proxy: watch thumbs-down rate and escalation rate for cohorts whose rows were recently cleaned or relabeled.
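The first two metrics in the list are simple to compute over raw rows. A minimal sketch over plain dicts; the required-field names are examples, not a mandated schema.

```python
from collections import Counter

REQUIRED = ("expected_response", "reference_context", "expected_tool")

def missing_field_rate(rows):
    """Share of rows missing (or holding an empty value in) a required column."""
    bad = sum(1 for r in rows if any(not r.get(f) for f in REQUIRED))
    return bad / len(rows)

def duplicate_rate(rows, key="trace_id"):
    """Share of rows whose key value appears more than once."""
    counts = Counter(r[key] for r in rows)
    dupes = sum(n for n in counts.values() if n > 1)
    return dupes / len(rows)

rows = [
    {"trace_id": "t1", "expected_response": "a", "reference_context": "c", "expected_tool": "search"},
    {"trace_id": "t1", "expected_response": "a", "reference_context": "c", "expected_tool": "search"},
    {"trace_id": "t2", "expected_response": "b", "reference_context": "", "expected_tool": "search"},
    {"trace_id": "t3", "expected_response": "b", "reference_context": "c", "expected_tool": "search"},
]
print(missing_field_rate(rows), duplicate_rate(rows))
```

Tracking these two numbers per data_version makes it visible whether a cleaning pass actually improved the dataset or merely shuffled it.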
At the row level, a quick completeness and type check looks like this (here `prompt`, `answer`, `metadata`, and `dataset_schema` are assumed to be defined earlier in the pipeline):

```python
from fi.evals import FieldCompleteness, TypeCompliance

row = {"prompt": prompt, "expected_response": answer, "metadata": metadata}

# Flag rows missing required structured fields before they enter an eval run.
field_result = FieldCompleteness(fields=["prompt", "expected_response"]).evaluate(row)

# Catch schema drift such as wrong types or invalid enum values.
type_result = TypeCompliance(schema=dataset_schema).evaluate(row)

print(field_result.score, type_result.score)
```
Common Mistakes
Engineers usually clean too late, too broadly, or without preserving the original evidence.
- Deleting hard rows instead of quarantining them. Hard rows often contain the failure modes your eval suite needs most.
- Cleaning labels without a version bump. If data_version does not change, old and new pass rates become impossible to compare.
- Normalizing away cohort signals. Free-text locale, channel, or customer-tier values may look messy but explain score movement.
- Deduplicating only exact prompts. Agent traces can duplicate the same failure with different phrasing, tools, or retrieved chunks.
- Mixing training and eval cleanup. A row fixed for fine-tuning may be invalid for holdout evaluation because it leaks expected answers.
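The exact-prompt deduplication pitfall above is worth a concrete sketch: flag prompt pairs that survive exact-match dedup but are near-duplicates after light normalization. The 0.9 similarity threshold is an arbitrary starting point, not a recommended value.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def near_duplicate_pairs(prompts, threshold=0.9):
    """Return index pairs of prompts that are near-duplicates after normalization."""
    flagged = []
    norm = [normalize(p) for p in prompts]
    for i in range(len(norm)):
        for j in range(i + 1, len(norm)):
            if SequenceMatcher(None, norm[i], norm[j]).ratio() >= threshold:
                flagged.append((i, j))
    return flagged

prompts = [
    "Why was my card charged twice?",
    "why was my card  charged twice ",
    "How do I reset my password?",
]
print(near_duplicate_pairs(prompts))
```

For agent traces that duplicate a failure with different phrasing or tool calls, string similarity alone is not enough; embedding-based clustering is a common next step, but the principle is the same: dedupe on normalized content, not raw text equality.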
Frequently Asked Questions
What is data cleaning in ML?
Data cleaning in ML means finding, fixing, or removing inaccurate, duplicated, missing, mislabeled, or malformed records before data is used for training, evaluation, retrieval, or monitoring.
How is data cleaning different from data quality?
Data cleaning is the work done to repair bad records. Data quality is the measurable state of the dataset after checks such as completeness, consistency, duplication, and label correctness.
How do you measure data cleaning?
In FutureAGI, store rows in fi.datasets.Dataset and run evaluators such as FieldCompleteness, TypeCompliance, and GroundTruthMatch. Track duplicate-rate, null-rate, and eval-fail-rate-by-cohort before and after cleaning.