What Is Preprocessing?
The transformation of raw inputs into model-ready format through normalization, tokenization, encoding, chunking, or feature engineering.
Preprocessing is the set of transformations applied to raw inputs before they enter a model — text, images, audio, or tabular features. For LLM and RAG stacks, this includes tokenization, normalization, deduplication, PII redaction, chunking, embedding generation, and prompt formatting. For classical ML it includes scaling, one-hot encoding, missing-value imputation, and feature selection. Preprocessing is the silent contract between data and model: the same input under two different preprocessing configs produces two different predictions, and most production drift starts here, not at the model weights themselves.
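On the classical-ML side, that contract is often written down as an explicit pipeline object. A minimal scikit-learn sketch, using hypothetical column names, might look like this:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns: "revenue" and "headcount" are numeric, "region" is categorical.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["revenue", "headcount"]),
    ("cat", categorical, ["region"]),
])
# Fitting this transformer on training data and reusing the same fitted object
# at serving time is what keeps the data-model contract intact.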
Why It Matters in Production LLM and Agent Systems
Preprocessing failures rarely throw exceptions. They corrupt outputs quietly. A RAG pipeline that switched from RecursiveCharacterTextSplitter chunk size 500 to 1024 last week now retrieves chunks that exceed the context window of the reranker — recall drops 11%, but the eval suite was running on a frozen Dataset from before the change, so nothing flagged. A tokenization mismatch between training and serving (a fine-tuned model trained with a custom tokenizer, served with the base tokenizer) silently degrades every response. A normalization step that lowercases case-sensitive identifiers (function names, SKU codes) before embedding turns retrieval into a coin flip on technical queries.
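A cheap guardrail against the train/serve tokenizer mismatch is to compare token ids from both tokenizers on a small probe set before deploying. A sketch with Hugging Face transformers, where the model identifiers are placeholders for your fine-tuned and serving tokenizers:

from transformers import AutoTokenizer

# Placeholder identifiers: swap in the tokenizer the model was fine-tuned with
# and the one the serving stack actually loads.
train_tok = AutoTokenizer.from_pretrained("org/finetuned-model")
serve_tok = AutoTokenizer.from_pretrained("org/base-model")

probes = ["def get_SKU_42(ids):", "Q3 revenue was $42M.", "naïve café résumé"]
for text in probes:
    if train_tok.encode(text) != serve_tok.encode(text):
        print(f"tokenizer mismatch on: {text!r}")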
The pain is distributed. ML engineers see eval scores degrade with no model change. Product teams see “the chatbot suddenly forgets common questions.” SREs see no infrastructure regression — preprocessing runs on CPU and looks healthy. Compliance leads find PII in production traces because a redaction step ran on the user message but not on a tool output.
In the multi-agent stacks of 2026, preprocessing happens in more places: at ingestion into a KnowledgeBase, at retrieval-time chunk selection, at tool-output normalization, and at prompt-template compilation. Every layer has its own preprocessing config, and an inconsistency at any layer poisons the trace. This is why FutureAGI treats preprocessing artifacts — chunks, embeddings, prompts — as first-class objects to evaluate.
How FutureAGI Handles Preprocessing Artifacts
FutureAGI does not run your preprocessing pipeline. We evaluate the artifacts it produces. The connection is concrete on three surfaces.
Chunking outputs. A team ingests a 10K-document corpus into a fi.kb.KnowledgeBase with a chunk size of 800 tokens and 100-token overlap. FutureAGI’s ChunkAttribution evaluator scores how much of each retrieved chunk contributed to the final answer; ChunkUtilization scores how many of the retrieved chunks were used. A ChunkUtilization below 0.4 across a cohort tells you preprocessing produced too many low-relevance chunks per query — the engineer reduces overlap or switches to semantic chunking and re-evaluates against the same Dataset.
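The re-chunking step itself stays in your pipeline. A sketch of the overlap change described above, using LangChain's RecursiveCharacterTextSplitter with a tiktoken-based length and a hypothetical source file:

from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("corpus/10k_report.txt").read()  # hypothetical corpus document

# Current config: 800-token chunks with 100-token overlap.
splitter_v1 = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=800, chunk_overlap=100
)
# Candidate config: same chunk size, reduced overlap.
splitter_v2 = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=800, chunk_overlap=20
)
chunks_v1 = splitter_v1.split_text(document_text)
chunks_v2 = splitter_v2.split_text(document_text)
# Re-ingest chunks_v2, then re-run ChunkUtilization against the same Dataset
# before promoting the new config.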
Tokenization and prompt format. When Prompt.commit() versions a template, FutureAGI tracks the rendered token count alongside the prompt id. Spikes in llm.token_count.prompt between two prompt versions reveal a preprocessing change (e.g. extra system-prompt prefix) that the engineer can roll back.
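FutureAGI records llm.token_count.prompt on the trace; the same delta can be spot-checked locally between two rendered prompt versions with tiktoken. The rendered strings below are hypothetical stand-ins for two committed templates:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical rendered templates for two committed prompt versions.
prompt_v1 = "You are a support assistant.\n\nQuestion: {q}"
prompt_v2 = "You are a support assistant. Always cite sources.\n\nPolicy: ...\n\nQuestion: {q}"

delta = len(enc.encode(prompt_v2)) - len(enc.encode(prompt_v1))
print(f"prompt token delta between versions: {delta:+d}")
# A sustained positive delta across traffic is the signature of silent template growth.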
PII redaction. fi.evals.PII runs against llm.input.messages and tool.output to flag whether a redaction preprocessing step actually stripped sensitive fields. If PII returns a positive score on supposedly-redacted input, the upstream redactor missed a pattern.
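As an upstream complement to fi.evals.PII, a unit-level spot check on the redactor's output can catch obvious misses before anything reaches a trace. A minimal sketch with illustrative patterns only, not a complete PII taxonomy:

import re

# Illustrative patterns; a production redactor needs a far broader set.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def leftover_pii(redacted_text: str) -> list[str]:
    """Return the names of any PII patterns still present after redaction."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(redacted_text)]

# Run against both the redacted user message and redacted tool outputs.
assert leftover_pii("Contact: [REDACTED_EMAIL], phone [REDACTED_PHONE]") == []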
FutureAGI’s approach is to separate the preprocessing logic (yours) from the preprocessing audit (ours). Unlike answer-centric suites such as Ragas, the FutureAGI eval stack scores every artifact in the path — chunks, prompts, embeddings, redacted inputs — so a preprocessing regression is visible at the layer where it happened.
How to Measure or Detect Preprocessing Issues
Track preprocessing through outcome metrics on the artifacts it produces:
- ChunkAttribution / ChunkUtilization: return 0–1 scores on retrieved chunks; flag chunking-config drift.
- Token-count delta per prompt version: llm.token_count.prompt plotted by prompt.id reveals silent template growth.
- PII evaluator: scores remaining PII in supposedly-redacted strings; surfaces missed redaction patterns.
- Embedding-similarity stability: re-embed a fixed eval set after a preprocessing change; cosine drift > 0.05 against the prior embedding indicates a tokenizer or normalization shift (a minimal drift check is sketched after the code example below).
- Eval fail rate by corpus version: tag every run with the preprocessing config hash; regression eval surfaces which corpus version broke retrieval.
A minimal point check runs the ChunkUtilization evaluator directly on one query, its answer, and the retrieved context:

from fi.evals import ChunkUtilization

# Score how much of the retrieved context the answer actually used.
util = ChunkUtilization()
result = util.evaluate(
    input="What was Q3 revenue?",
    output="Q3 revenue was $42M.",
    context=["Q3 revenue: $42M.", "Marketing spend Q3: $5M.", "Headcount: 240."]
)
# A persistently low score across a cohort points at the chunking config, not the model.
print(result.score, result.reason)
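The embedding-similarity stability check from the list above can be scripted in a few lines. In this sketch, embed is a stand-in for whatever embedding client your pipeline uses, and old_vectors are the embeddings of a fixed eval set captured before the preprocessing change:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_drift(eval_texts, old_vectors, embed, threshold=0.05):
    """Re-embed a fixed eval set and return the texts whose cosine drift
    against the pre-change vectors exceeds the threshold."""
    return [
        text
        for text, old in zip(eval_texts, old_vectors)
        if 1.0 - cosine(np.asarray(embed(text)), np.asarray(old)) > threshold
    ]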
Common Mistakes
- Changing chunk size without re-running golden-dataset eval. A single config change invalidates every prior eval result.
- Mixing tokenizers between training, fine-tuning, and serving. Fine-tuned weights expect the tokenizer they were trained with; using a different one silently mangles attention patterns.
- Redacting only the user message. PII in retrieved documents, tool outputs, and model responses bypasses input-side redaction.
- Treating preprocessing as a one-time setup. Production data drifts; preprocessing configs need versioning and re-eval like any other dependency (a minimal config-hash sketch follows this list).
- Lowercasing case-sensitive identifiers. Code symbols, SKU codes, and acronyms become ambiguous after normalization, hurting both retrieval and generation.
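One way to make the versioning habit concrete is to hash the preprocessing config and attach that hash to every ingestion run and eval run. A minimal sketch with hypothetical config fields:

import hashlib
import json

# Hypothetical config; include every knob that changes the produced artifacts.
preprocess_config = {
    "chunk_size": 800,
    "chunk_overlap": 100,
    "tokenizer": "cl100k_base",
    "lowercase": False,
    "pii_redaction": "regex_v2",
}

config_hash = hashlib.sha256(
    json.dumps(preprocess_config, sort_keys=True).encode()
).hexdigest()[:12]
# Tag every ingestion run and eval run with this hash so a regression can be
# traced back to the exact preprocessing config that produced the corpus.
print(config_hash)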
Frequently Asked Questions
What is preprocessing in machine learning and LLM workflows?
Preprocessing is the transformation of raw inputs into model-ready data — covering tokenization, normalization, chunking, scaling, encoding, and redaction — before training, retrieval, or inference.
How is preprocessing different from feature engineering?
Preprocessing covers generic input cleaning and formatting that any model needs (tokenization, scaling, deduplication). Feature engineering is the task-specific construction of new signals (ratios, interactions, embeddings) on top of cleaned inputs.
How does FutureAGI handle preprocessing artifacts?
FutureAGI does not run preprocessing pipelines. We evaluate downstream outputs through fi.evals — for example ChunkAttribution and ChunkUtilization score whether your chunking config produced retrievable, useful units.