What Is a Feature?
An input variable or representation a model uses for prediction, including tabular columns, tokens, embeddings, retrieved context, and tool outputs.
A feature is the input variable or representation a machine learning model uses to make a prediction. In tabular systems, features are encoded columns such as age, country, or account balance; in NLP, they are tokens, embeddings, or learned activations. In production LLM and agent systems, the feature-like inputs are prompts, retrieved context, tool outputs, user metadata, and memory that shape the next model call. FutureAGI treats those inputs as traceable reliability signals because bad or shifted features usually become bad predictions.
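To make the parallel concrete, here is a minimal sketch of the feature-like inputs that condition a single LLM call. The record shape and field names are illustrative, not a FutureAGI schema:

```python
from dataclasses import dataclass, field

@dataclass
class LLMCallFeatures:
    """Illustrative record of the feature-like inputs to one model call."""
    system_prompt: str
    user_query: str
    retrieved_chunks: list[str] = field(default_factory=list)  # RAG context
    tool_outputs: list[str] = field(default_factory=list)      # upstream tool results
    user_metadata: dict = field(default_factory=dict)          # e.g. locale, plan tier
    memory: list[str] = field(default_factory=list)            # conversational history
```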
Why Features Matter in Production LLM and Agent Systems
Most production model failures trace back to features, not to the model itself. A categorical encoder seeing a new category. A numerical feature whose distribution shifted because of a frontend change. An embedding whose underlying model was upgraded without the vector store being re-embedded. A retrieval pipeline returning the wrong chunks. The model is doing exactly what it was trained to do; the features changed.
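As a minimal sketch of the first failure mode (function name and threshold are illustrative, not a FutureAGI API), a serving-time guard can flag categories the training set never saw before they reach a fitted encoder:

```python
def unseen_category_rate(train_values: set[str], batch: list[str]) -> float:
    """Fraction of a production batch falling outside the training vocabulary."""
    if not batch:
        return 0.0
    unseen = [v for v in batch if v not in train_values]
    return len(unseen) / len(batch)

train_countries = {"US", "DE", "FR", "JP"}
batch = ["US", "DE", "BR", "BR", "JP"]  # "BR" never appeared in training
rate = unseen_category_rate(train_countries, batch)
if rate > 0.05:  # illustrative alert threshold
    print(f"categorical novelty alert: {rate:.0%} unseen values")
```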
The pain falls on different roles. ML engineers debug a 5-point accuracy drop and trace it to a feature pipeline that started emitting NaNs after an upstream schema change. SREs see latency spikes when a feature lookup hits a cold cache. Data scientists discover a feature they thought was important is actually leakage. Compliance teams reading the EU AI Act’s data-governance article are asked to document the provenance of every feature and find the lineage was never logged.
In 2026-era stacks the surface broadens. An LLM agent’s “features” are the system prompt, the user query, retrieved chunks from a vector store, tool outputs, and conversational memory. Each is a separate failure surface. A RAG hallucination is often a feature problem: the retriever returned the wrong chunks. A tool-call error is often a feature problem: the prompt did not include the tool schema. FutureAGI’s trace surface captures standard span attributes such as `llm.token_count.prompt` and `gen_ai.request.model`, plus retrieved-chunk and tool-output fields, so feature-level debugging stays available even when the “features” are passed through prompt strings.
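A minimal sketch of how such attributes can be attached with the OpenTelemetry Python API. The first two attribute names follow the conventions cited above; the tracer name, values, and remaining keys are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-service")

with tracer.start_as_current_span("llm.call") as span:
    # Feature-like inputs recorded as span attributes for later debugging.
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("llm.token_count.prompt", 1432)
    span.set_attribute("retrieval.documents.count", 5)  # illustrative key
    span.set_attribute("tool.output.status", "ok")      # illustrative key
```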
How FutureAGI Measures and Debugs Features
FutureAGI’s approach is to make features observable at every layer where they matter. For tabular models wrapped through `Dataset.add_evaluation`, every feature column is logged with its production distribution; drift checks against the training distribution surface as `eval-fail-rate-by-cohort` segments. For LLM workflows, the traceAI llamaindex integration captures prompts, retrieved chunks, and tool outputs as structured span data, so each feature-like input is queryable in the trace dashboard.
A concrete workflow: a RAG team running on traceAI llamaindex sees Faithfulness drop for a specific user cohort. The team pivots into tracing, filters to that cohort, and inspects the retrieved chunks, the features the LLM actually saw. They find the retriever is returning a stale boilerplate document for queries containing a specific phrase. They run `ChunkAttribution` to confirm the stale chunk is being cited even though it does not contain the answer, then add a metadata filter to the retriever (see the sketch below). A regression run with `RAGFaithfulness` confirms recovery. Unlike a standalone Ragas faithfulness score, the trace keeps the input chunks, cohort, and downstream eval result in the same incident view.

For a tabular fraud-detection model, the same workflow compares production feature distributions with training baselines: categorical novelty triggers an alert when production data starts containing values the training set never saw. Agent Command Center can also use feature metadata in a routing policy, such as cost-optimized routing that sends simple feature patterns to a smaller model and reserves the heavyweight model for hard cases. We’ve found that “the model is broken” is wrong four times out of five: the features changed and nobody noticed.
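As a sketch of the fix step, assuming an existing LlamaIndex `VectorStoreIndex` named `index` (the metadata key and value are illustrative), the retriever can be constrained so the stale boilerplate document is excluded:

```python
from llama_index.core.vector_stores import (
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

# Exclude the stale boilerplate source; key/value are illustrative metadata.
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="doc_type",
            value="boilerplate",
            operator=FilterOperator.NE,  # "not equal"
        )
    ]
)

# `index` is a VectorStoreIndex built elsewhere in the pipeline.
retriever = index.as_retriever(similarity_top_k=5, filters=filters)
```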
How to Measure or Detect Features
Feature health is measured through distribution and lineage signals:
- `EmbeddingSimilarity`: production-time versus reference-time embedding similarity for textual features.
- `ChunkAttribution`: which retrieved chunks, the RAG features, actually drove the response.
- Per-feature distribution drift: PSI or KL divergence per column versus the training baseline (see the PSI sketch after this list).
- Categorical novelty rate: proportion of unseen categories per release.
- Trace attributes: `llm.token_count.prompt`, `gen_ai.request.model`, source service, version, and timestamp per feature-like input.
- Null-rate alerts: a per-feature null-rate spike often precedes an accuracy drop.
- `eval-fail-rate-by-cohort` segmentation: failures grouped by feature pattern surface broken pipelines.
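A minimal sketch of the population stability index (PSI) per feature, using numpy with illustrative bin counts and samples (not a FutureAGI API):

```python
import numpy as np

def psi(train: np.ndarray, prod: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between a training and a production sample."""
    # Shared bin edges derived from the training distribution.
    edges = np.histogram_bin_edges(train, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    p, _ = np.histogram(train, bins=edges)
    q, _ = np.histogram(prod, bins=edges)
    # Normalize to proportions; eps smooths empty bins (sketch-level smoothing).
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)  # training baseline
prod = rng.normal(0.4, 1.0, 10_000)   # shifted production sample
print(f"PSI = {psi(train, prod):.3f}")  # > 0.2 is a common "significant drift" cutoff
```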
Both evaluators from the signal list can be run directly. Placeholders stand in for real inputs; exact argument names and the returned result object depend on the fi.evals SDK version:

```python
from fi.evals import EmbeddingSimilarity, ChunkAttribution

sim = EmbeddingSimilarity()
attr = ChunkAttribution()

# Similarity between a textual feature now and its reference-time form.
print(sim.evaluate(input="...", output="...").score)

# Which retrieved chunks (the context) actually drove the response.
print(attr.evaluate(input="...", output="...", context=[...]).score)
```
Common mistakes
- Treating features as static. Production distributions shift after UI, provider, retrieval, or policy changes; monitor drift per feature, not only aggregate model score.
- No lineage on derived features. A computed feature whose upstream dependency changed is invisible unless the source service, version, and transform are logged.
- Mixing training and serving preprocessing. Train and serve must use the same transformation code path (see the sketch after this list); offline parity checks cannot fix divergent production encoders.
- For LLMs, ignoring retrieved context as a feature. A RAG hallucination is usually caused by bad retrieval inputs, not mysterious model behavior.
- Logging only model output. Without feature and context logs, incident review can explain that the answer was wrong but not why it happened.
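A minimal sketch of the shared-code-path fix for the third mistake (module, function, and feature names are illustrative): define the transform once and import it in both the training job and the serving endpoint, so an encoder change cannot diverge silently.

```python
# features.py: single source of truth for the transform, imported by both
# the training job and the serving endpoint.
import math

def transform(raw: dict) -> dict:
    """Encode one raw record into model features; illustrative feature names."""
    return {
        "log_balance": math.log1p(max(raw["account_balance"], 0.0)),
        "country": raw.get("country", "UNKNOWN"),  # shared fallback category
        "age_bucket": min(raw["age"] // 10, 9),    # same binning in train and serve
    }

# train.py:    X = [transform(r) for r in training_records]
# serving.py:  x = transform(request_payload)
```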
Frequently Asked Questions
What are features in machine learning?
Features are input variables or representations a model uses to make predictions. In LLM and agent systems, prompts, retrieved context, tool outputs, user metadata, and memory act like features because they condition the next model call.
How are features different from feature engineering?
Features are the actual signals the model receives. Feature engineering is the work of creating, transforming, selecting, and validating those signals before training or inference.
What is the LLM equivalent of features?
The LLM equivalent is the prompt, retrieved context, tool outputs, user metadata, and memory. FutureAGI uses `ChunkAttribution`, `EmbeddingSimilarity`, and trace attributes such as `llm.token_count.prompt` to inspect those inputs.