Models

What Is an AI/ML Experiment?

A tracked run of a model, prompt, or pipeline against a fixed dataset and evaluator suite, used to compare variants.

An AI/ML experiment is a tracked run of a model, prompt, retriever, agent, or full pipeline against a fixed dataset and evaluator suite. It belongs to model operations because it records the configuration that changed, the inputs used, the outputs produced, and the scores returned. In FutureAGI, an experiment is usually a versioned Dataset run with Dataset.add_evaluation() results attached, so two prompt, model, or RAG variants can be compared row by row.

Why AI/ML experiments matter in production LLM and agent systems

Without experiments, every change to an AI system is a leap of faith. A prompt edit “feels better”; a model swap “looks good in the demo”; a new retriever “seems to retrieve the right docs.” None of those are decisions you can audit or roll back. The pain falls across roles. An ML engineer tries three prompt variations and ships the one that looked best against a memory of yesterday’s outputs. A product manager green-lights a model swap because two cherry-picked examples improved. A compliance team is asked, “what evaluation did this release pass?” and has no run-versioned answer.

Common production symptoms include: regressions that ship because the change passed a five-example demo and failed nothing measurable; “improvements” that win on engineer-curated prompts and lose on real production traffic; A/B tests that conclude nothing because the noise floor of LLM evals is higher than the gap between candidates.

Unlike a product A/B test, which measures live user allocation after a release, an AI/ML experiment should gate the candidate before it reaches users.

In 2026-era stacks, the situation is worse because pipelines have many moving parts. A “single change” can touch a prompt, a model, a retriever, a guardrail, and a tool definition. Without experiment tracking, attributing the regression, or the win, to the right change is impossible. Multi-step agentic systems amplify this: regressions often surface at step three of a five-step trajectory, not in the final answer.

How FutureAGI handles AI/ML experiments

FutureAGI’s approach is to make every change a tracked experiment by default. The Dataset is the eval substrate: a versioned collection of inputs and (optionally) ground-truth outputs that every experiment runs against. Dataset.add_evaluation() attaches one or more evaluators (TaskCompletion, Groundedness, AnswerRelevancy, FactualConsistency) to the run and stores scores per row. Configuration is logged: the prompt template (versioned through fi.prompt.Prompt), the model id (gen_ai.request.model), retrieval parameters, and tool definitions land alongside the eval results so a future engineer can reproduce or roll back. A regression eval runs every release candidate against the canonical golden dataset, so a candidate that wins on the experimental cohort but loses on the canonical set is rejected.
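
As a rough mental model, each experiment lands as a record like the sketch below; the field names and values are illustrative, not the FutureAGI schema.

# Illustrative record of what one experiment run stores; the field names and
# values below are hypothetical, not the FutureAGI schema.
experiment = {
    "dataset": {"name": "support-rag-golden", "version": "v3"},   # pinned eval substrate
    "config": {
        "prompt_version": "answer_prompt@v7",                      # versioned prompt template
        "gen_ai.request.model": "gpt-4o-mini",                     # model id
        "retrieval": {"top_k": 8, "chunk_size": 512},              # retrieval parameters
    },
    "scores": {                                                    # per-row evaluator scores
        "Groundedness": [0.91, 0.84, 0.77],
        "AnswerRelevancy": [0.88, 0.90, 0.81],
    },
}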

A practical pattern: a RAG team has three experiments queued — a chunking-strategy change, a reranker swap, and a prompt rewrite. They run each as a separate Dataset.add_evaluation() run with Groundedness, ContextRelevance, and AnswerRelevancy. The results show the reranker swap wins on ContextRelevance (+0.12) but loses on Groundedness (-0.08) because reranked chunks include longer, harder-to-cite passages. The prompt rewrite wins on Groundedness (+0.07) without regressions. The team ships the prompt rewrite through Agent Command Center, leaves the reranker as a follow-up experiment, and stores both runs as audit evidence. Unlike yesterday’s spot-check, each experiment is a recorded, comparable artifact.
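
A decision rule over mean evaluator scores makes that call explicit. The sketch below mirrors the numbers in the example; the baseline values, thresholds, and ship() helper are hypothetical, not part of the fi SDK.

# Hypothetical decision rule over mean evaluator scores; the baseline numbers,
# thresholds, and ship() helper are illustrative, not a FutureAGI API.
baseline = {"Groundedness": 0.78, "ContextRelevance": 0.70, "AnswerRelevancy": 0.82}
reranker = {"Groundedness": 0.70, "ContextRelevance": 0.82, "AnswerRelevancy": 0.82}
rewrite  = {"Groundedness": 0.85, "ContextRelevance": 0.70, "AnswerRelevancy": 0.83}

def ship(candidate, baseline, min_gain=0.02, max_regression=0.0):
    deltas = [candidate[m] - baseline[m] for m in baseline]
    # Ship only if at least one metric clearly improves and none regresses.
    return max(deltas) >= min_gain and min(deltas) >= -max_regression

print(ship(reranker, baseline))  # False: ContextRelevance wins but Groundedness regresses
print(ship(rewrite, baseline))   # True: Groundedness wins with no regressions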

How to measure or detect experiment quality

Experiments are the measurement vehicle; the signals you compare are the evaluator outputs:

  • TaskCompletion: returns 0–1; the canonical end-task signal for agent or task experiments.
  • Groundedness: returns 0–1 for context-anchored answers; canonical for RAG experiments.
  • AnswerRelevancy: returns 0–1 for query-response alignment; canonical for chat experiments.
  • Per-cohort delta (dashboard signal): the change in evaluator score between two experiments, sliced by cohort (sketched after the Python below); a wins-some-loses-some experiment is rarely worth shipping.
  • Regression-against-golden delta: the change versus the canonical golden dataset — the gating signal for a release decision.

Minimal Python:

from statistics import mean

from fi.evals import TaskCompletion, Groundedness

task = TaskCompletion()
ground = Groundedness()

# prompt_v1..prompt_v3, dataset, and run() are placeholders for your prompt
# variants, the pinned Dataset version, and the generation loop.
for variant in [prompt_v1, prompt_v2, prompt_v3]:
    outputs = run(variant, dataset)  # one generated output per dataset row
    print(variant.name,
          mean(task.evaluate(o).score for o in outputs),
          mean(ground.evaluate(o).score for o in outputs))
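
To make the per-cohort delta concrete, a small sketch; the cohort labels and scores below are made up, and the slicing is shown in plain Python rather than through the dashboard.

from collections import defaultdict

# Made-up (cohort, baseline_score, candidate_score) triples, one per dataset row.
rows = [
    ("billing", 0.82, 0.90),
    ("billing", 0.79, 0.85),
    ("refunds", 0.88, 0.75),
    ("refunds", 0.90, 0.77),
]

totals = defaultdict(lambda: [0.0, 0.0, 0])
for cohort, base, cand in rows:
    totals[cohort][0] += base
    totals[cohort][1] += cand
    totals[cohort][2] += 1

for cohort, (base_sum, cand_sum, n) in totals.items():
    print(f"{cohort}: delta {cand_sum / n - base_sum / n:+.2f}")
# billing improves (+0.07) while refunds regresses (-0.13): a wins-some-loses-some
# result that should hold the candidate rather than ship on the aggregate number.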

Common mistakes

  • Comparing experiments on different datasets. Eval scores are only meaningful against the same input rows; pin the Dataset version.
  • Skipping ground-truth labels for tasks that have them. Reference-free metrics are useful but reference-based metrics give cleaner signal when labels exist.
  • One run per experiment. LLM evals are noisy; run at least three seeds and report the mean as the headline number (sketched after this list).
  • Cherry-picking the demo prompt. A single example is an anecdote; ship on cohort-aggregate evidence.
  • No regression eval against the canonical set. A win on the experiment dataset that loses on canonical is a regression in disguise.
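
A minimal sketch of the multi-seed point; the per-seed scores below are invented for illustration.

from statistics import mean, stdev

# Invented headline scores from three seeded runs of the same experiment.
seed_scores = {11: 0.81, 23: 0.77, 42: 0.84}

headline = mean(seed_scores.values())
spread = stdev(seed_scores.values())
print(f"TaskCompletion: {headline:.2f} ± {spread:.2f} over {len(seed_scores)} seeds")
# If the gap between two candidates is smaller than this spread, the comparison
# sits inside the noise floor and should not gate a release.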

Frequently Asked Questions

What is an AI/ML experiment?

An AI/ML experiment is a tracked run of a model, prompt, retriever, or full pipeline against a fixed dataset and evaluator suite, with inputs, outputs, parameters, and scores logged so teams can compare runs.

How is an experiment different from a deployment?

An experiment is an offline or shadow run for comparison and gating; a deployment is the live serving of a chosen experiment's configuration. Most experiments do not deploy; the chosen one becomes the live system.

How do you run AI/ML experiments in FutureAGI?

FutureAGI's Dataset and fi.evals stack runs each experiment against the same versioned dataset, attaches evaluators like TaskCompletion and Groundedness, and stores comparable scores for regression-aware release decisions.