What Is CI/CD for Machine Learning?
The practice of automating testing, evaluation, and deployment of machine learning and LLM systems on every change to code, prompts, data, or model weights.
What Is CI/CD for Machine Learning?
CI/CD for machine learning is the practice of running automated tests, evaluation suites, and deployment pipelines on every change to ML or LLM code, prompts, datasets, or weights. It extends classical CI/CD — git push triggers tests, then deploy — with ML-specific gates: data validation, training reproducibility checks, eval regression scoring, model-card updates, and progressive rollouts via shadow or canary deployments. The point is to catch silent quality regressions before users hit them, because ML systems can pass syntax checks while quietly losing accuracy or fairness.
Why It Matters in Production LLM and Agent Systems
Without ML CI/CD, the only feedback loop is user complaints — and most users do not complain. A prompt change that drops AnswerRelevancy by six points can ship to production with a green build, because pytest does not know about answer relevancy. The cost of discovering that drop weeks later in a quarterly business review is far higher than the cost of catching it on the PR.
The pain shows up across roles. An ML engineer fine-tunes a router LLM on a new dataset, accuracy on the validation slice ticks up, they merge — and a long-tail intent cohort that wasn’t represented in the validation slice loses 19 points of TaskCompletion. A platform engineer pushes a refactor that swaps claude-3-5-sonnet for claude-sonnet-4; the API call works, the JSON parses, but the new model’s verbose tone breaks a downstream regex. A compliance lead is asked to certify “no regression in fairness metrics across the last 12 deploys” and has no audit trail.
In 2026-era agent stacks, the surface area is bigger. A single PR can change a prompt template, a tool schema, a retriever chunker, and the model version — each one a potential silent regression. ML CI/CD makes those regressions blocking, not informational.
How FutureAGI Handles CI/CD for Machine Learning
FutureAGI’s approach is to make evaluation a first-class CI/CD primitive. The fi.evals package is callable from any pipeline runner — GitHub Actions, GitLab CI, Jenkins, Argo — and writes results to a versioned Dataset. A typical PR-gated eval looks like this: pull the latest Dataset (your golden test set), run the candidate model or prompt over it, attach evaluators (Groundedness, AnswerRelevancy, TaskCompletion, custom rubrics), aggregate the scores, fail the build if any metric drops more than the configured threshold versus the production baseline.
Concretely: a RAG team gates every prompt-template PR on a CI job that runs Dataset.add_evaluation(Groundedness()) against a 500-row golden set. The job emits an artifact with score deltas; if Groundedness drops more than two points or AnswerRelevancy drops more than three, the GitHub status check is red and the PR cannot merge. Once green, the deploy enters a traffic-mirroring phase via the Agent Command Center — the new prompt sees 5% of production traffic in shadow mode, with the same evaluators running online. If shadow metrics hold, the rollout proceeds to canary; if they regress, the gateway falls back to the previous prompt automatically.
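A minimal sketch of that gate, reusing the fi.evals calls shown under How to Measure or Detect It below (Dataset.get, add_evaluation, run_evaluations, and the per-evaluator .mean scores) and assuming AnswerRelevancy can be imported alongside Groundedness; the eval_baseline.json file and the CANDIDATE_MODEL environment variable are illustrative pipeline conventions, not FutureAGI APIs:
import json, os, sys
from fi.datasets import Dataset
from fi.evals import AnswerRelevancy, Groundedness

# Production baseline recorded on the last green deploy, e.g.
# {"Groundedness": 0.81, "AnswerRelevancy": 0.77}. Illustrative file, not a FutureAGI artifact.
with open("eval_baseline.json") as f:
    baseline = json.load(f)

dataset = Dataset.get("rag-golden-v9")  # the 500-row golden set
dataset.add_evaluation(Groundedness())
dataset.add_evaluation(AnswerRelevancy())
report = dataset.run_evaluations(model=os.environ["CANDIDATE_MODEL"])

# The two- and three-point thresholds from the example above, expressed as 0-1 deltas.
thresholds = {"Groundedness": -0.02, "AnswerRelevancy": -0.03}
deltas = {name: report[name].mean - baseline[name] for name in thresholds}
failed = {name: round(d, 3) for name, d in deltas.items() if d < thresholds[name]}
if failed:
    print(f"Eval regression versus production baseline: {failed}")
    sys.exit(1)  # red status check; the PR cannot merge
The shadow and canary phases described above then rerun the same evaluators against live traffic rather than the golden set.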
For agent stacks, pair this with agent.trajectory.step traces so the CI artifact shows where in a multi-step trajectory the regression appears, not just that an end-to-end score moved.
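One way to make the artifact show that is to aggregate step-level scores from the candidate and baseline trace exports and report a delta per step name; the list-of-dicts shape below is an illustrative stand-in for however agent.trajectory.step spans are exported, not the actual trace schema:
from collections import defaultdict

def per_step_deltas(candidate_steps, baseline_steps):
    """Average step-level scores and return candidate-minus-baseline per step name.

    Both arguments are lists of {"step": str, "score": float} dicts; the shape is
    assumed for illustration rather than taken from the trace format.
    """
    def averages(steps):
        sums, counts = defaultdict(float), defaultdict(int)
        for s in steps:
            sums[s["step"]] += s["score"]
            counts[s["step"]] += 1
        return {name: sums[name] / counts[name] for name in sums}

    cand, base = averages(candidate_steps), averages(baseline_steps)
    return {name: cand[name] - base.get(name, 0.0) for name in cand}

# The CI artifact can then say that "retrieve" held steady while "synthesize_answer"
# dropped, instead of reporting only an end-to-end score delta.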
How to Measure or Detect It
ML CI/CD health is measured by both pipeline metrics and downstream evaluation:
- Eval-suite regression delta: per-evaluator score change versus the production baseline. Threshold and alert.
- Groundedness/TaskCompletion: the typical PR-gate evaluators for RAG and agent systems.
- Build-time eval latency: how long the eval suite takes per PR; cap it or developers stop running it.
- Shadow-deploy fail rate: percentage of canary or shadow rollouts that regress production metrics; if rising, your offline eval is too lenient.
- Coverage of golden dataset: does it represent every cohort that hits production? A passing eval on a stale dataset is theatre (a quick coverage check is sketched at the end of this section).
A minimal PR-gate script using these evaluators looks like this (the score floors are illustrative, and the candidate model is assumed to be injected by the CI runner):
import os
from fi.evals import Groundedness, TaskCompletion
from fi.datasets import Dataset
# Candidate model or prompt variant under review, supplied by the CI pipeline.
candidate_model = os.environ["CANDIDATE_MODEL"]
# Golden test set with the PR-gate evaluators attached.
dataset = Dataset.get("rag-golden-v9")
dataset.add_evaluation(Groundedness())
dataset.add_evaluation(TaskCompletion())
# Run the suite; a failed assert fails the build.
report = dataset.run_evaluations(model=candidate_model)
assert report["Groundedness"].mean >= 0.78, "Groundedness regression"
assert report["TaskCompletion"].mean >= 0.75, "TaskCompletion regression"
Common Mistakes
- Treating CI/CD as deploy-only. Skipping the eval-gate at PR time means regressions land in main and propagate through every downstream branch.
- Using a stale golden dataset. Production traffic shifts weekly; rotate fresh production traces into the dataset on a schedule.
- Failing builds on a single noisy metric. LLM-as-a-judge scores have variance; require thresholds across at least two metrics or use a bootstrap confidence interval (see the sketch after this list).
- Skipping shadow deploys for “small” prompt changes. Small changes are exactly the ones that pass offline eval and break in production.
- Letting the same model serve as judge and candidate. Self-evaluation inflates pass rates; pin the judge to a different model family.
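As a sketch of that bootstrap check: resample the per-row judge scores to get a confidence interval on the mean, and only treat the change as a regression when the interval sits clearly below the production baseline. The per-row scores and the 0.78 baseline below are illustrative values, not outputs of any specific evaluator:
import random

def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    # Percentile bootstrap confidence interval for the mean of per-row judge scores.
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(scores) for _ in scores) / len(scores)
        for _ in range(n_resamples)
    )
    return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples)]

candidate_scores = [0.74, 0.81, 0.69, 0.77, 0.80]  # per-row judge scores for the PR
low, high = bootstrap_mean_ci(candidate_scores)
# Only block the merge if even the optimistic end of the interval sits below the
# baseline, i.e. the drop survives judge-score noise rather than one noisy run.
is_regression = high < 0.78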
Frequently Asked Questions
What is CI/CD for machine learning?
It is the practice of running automated tests, evaluation suites, and deployment pipelines on every change to ML code, prompts, datasets, or weights — the ML extension of classical software CI/CD.
How is ML CI/CD different from software CI/CD?
Software CI/CD checks deterministic correctness with unit tests; ML CI/CD adds non-deterministic gates — eval-suite regression scores, drift checks, fairness metrics, and shadow deploys — because ML systems can pass syntax tests while silently regressing on quality.
How do you implement CI/CD for an LLM application?
Wire an evaluation framework like FutureAGI's fi.evals into your pull-request pipeline. On every PR, run regression evals against a Dataset, fail the build if any metric drops past its threshold, and gate the canary rollout on the same metrics in production.