What Is Model Distillation?

A training technique where a smaller student model learns to imitate a larger teacher model's behavior, preserving most of the teacher's quality at much lower inference cost.

Model distillation is a training technique where a smaller student model is trained to imitate a larger teacher model. Instead of learning from raw labels, the student is trained on (input, teacher-output) pairs — and in soft-label distillation, on the teacher’s full token-level probability distribution. The objective is to compress most of the teacher’s task-specific quality into a model that is cheaper and faster to serve. In LLM infrastructure, distillation is how teams take GPT-4-class behavior and fit it into a 7B-13B open-weight model that runs on a single GPU, often with 5-20× lower per-token cost.
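
As a rough sketch of that soft-label objective, assuming a PyTorch training loop where the teacher's and the student's logits for the same tokens are already available (the function and variable names are illustrative, not from any particular library):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Logits are assumed flattened to (num_tokens, vocab_size); labels to (num_tokens,).
    # Soft-label term: KL divergence between the temperature-softened teacher
    # and student token distributions.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_preds = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_preds, soft_targets, reduction="batchmean") * temperature ** 2
    # Hard-label term: ordinary cross-entropy against the reference tokens.
    ce = F.cross_entropy(student_logits, labels)
    # alpha controls how much weight the teacher's distribution gets.
    return alpha * kd + (1 - alpha) * ce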

Why It Matters in Production LLM and Agent Systems

Inference cost dominates the LLM bill once a product is past prototype. A 70B-class teacher model at frontier-provider pricing can eat the unit economics of a chatbot at scale; a 7B distilled student running on vLLM or Ollama infrastructure can drop cost-per-trace by an order of magnitude. Distillation is the technique that makes that swap viable without a quality cliff.

The pain hits when teams skip the trade-off measurement. An ML engineer distills a model, sees the global eval score drop two points, and ships, only to discover the regression is concentrated on the highest-value cohort: enterprise prompts in a regulated domain. A platform engineer cuts inference cost by 80% but loses 12% of TaskCompletion rate on agent flows, breaking SLAs. A product manager adopts a third-party “distilled GPT-4” model that was never evaluated against their own domain traffic.

In 2026-era stacks, distillation is the workhorse for the ladder of model sizes you actually deploy: a frontier teacher for the hardest 5% of traffic, a distilled mid-size for 80%, and a smaller model for cached or simple paths. Routing across that ladder via a gateway is straightforward; the engineering risk is making sure each rung was distilled and validated on representative production data, not on a generic distillation corpus that doesn’t match your traffic.
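
One way that ladder can look at the gateway, sketched with hypothetical model names and a hypothetical difficulty score supplied by an upstream classifier (neither is a FutureAGI or gateway API):

def route(request: dict) -> str:
    # Cached or trivially simple paths go to the smallest, cheapest model.
    if request.get("cache_hit") or request.get("difficulty", 0.0) < 0.2:
        return "small-3b-model"
    # The hardest few percent of traffic stays on the frontier teacher.
    if request.get("difficulty", 0.0) > 0.95:
        return "frontier-teacher"
    # Everything else lands on the distilled mid-size student.
    return "llama-3.1-8b-distilled"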

How FutureAGI Handles Model Distillation

FutureAGI does not run distillation training itself; that lives in your training stack on top of Hugging Face, vLLM, or a managed service. FutureAGI sits at the layer where the distilled student model meets your eval contract and your production traces. The workflow is: register the teacher and the student as named models in the Agent Command Center, route a controlled cohort of traffic to each, and run the same evaluator suite (Groundedness, TaskCompletion, FactualAccuracy, JSONValidation) against both via Dataset.add_evaluation() on a pinned golden dataset.

FutureAGI’s approach is to make the distillation trade-off explicit and per-cohort. The output of a distillation run is not “score dropped 2 points” but a table: per cohort, per evaluator, the teacher score, the student score, the delta, and the inference-cost ratio. Compared to evaluating distillation in isolation via lm-evaluation-harness, FutureAGI ties the eval to the same traceAI spans your production traffic flows through, so the signal is on real prompts, not benchmark prompts.

Concretely: a team distills claude-3-5-sonnet behavior into a llama-3.1-8b student via supervised fine-tuning on 50K teacher-generated traces. They route 5% of traffic to the student via weighted routing, run Groundedness and TaskCompletion online via traceAI sampling, and gate the full rollout on (a) per-cohort regression under 3% and (b) inference-cost ratio under 0.15. The Agent Command Center handles the routing; fi.evals handles the gate.
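
A plain-Python sketch of that gate, assuming the per-cohort scores and dollars-per-trace numbers have already been pulled from the eval run (all values below are illustrative):

teacher_scores = {"enterprise-regulated": 0.91, "consumer-chat": 0.88, "agent-flows": 0.86}
student_scores = {"enterprise-regulated": 0.89, "consumer-chat": 0.87, "agent-flows": 0.84}
teacher_cost_per_trace = 0.042  # dollars per trace, illustrative
student_cost_per_trace = 0.005

# Relative regression of the student against the teacher, per cohort.
regression = {
    cohort: (teacher_scores[cohort] - student_scores[cohort]) / teacher_scores[cohort]
    for cohort in teacher_scores
}
cost_ratio = student_cost_per_trace / teacher_cost_per_trace

# Gate the full rollout on both conditions from the plan above.
ship = max(regression.values()) < 0.03 and cost_ratio < 0.15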

How to Measure or Detect It

A distillation rollout is measured along two axes — quality and cost — and you need both:

  • Per-cohort regression delta: the change in evaluator score on each labelled cohort, student vs. teacher; alert when any cohort regresses more than the global mean.
  • fi.evals.Groundedness and fi.evals.TaskCompletion on the student against the same golden dataset the teacher was scored on.
  • Inference-cost ratio: dollars-per-trace for the student divided by dollars-per-trace for the teacher; the headline cost win.
  • Latency p50 and p99: distilled models running on smaller hardware often have very different latency curves; measure both (a sketch of the cost and latency measurements follows this list).
  • Token-distribution divergence (research signal): KL divergence between student and teacher logits on a held-out set — useful during training, less useful in production.
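
A sketch of how the cost and latency numbers can be pulled out of raw trace records, using illustrative token prices and a hypothetical list of trace dicts rather than any real export format:

import statistics

# Hypothetical per-trace records for the distilled student.
traces = [
    {"input_tokens": 850, "output_tokens": 320, "latency_ms": 410},
    {"input_tokens": 1200, "output_tokens": 510, "latency_ms": 620},
    {"input_tokens": 640, "output_tokens": 180, "latency_ms": 355},
]
# Illustrative amortised self-hosted prices, dollars per 1K tokens.
price_in, price_out = 0.0004, 0.0008

cost_per_trace = statistics.mean(
    t["input_tokens"] / 1000 * price_in + t["output_tokens"] / 1000 * price_out
    for t in traces
)
latencies = [t["latency_ms"] for t in traces]
p50 = statistics.quantiles(latencies, n=100)[49]
p99 = statistics.quantiles(latencies, n=100)[98]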

Minimal Python:

from fi.evals import Groundedness, TaskCompletion
from fi.datasets import Dataset

# Pinned golden dataset: the same one the teacher was scored against.
ds = Dataset.from_id("rag-golden-v4")

# Score the distilled student with the same evaluator suite used for the teacher.
ds.add_evaluation(Groundedness(), model="llama-3.1-8b-distilled")
ds.add_evaluation(TaskCompletion(), model="llama-3.1-8b-distilled")

Common Mistakes

  • Distilling on a generic corpus and shipping to a domain. The student inherits only the behaviors covered in the distillation data; if your traffic is medical and the corpus was open web, expect regressions on jargon.
  • Trusting global score, ignoring cohorts. A 1% global drop can be a 20% drop on the highest-value 5% of users. Slice scores per cohort.
  • Skipping the inference-cost measurement. “It’s smaller” is not a cost claim; measure dollars-per-trace including the new hardware’s amortised cost.
  • Distilling once, never re-distilling. When the teacher is updated or production traffic shifts, the student goes stale. Re-distill on rolling production samples.
  • Confusing distillation with quantisation. Quantisation reduces precision of an existing model; distillation trains a new one. Different costs, different risks.

Frequently Asked Questions

What is model distillation?

Model distillation is a training technique where a smaller student model is trained on the outputs (and sometimes token distributions) of a larger teacher model, so the student matches most of the teacher's quality at far lower inference cost.

How is model distillation different from fine-tuning?

Fine-tuning trains a model on labeled task data. Distillation trains a smaller model on a larger model's outputs as the supervision signal — the teacher generates the labels, so the student inherits the teacher's behavior, not just task-specific patterns.

How do you measure whether distillation worked?

Run the same evaluator suite (Groundedness, TaskCompletion, FactualAccuracy) on student and teacher against a versioned golden dataset, and report the per-cohort delta plus the inference-cost ratio. FutureAGI persists both for the trade-off audit.