What Is Model Distillation?
A training technique where a smaller student model learns to imitate a larger teacher model's behavior, preserving quality at lower inference cost.
Model distillation is a training technique where a smaller student model is trained to imitate a larger teacher model. Instead of learning from raw labels, the student is trained on (input, teacher-output) pairs and, in soft-label distillation, on the teacher’s full token-level probability distribution. The objective is to compress most of the teacher’s task-specific quality into a model that is cheaper and faster to serve. In LLM infrastructure, distillation is how teams in 2026 take frontier-class behavior (Claude Opus 4.7, GPT-5.x, Gemini 3 Ultra) and fit it into a 7B-13B open-weight model that runs on a single GPU, often with 5-20× lower per-token inference cost.
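To make the soft-label objective concrete, here is a minimal sketch of the commonly used temperature-scaled KL loss between the teacher's and the student's token distributions at a single position. It uses only the standard library; the logits, the temperature of 2.0, and the function names are illustrative assumptions rather than part of any particular training stack.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) for one token position, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

# Hypothetical logits over a 4-token vocabulary.
teacher = [4.0, 2.0, 0.5, -1.0]
student = [3.0, 2.5, 0.0, -0.5]
loss = soft_label_loss(teacher, student)
print(f"soft-label loss: {loss:.4f}")
```

In a real training loop this loss would be averaged over every token position and usually mixed with a hard-label cross-entropy term; hard-label distillation keeps only that second term, computed against the teacher's sampled output.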
Why It Matters in Production LLM and Agent Systems
Inference cost dominates the LLM bill once a product is past prototype. A 70B-class teacher model at frontier-provider pricing can eat the unit economics of a chatbot at scale; a 7B distilled student running on vLLM or Ollama infrastructure can drop cost-per-trace by an order of magnitude. Distillation is the technique that makes that swap viable without a quality cliff.
The pain hits when teams skip the trade-off measurement. An ML engineer distills a model, sees the global eval score drop two points, and ships, only to discover the regression is concentrated on the highest-value cohort: enterprise prompts in a regulated domain. A platform engineer reduces inference cost by 80% but loses 12% of TaskCompletion rate on agent flows, breaking SLAs. A product manager picks a “distilled” model from a third party with no eval against their domain.
In 2026-era stacks, distillation is the workhorse for the ladder of model sizes you actually deploy: a frontier teacher (Claude Opus 4.7, GPT-5.x) for the hardest 5% of traffic, a distilled mid-size (Llama 4 70B, Gemini 3 Flash) for 80%, and a smaller model for cached or simple paths. Routing across that ladder via an LLM gateway is straightforward; the engineering risk is making sure each rung was distilled and validated on representative production data, not on a generic distillation corpus that doesn’t match your traffic.
How FutureAGI Handles Model Distillation
FutureAGI does not run distillation training itself; that lives in your training stack on top of Hugging Face, vLLM, or a managed service. FutureAGI sits at the layer where the distilled student model meets your eval contract and your production traces. The workflow is: register the teacher and the student as named models in Agent Command Center, route a controlled cohort of traffic to each, and run the same evaluator suite (Groundedness, TaskCompletion, FactualAccuracy, JSONValidation) against both via Dataset.add_evaluation() on a pinned golden dataset. The gateway lives at /platform/monitor/command-center and emits the routing trace alongside the eval result.
FutureAGI’s approach is to make the distillation trade-off explicit and per-cohort. The output of a distillation run is not “score dropped 2 points” but a scorecard: per cohort, per evaluator, the teacher score, the student score, the delta, and the inference-cost ratio.
| metric | teacher (Opus 4.7) | student (Llama 4 8B distilled) | delta |
|---|---|---|---|
| Groundedness (policy QA) | 0.91 | 0.88 | -0.03 |
| TaskCompletion (refund flow) | 0.84 | 0.79 | -0.05 |
| JSONValidation pass rate | 0.99 | 0.97 | -0.02 |
| p99 latency (ms) | 2,800 | 540 | -2,260 |
| cost / 1K trace | $4.20 | $0.31 | -$3.89 |
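The deltas and the inference-cost ratio in a scorecard like the one above are simple to derive once teacher and student scores sit side by side. The sketch below recomputes them from the table's own numbers; the row layout and the `build_scorecard` helper are hypothetical and do not represent FutureAGI's output format.

```python
# Values copied from the scorecard table above: (metric, teacher, student).
rows = [
    ("Groundedness (policy QA)", 0.91, 0.88),
    ("TaskCompletion (refund flow)", 0.84, 0.79),
    ("JSONValidation pass rate", 0.99, 0.97),
]

def build_scorecard(rows):
    """Return one dict per metric with the student-minus-teacher delta."""
    return [
        {"metric": m, "teacher": t, "student": s, "delta": round(s - t, 2)}
        for m, t, s in rows
    ]

scorecard = build_scorecard(rows)

# Inference-cost ratio: student cost per 1K traces over teacher cost per 1K traces.
cost_ratio = round(0.31 / 4.20, 3)

for row in scorecard:
    print(f'{row["metric"]:<30} {row["delta"]:+.2f}')
print(f"cost ratio: {cost_ratio}")  # 0.074
```

A cost ratio of roughly 0.074 means the student serves the same traffic for about 7% of the teacher's cost, which is the number to weigh against each delta.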
Compared to evaluating distillation in isolation via lm-evaluation-harness, FutureAGI ties the eval to the same traceAI spans your production traffic flows through, so the signal is on real prompts, not benchmark prompts.
Concretely: a team distills Claude Sonnet 4.6 behavior into a Llama 4 8B student via supervised fine-tuning on 50K teacher-generated traces. They route 5% of traffic to the student via weighted routing, run Groundedness and TaskCompletion online via traceAI sampling, and gate the full rollout on (a) per-cohort regression under 3% and (b) inference-cost ratio under 0.15. The Agent Command Center handles the routing; fi.evals handles the gate.
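As an illustration of what a 5% weighted split can look like, the sketch below assigns each trace to the student or the teacher by hashing its ID, so the same trace always lands on the same model. This is one common way to do deterministic weighted routing, not a description of how Agent Command Center implements it; the `route` function and the trace IDs are hypothetical.

```python
import hashlib

def route(trace_id: str, student_weight: float = 0.05) -> str:
    """Deterministically send roughly `student_weight` of traces to the student."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "student" if bucket < student_weight else "teacher"

# Hypothetical trace IDs, used only to check the observed split.
ids = [f"trace-{i}" for i in range(10_000)]
share = sum(route(t) == "student" for t in ids) / len(ids)
print(f"student share: {share:.1%}")
```

Hash-based assignment keeps a retried or replayed trace on the same model, which makes teacher-versus-student comparisons on the canary cohort easier to interpret than purely random sampling.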
How to Measure or Detect It
For public-benchmark calibration, the standard 2026 distillation evaluation suite combines MMLU-Pro (14K multi-domain questions, harder MMLU successor), GPQA Diamond (198 expert-validated graduate-level science questions), and MATH-500. Distilled 7B-13B students typically retain 70–85% of their frontier teacher’s MMLU-Pro accuracy and 50–70% of GPQA Diamond, which is useful as a sanity floor but never as the release gate; your cohort-level production deltas are.
A distillation rollout is measured along two axes, quality and cost, and you need both:
- Per-cohort regression delta: the change in evaluator score on each labelled cohort, student vs. teacher; alert when any cohort regresses more than the global mean.
- fi.evals.Groundedness and fi.evals.TaskCompletion on the student against the same golden dataset the teacher was scored on.
- Inference-cost ratio: dollars-per-trace for student divided by dollars-per-trace for teacher; the headline cost win.
- Latency p50 and p99: distilled models running on smaller hardware often have very different latency curves; measure both.
- Token-distribution divergence (research signal): KL divergence between student and teacher logits on a held-out set; useful during training, less useful in production.
Minimal Python:
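```python
from fi.evals import Groundedness, TaskCompletion
from fi.datasets import Dataset

# Load the golden dataset that the teacher was already scored on.
ds = Dataset.from_id("rag-golden-v4")

# Score the distilled student with the same evaluators for a like-for-like comparison.
ds.add_evaluation(Groundedness(), model="llama-3.1-8b-distilled")
ds.add_evaluation(TaskCompletion(), model="llama-3.1-8b-distilled")
```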
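The per-cohort alert rule from the list above (flag any cohort that regresses more than the global mean) can be sketched in plain Python once per-row scores are exported. The row format, cohort names, and scores below are made up for illustration, and the global mean is assumed to be the unweighted mean over all rows.

```python
from collections import defaultdict
from statistics import mean

def cohorts_to_alert(rows):
    """rows: iterable of (cohort, teacher_score, student_score).

    Returns the cohorts whose mean delta is worse than the global mean delta.
    """
    per_cohort = defaultdict(list)
    for cohort, teacher, student in rows:
        per_cohort[cohort].append(student - teacher)
    global_delta = mean(d for deltas in per_cohort.values() for d in deltas)
    return sorted(c for c, deltas in per_cohort.items() if mean(deltas) < global_delta)

# Hypothetical per-row scores for three labelled cohorts.
rows = [
    ("enterprise", 0.90, 0.78), ("enterprise", 0.92, 0.80),
    ("smb", 0.88, 0.87), ("smb", 0.90, 0.90),
    ("consumer", 0.85, 0.85), ("consumer", 0.86, 0.84),
]
alerts = cohorts_to_alert(rows)
print(alerts)  # ['enterprise']
```

Here the global mean delta is about -0.045, which looks tolerable, while the enterprise cohort alone sits at -0.12.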
Distillation rollout playbook
A working rollout in 2026 follows a short sequence. First, freeze a baseline: 200-500 golden rows that the teacher already scores above the release threshold, with cohort labels for the highest-value 5% of traffic. Second, train the student against those teacher outputs: soft-label distillation when the provider exposes logprobs, hard-label otherwise. Third, run the same fi.evals suite on both teacher and student, segmented by cohort. Fourth, roll out a 5% canary through Agent Command Center weighted routing, with traceAI sampling on. Fifth, set a hard release gate: no global regression over 3%, no high-value cohort regression over 1%, and an inference-cost ratio below the target.
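The fifth step can be expressed as a small, explicit check. The sketch below is a hypothetical gate function, not the fi.evals API: it assumes regressions are expressed as positive fractions (0.02 means a 2% drop) and borrows 0.15 as the cost-ratio target from the earlier example, since the playbook itself leaves the target open.

```python
def release_gate(global_regression, cohort_regressions, high_value_cohorts,
                 cost_ratio, max_global=0.03, max_high_value=0.01,
                 max_cost_ratio=0.15):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    if global_regression > max_global:
        failures.append(
            f"global regression {global_regression:.1%} exceeds {max_global:.1%}")
    for cohort in sorted(high_value_cohorts):
        # A missing cohort raises KeyError on purpose: unscored means not shippable.
        drop = cohort_regressions[cohort]
        if drop > max_high_value:
            failures.append(
                f"high-value cohort '{cohort}' regressed {drop:.1%} "
                f"(limit {max_high_value:.1%})")
    if cost_ratio >= max_cost_ratio:
        failures.append(
            f"cost ratio {cost_ratio:.3f} is not below {max_cost_ratio:.3f}")
    return failures

# Hypothetical canary results: fine globally, but the enterprise cohort slipped.
failures = release_gate(
    global_regression=0.02,
    cohort_regressions={"enterprise": 0.015, "smb": 0.0},
    high_value_cohorts={"enterprise"},
    cost_ratio=0.074,
)
print(failures or "gate passed")
```

Returning reasons rather than a bare boolean makes the gate's decision auditable when a rollout is blocked.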
The trap most teams fall into is treating distillation as a one-time event. The teacher updates (Anthropic ships Opus 4.8; OpenAI rotates GPT-5.1 weights), the production traffic distribution shifts, and the student goes stale. We re-distill on rolling 60-day production samples and re-score the eval cohort on every cycle. Compared to a Hugging Face hub workflow that stops at “trained the student,” the FutureAGI loop closes by feeding live traces back into the next training corpus.
Common Mistakes
- Distilling on a generic corpus and shipping to a domain. The student inherits only the behaviors covered in the distillation data; if your traffic is medical and the corpus was open web, expect regressions on jargon.
- Trusting global score, ignoring cohorts. A 1% global drop can be a 20% drop on the highest-value 5% of users. Slice scores per cohort.
- Skipping the inference-cost measurement. “It’s smaller” is not a cost claim; measure dollars-per-trace including the new hardware’s amortised cost.
- Distilling once, never re-distilling. When the teacher is updated or production traffic shifts, the student goes stale. Re-distill on rolling production samples.
- Confusing distillation with quantisation. Quantisation reduces precision of an existing model; distillation trains a new one. Different costs, different risks.
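The arithmetic behind the global-versus-cohort mistake is worth seeing once. Assuming the global score is a traffic-weighted average of cohort scores (an assumption for this sketch), a 20% drop confined to a cohort that makes up 5% of traffic shows up as only a 1% global drop:

```python
def global_drop(cohort_share, cohort_drop, rest_drop=0.0):
    """Traffic-weighted global drop when one cohort regresses and the rest does not."""
    return cohort_share * cohort_drop + (1 - cohort_share) * rest_drop

drop = global_drop(cohort_share=0.05, cohort_drop=0.20)
print(f"global drop: {drop:.1%}")  # global drop: 1.0%
```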
Frequently Asked Questions
What is model distillation?
Model distillation is a training technique where a smaller student model is trained on the outputs (and sometimes token distributions) of a larger teacher model, so the student matches most of the teacher's quality at far lower inference cost.
How is model distillation different from fine-tuning?
Fine-tuning trains a model on labeled task data. Distillation trains a smaller model on a larger model's outputs as the supervision signal: the teacher generates the labels, so the student inherits the teacher's behavior, not just task-specific patterns.
How do you measure whether distillation worked?
Run the same evaluator suite (Groundedness, TaskCompletion, FactualAccuracy) on student and teacher against a versioned golden dataset, and report the per-cohort delta plus the inference-cost ratio. FutureAGI persists both for the trade-off audit.