What Is Distilling Large Language Models (LLMs)?

Distilling large language models (LLM distillation) is the practice of training a smaller student model to imitate a larger, more expensive teacher model. The student matches one or more of: the teacher’s output logits (Hinton-style soft-target distillation), the teacher’s generated text (sequence-level distillation), or its chain-of-thought traces (reasoning distillation, common in 2026 with open-weight reasoning models). The result is a smaller model — often 10× to 100× smaller — with most of the teacher’s task quality at a fraction of the inference cost. Distillation is a training technique; FutureAGI evaluates the resulting student.
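
To make the logit-matching variant concrete, here is a minimal sketch of the Hinton-style soft-target objective, assuming PyTorch; the function name and temperature default are illustrative, not FutureAGI code:

import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 so gradient magnitudes stay
    # comparable across temperatures (Hinton et al., 2015).
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2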

Why It Matters in Production LLM and Agent Systems

Inference cost is the most-cited reason teams distil LLMs in 2026. A frontier teacher might cost $15 per million input tokens; a well-distilled student covering the same product surface can cost $0.50. For high-volume agents — customer support, search assistants, code review bots — that gap is the difference between a viable feature and a money-losing one.
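
At volume the arithmetic is stark. A back-of-the-envelope sketch using the per-token prices above (the monthly volume is a hypothetical):

# Prices are the per-million-token figures quoted above; the 2B-token
# monthly volume is a hypothetical for illustration.
monthly_tokens = 2_000_000_000
teacher_cost = monthly_tokens / 1_000_000 * 15.00  # $30,000 per month
student_cost = monthly_tokens / 1_000_000 * 0.50   # $1,000 per month
print(f"teacher ${teacher_cost:,.0f}/mo vs student ${student_cost:,.0f}/mo")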

The pain shows up across roles. Finance and platform leads see API bills that scale linearly with usage. ML engineers see distillation runs that look fine on benchmarks but regress on long-tail user prompts. Product managers see worse user experiences when the student silently drops a capability the teacher had. SREs see latency improve while quality regressions sneak through under the noise floor of aggregate evals.

In 2026-era agent stacks the stakes are higher. Distilled students are routed by Agent Command Centers as the default for cheap routes, with the teacher reserved for fallback or hard cases. A regression in the student’s tool-call accuracy or chain-of-thought reasoning quality directly affects every cheap route, and the regression often hides under aggregate scores because the student is fine on common prompts and drops only on the long tail. Cohort-segmented eval is essential.

How FutureAGI Handles Distillation Evaluation

FutureAGI’s surface for distillation is paired-output regression eval. The teacher and student are both run against the same Dataset rows; outputs are stored side-by-side as columns; EmbeddingSimilarity measures the closeness of student to teacher per row. Task-quality evaluators — Faithfulness for RAG, TaskCompletion for agents, FactualAccuracy for QA — are run on both, and the per-cohort delta tells you exactly where the student lost capability.
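
The per-cohort delta is ordinary dataframe arithmetic once both models' scores are exported. A minimal sketch assuming pandas; the column names and toy scores are hypothetical, not a FutureAGI schema:

import pandas as pd

# One row per prompt: the same Dataset row scored for both models.
rows = pd.DataFrame({
    "cohort": ["billing", "billing", "long_tail", "long_tail"],
    "teacher_score": [0.94, 0.91, 0.88, 0.90],
    "student_score": [0.93, 0.92, 0.61, 0.58],
})

# Negative per-cohort deltas show where the student lost capability.
delta = (
    rows.assign(delta=rows.student_score - rows.teacher_score)
        .groupby("cohort")["delta"]
        .mean()
)
print(delta)  # billing holds steady; long_tail regresses sharply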

The Agent Command Center makes this actionable in production. Traffic mirroring sends a percentage of traffic to both the teacher and the student; the responses are evaluated asynchronously and compared. When eval-fail-rate-by-cohort shows the student regressed on a specific cohort, model fallback keeps the teacher in the loop for that route while the team retrains the student. Cost-optimized routing sends easy cases to the student and hard cases (detected via ContextRelevance or ReasoningQuality) to the teacher. Unlike a static benchmark such as MMLU, this is production-shaped eval: the student is judged on the queries it actually serves, not on academic test sets.
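
The routing decision itself reduces to a threshold on a difficulty signal. A minimal sketch, where the difficulty score stands in for a signal like ContextRelevance or ReasoningQuality and the threshold is a tunable assumption, not a FutureAGI default:

def route(difficulty_score: float, threshold: float = 0.7) -> str:
    # Easy cases go to the distilled student; hard cases fall back to
    # the teacher. Tune the threshold against eval-fail-rate-by-cohort.
    return "student" if difficulty_score < threshold else "teacher"

model = route(0.35)  # -> "student"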

How to Measure or Detect It

Useful FutureAGI signals for distilled-student evaluation:

  • EmbeddingSimilarity — student-vs-teacher response similarity per row.
  • Faithfulness — RAG faithfulness comparison across teacher and student.
  • TaskCompletion — agent task-completion comparison.
  • FactualAccuracy — factual correctness comparison.
  • ReasoningQuality — reasoning-quality delta on chain-of-thought tasks.
  • eval-fail-rate-by-cohort — segments where the student regressed.
  • Inference-cost dashboards — confirm the cost win the distillation was supposed to deliver.

Minimal Python:

from fi.evals import EmbeddingSimilarity, Faithfulness

similarity = EmbeddingSimilarity()
faithfulness = Faithfulness()

# Student-vs-teacher closeness per row: the teacher's response is
# passed as the reference context for the similarity check.
result = similarity.evaluate(
    input=user_query,
    output=student_response,
    context=teacher_response,
)

# Same task-quality check on the student's output, scored against the
# row's retrieved context (names hypothetical; same call shape assumed).
student_faith = faithfulness.evaluate(
    input=user_query,
    output=student_response,
    context=retrieved_context,
)

Common Mistakes

  • Benchmark-only evaluation. Public benchmarks miss the long tail your users care about; eval on a production-shaped Dataset.
  • Skipping reasoning distillation for agents. Logits-only distillation loses chain-of-thought quality; for agentic students, distill on reasoning traces.
  • No regression eval per checkpoint. Distillation runs are stochastic; rerun the eval suite per checkpoint to track quality drift (see the sketch after this list).
  • One-shot temperature. Distillation usually wants a non-default soft-target temperature; tune it per task.
  • Ignoring cohort splits. Distilled students often regress unevenly — strong on common prompts, weak on rare ones; segment evals by cohort.
  • Skipping safety evals on the student. Distillation can regress safety behaviour the teacher had; rerun ContentSafety, Toxicity, and BiasDetection.
  • Distilling on synthetic prompts only. Production traffic is the truth set; mix real anonymised traces into the distillation prompt distribution.
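
As referenced in the per-checkpoint bullet above, a minimal sketch of the regression loop; run_eval_suite is a hypothetical stand-in for your eval harness, and the tolerance is a tunable assumption:

def run_eval_suite(checkpoint: str) -> dict[str, float]:
    # Hypothetical stand-in: wire this to your evaluator runs
    # (Faithfulness, TaskCompletion, ...) and return per-cohort means.
    return {"billing": 0.93, "long_tail": 0.71}

baseline = run_eval_suite("teacher")
for step in (1000, 2000, 3000):
    scores = run_eval_suite(f"student-step-{step}")
    for cohort, score in scores.items():
        drift = score - baseline[cohort]
        if drift < -0.05:  # regression tolerance is a tunable assumption
            print(f"step {step}: {cohort} regressed by {-drift:.2f}")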

Frequently Asked Questions

What is LLM distillation?

LLM distillation is the practice of training a smaller student model to imitate a larger teacher model by matching its output logits, generated text, or chain-of-thought traces, producing a smaller, faster, and cheaper model with most of the teacher's quality.

How is distillation different from fine-tuning?

Fine-tuning trains a model on labelled task data. Distillation trains a student to imitate a teacher's outputs across many prompts, with no human labels — the teacher is the label source. Distillation can be combined with fine-tuning.

How do you evaluate a distilled student model?

Run regression evals on a shared Dataset comparing teacher and student outputs with EmbeddingSimilarity, Faithfulness, and TaskCompletion, and segment by cohort.