
Synthetic Data for LLM Fine-Tuning in 2026: Self-Instruct, Constitutional AI, DPO Data, and Function-Calling Traces

Generate synthetic data to fine-tune LLMs in 2026. Self-Instruct, Constitutional AI, DPO/IPO traces, function calling, and how to evaluate dataset quality.


A team needed 50,000 high-quality customer-support conversations to fine-tune an internal agent. The privacy team blocked the use of real tickets. The labeling budget covered 800 examples. A 2024-era team would have shipped a small model that generalized poorly. A 2026 team writes 200 seed conversations, expands them with Self-Instruct against GPT-5, generates DPO preference pairs with a Claude Opus 4.7 judge, runs a faithfulness and instruction-adherence pass over the synthetic rows, and trains a Llama 4.x base on 80,000 quality-filtered rows. The synthetic dataset costs less than the labeling budget would have, covers more topics, and the final model beats the small real-data baseline. This is the 2026 picture of synthetic data for fine-tuning: model-generated training rows, judge-filtered, paired with a small real seed, shipped through Hugging Face TRL or Unsloth into the open-source base of your choice.

TL;DR: synthetic data for LLM fine-tuning in one table

| Workflow | Output | Fine-tune target |
| --- | --- | --- |
| Self-Instruct / Evol-Instruct | (instruction, response) pairs | SFT instruction tuning |
| Constitutional AI | (prompt, safe response) | Safety and refusal tuning |
| DPO / IPO / KTO data | (prompt, chosen, rejected) triples | Preference and alignment |
| Function-calling traces | Tool-call sequences with arguments | Agent and tool-use tuning |
| Synthetic RAG QA | (query, retrieved chunks, answer) | RAG quality, retrieval eval |
| Distillation traces | Teacher reasoning chains | Smaller-model reasoning |

If you only read one row: synthetic data in 2026 is not generic text. It is specifically shaped for the fine-tuning recipe you are running, from SFT to DPO to agent tuning.

What synthetic data for fine-tuning is, precisely

Synthetic data for fine-tuning is model-generated training rows shaped for a specific fine-tuning recipe. It is not random text from a teacher; it is structured to match the loss function and the schema of the fine-tune target.

  • For SFT (supervised fine-tuning), the row is (instruction, response).
  • For DPO and IPO, the row is (prompt, chosen, rejected).
  • For tool-using agent training, the row is a list of messages with assistant tool calls and tool responses.
  • For RAG and retrieval fine-tuning, the row is (query, retrieved chunks, gold answer).

The pipeline is the same shape across recipes: seed prompts, teacher generation, quality filter, format conversion. The seed is small. The teacher is a frontier model. The filter is a judge model with a rubric. The output is JSONL ready for Hugging Face TRL, Unsloth, Axolotl, Llama Factory, or the OpenAI fine-tuning API.
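
A minimal sketch of that shared shape, with placeholder helpers standing in for the recipe-specific pieces each workflow below fills in:

import json

def generate_with_teacher(seed_prompts):
    # Placeholder: call your teacher model here and return candidate rows.
    return [{"instruction": p, "response": "..."} for p in seed_prompts]

def judge_score(row):
    # Placeholder: call your judge model here and return a 0-1 quality score.
    return 1.0

seed_prompts = ["Summarize this support ticket", "Draft a refund-policy reply"]
rows = generate_with_teacher(seed_prompts)              # teacher generation
rows = [r for r in rows if judge_score(r) >= 0.7]       # quality filter
with open("train.jsonl", "w") as f:                     # format conversion to JSONL
    for row in rows:
        f.write(json.dumps(row) + "\n")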

Why synthetic data, not just real data

Real labeled data is the ground truth but it is rarely affordable at the scale modern fine-tuning needs. A 70B-parameter base needs 10,000 to 100,000 task-specific examples to specialize without forgetting. Human labeling at that scale costs tens to hundreds of thousands of dollars per task. Synthetic data closes the gap: a small real seed anchors the distribution, a teacher model expands it by 10 to 100 times, a judge filters the bottom 10 to 20 percent, and the final dataset is shippable.

The trade-off is mode collapse and teacher-bias inheritance. A student trained purely on synthetic data from one teacher tends to mimic the teacher’s quirks: format, refusal style, length distribution. The 2026 mitigation is to mix teachers, include a real-data seed, and run a diversity check on the generated set.

The 2026 synthetic data workflows

Each workflow below is a recipe: input seed, generation prompt, filter, output schema. Pick the one that matches your fine-tune target.

1. Self-Instruct and Evol-Instruct: instruction tuning at scale

Self-Instruct, introduced in Wang et al. 2022, starts from a small seed of human-written instructions and uses an LLM to expand them. Evol-Instruct (the WizardLM paper, Xu et al. 2023) extends Self-Instruct by evolving each seed along complexity axes: deepening, concretization, reasoning, breadth, and complication.

The 2026 variant:

  1. Write 150 to 200 seed instructions across the task types you care about.
  2. Prompt a teacher (GPT-5 or Claude Opus 4.7) to generate 5 to 10 new instructions per seed, with explicit diversity instructions.
  3. Generate a response for each new instruction.
  4. Run a quality judge over the (instruction, response) pair. Discard rows below threshold.
  5. Optionally: run Evol-Instruct evolution on a subset to lift complexity.
  6. Export to JSONL for SFT.

Output schema is one row per instruction-response pair:

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
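
A minimal sketch of steps 2 and 3 with the OpenAI Python SDK; the prompt wording, batch size, and seed list are illustrative, not the original Self-Instruct prompts:

from openai import OpenAI

client = OpenAI()
TEACHER = "gpt-5-2025-08-07"  # the teacher named in this guide; swap in your own

def expand_seed(seed_instruction, n=5):
    # Step 2: ask the teacher for n new, diverse instructions from one seed.
    prompt = (
        f"Here is an example instruction:\n{seed_instruction}\n\n"
        f"Write {n} new instructions of similar difficulty but on different "
        "topics and task types. Return one instruction per line."
    )
    resp = client.chat.completions.create(
        model=TEACHER, messages=[{"role": "user", "content": prompt}]
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def answer(instruction):
    # Step 3: generate a response for each new instruction.
    resp = client.chat.completions.create(
        model=TEACHER, messages=[{"role": "user", "content": instruction}]
    )
    return resp.choices[0].message.content

pairs = []
for seed in ["Explain the refund policy to an upset customer."]:  # your 150-200 seeds
    for new_instruction in expand_seed(seed):
        pairs.append({"messages": [
            {"role": "user", "content": new_instruction},
            {"role": "assistant", "content": answer(new_instruction)},
        ]})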

2. Constitutional AI data: safety and refusal tuning

Constitutional AI, introduced by Anthropic in Bai et al. 2022, generates safety-tuning data by having the model critique and revise its own outputs against a written constitution. The 2026 use of the technique is narrower: generate (prompt, safer response) pairs for the specific harmful or borderline prompts your product faces, not for general safety.

Recipe:

  1. Collect or generate a set of borderline prompts (jailbreak attempts, prompt injections, unsafe-but-ambiguous user asks).
  2. Generate an initial response with the student model.
  3. Prompt the teacher to critique the response against a written constitution.
  4. Prompt the teacher to revise the response based on the critique.
  5. Use the (prompt, revised response) as training data, or use (prompt, initial response, revised response) as a DPO triple where the revised is chosen.
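
A hedged sketch of steps 3 to 5 with the same OpenAI client; the constitution text and prompts are illustrative stand-ins for your own:

from openai import OpenAI

client = OpenAI()
CONSTITUTION = (
    "Do not assist with wrongdoing. Refuse politely, explain why briefly, "
    "and offer a safe alternative when one exists."
)

def critique_and_revise(prompt, initial_response, teacher="gpt-5-2025-08-07"):
    # Step 3: critique the student's response against the written constitution.
    critique = client.chat.completions.create(
        model=teacher,
        messages=[{"role": "user", "content":
            f"Constitution:\n{CONSTITUTION}\n\nUser prompt:\n{prompt}\n\n"
            f"Response:\n{initial_response}\n\nCritique the response against the constitution."}],
    ).choices[0].message.content

    # Step 4: revise the response based on the critique.
    revised = client.chat.completions.create(
        model=teacher,
        messages=[{"role": "user", "content":
            f"Rewrite the response so it satisfies the constitution.\n\n"
            f"User prompt:\n{prompt}\n\nOriginal response:\n{initial_response}\n\n"
            f"Critique:\n{critique}"}],
    ).choices[0].message.content

    # Step 5: usable as an SFT pair (prompt, revised) or a DPO triple (revised = chosen).
    return {"prompt": prompt, "chosen": revised, "rejected": initial_response}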

3. DPO, IPO, and KTO preference data

Direct Preference Optimization (Rafailov et al. 2023), Identity Preference Optimization, and Kahneman-Tversky Optimization are the 2026 default alignment losses for open-source fine-tunes. All three need preference triples or pairs.

Recipe:

  1. For each prompt, generate 2 or more responses. Vary the model, the temperature, or the system prompt to create candidates.
  2. Prompt a judge model with a rubric: helpfulness, safety, faithfulness, conciseness.
  3. The judge picks the chosen response and the rejected response.
  4. Export as (prompt, chosen, rejected) JSONL.
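
A minimal sketch of the loop, with the judge represented as a placeholder callable that returns "a" or "b" (the next snippet shows one way to define such a judge):

from openai import OpenAI

client = OpenAI()

def dpo_row(prompt, judge):
    # Step 1: two candidates from the same teacher at different temperatures.
    candidates = [
        client.chat.completions.create(
            model="gpt-5-2025-08-07",
            messages=[{"role": "user", "content": prompt}],
            temperature=t,
        ).choices[0].message.content
        for t in (0.2, 1.0)
    ]
    # Steps 2-3: the judge callable picks the preferred candidate ("a" or "b").
    verdict = judge(prompt, candidates[0], candidates[1])
    chosen, rejected = candidates if verdict == "a" else candidates[::-1]
    # Step 4: the (prompt, chosen, rejected) row shape TRL-style DPO trainers consume.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}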

The judge rubric is the most-tuned artifact in the recipe. Future AGI’s CustomLLMJudge from fi.evals.metrics is one way to lock the rubric in code.

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

preference_judge = CustomLLMJudge(
    name="dpo_preference_judge",
    grading_criteria=(
        "Compare response_a and response_b for the given prompt. "
        "Pick the one that is more helpful, safe, faithful to any "
        "supplied context, and concise. Output 'a' or 'b'."
    ),
    llm_provider=LiteLLMProvider(model="claude-opus-4-7"),
)

4. Function-calling and tool-use traces for agent fine-tuning

Agent fine-tunes need structured traces, not text pairs. A row is a sequence of messages: user query, assistant tool call, tool response, assistant tool call, tool response, assistant final answer.
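
For illustration, one such row rendered as OpenAI-style chat messages; the exact field names depend on the chat template your trainer applies, so treat the shape as a sketch:

trace = {"messages": [
    {"role": "user", "content": "Where is order 8412?"},
    {"role": "assistant", "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "lookup_order", "arguments": "{\"order_id\": \"8412\"}"},
    }]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": "{\"status\": \"shipped\", \"eta\": \"2026-03-02\"}"},
    {"role": "assistant", "content": "Order 8412 has shipped and should arrive March 2."},
]}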

Recipe:

  1. Define the tool catalog: function names, JSON Schemas for arguments.
  2. Write 50 to 100 seed user queries that exercise the tools.
  3. Prompt a teacher (GPT-5 or Claude Opus 4.7) with the tool catalog to generate full tool-using trajectories.
  4. Validate that every tool call has well-formed arguments against the schema.
  5. Run a task-adherence judge: did the trajectory actually answer the user?
  6. Export as the messages-with-tool-calls JSONL format that TRL and Unsloth accept.

The schema validation step matters. A row with a malformed tool call teaches the student to emit malformed tool calls.
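
A minimal validation pass using the jsonschema package; one argument schema is shown here, where a real pipeline would look the schema up by the called function's name:

import json
from jsonschema import ValidationError, validate

# JSON Schema for one tool's arguments, taken from the tool catalog in step 1.
LOOKUP_ORDER_SCHEMA = {
    "type": "object",
    "properties": {"order_id": {"type": "string"}},
    "required": ["order_id"],
    "additionalProperties": False,
}

def tool_calls_are_valid(trace):
    # Reject the row if any tool call carries malformed or non-JSON arguments.
    for message in trace["messages"]:
        for call in message.get("tool_calls", []):
            try:
                validate(json.loads(call["function"]["arguments"]), LOOKUP_ORDER_SCHEMA)
            except (ValidationError, json.JSONDecodeError):
                return False
    return True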

5. Synthetic RAG QA for retrieval and grounded generation

RAG-tuning needs (query, retrieved chunks, gold answer) triples. The chunks should be real chunks from the production corpus; the queries and answers can be synthetic.

Recipe:

  1. Sample a chunk from the corpus.
  2. Prompt the teacher: given this chunk, generate a user query whose answer is in the chunk.
  3. Prompt the teacher: given the chunk and the query, generate the gold answer.
  4. Add 1 to 3 distractor chunks to the retrieved set so the student learns to ignore irrelevant context.
  5. Run a faithfulness judge on the (query, chunks, answer) row to confirm the answer is supported.
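
A minimal sketch of steps 1 to 4; the prompts are illustrative, and the step-5 faithfulness judge is covered in the evaluation section below:

import random
from openai import OpenAI

client = OpenAI()
TEACHER = "gpt-5-2025-08-07"

def ask(prompt):
    return client.chat.completions.create(
        model=TEACHER, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

def rag_row(corpus_chunks):
    chunk = random.choice(corpus_chunks)                                                   # step 1
    query = ask(f"Write a user question answerable only from this passage:\n{chunk}")      # step 2
    gold = ask(f"Passage:\n{chunk}\n\nQuestion: {query}\nAnswer using only the passage.")  # step 3
    distractors = random.sample([c for c in corpus_chunks if c != chunk], k=2)             # step 4
    retrieved = random.sample([chunk] + distractors, k=3)  # shuffle the gold chunk in with distractors
    return {"query": query, "retrieved_chunks": retrieved, "answer": gold}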

This dataset shape doubles as a retrieval eval set: hold out 5 to 10 percent of the rows and use them to score retrieval recall, faithfulness, and context adherence.

6. Distillation traces: smaller-model reasoning from a frontier teacher

Distillation generates (input, full reasoning trace, output) triples from a frontier teacher and trains a smaller student to match. The 2025-2026 wave of small-but-strong reasoning models (DeepSeek-R1 distillations, Qwen distillations, Llama distillations) was built this way.

Recipe:

  1. Collect reasoning-heavy prompts: math, code, multi-step plans.
  2. Generate a chain-of-thought trace with the teacher.
  3. Optionally filter on whether the final answer is correct against a checker.
  4. Fine-tune a smaller base on (input, trace, output) triples.
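
A minimal sketch of the step-3 correctness filter for math-style traces; the ANSWER: convention and the field names are assumptions for illustration:

candidate_rows = [
    {"input": "What is 17 * 24?",
     "trace": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. ANSWER: 408",
     "reference_answer": 408},
]

def final_answer(trace):
    # Assumes the teacher was prompted to end every trace with "ANSWER: <value>".
    return trace.rsplit("ANSWER:", 1)[-1].strip()

# Step 3: keep a trace only when its final answer matches a trusted reference.
distilled = [r for r in candidate_rows if final_answer(r["trace"]) == str(r["reference_answer"])]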

The trade-off is teacher-distribution dependence: the student inherits the teacher’s reasoning style and its failure modes.

How to evaluate synthetic dataset quality before fine-tuning

The single most expensive mistake in synthetic data work is fine-tuning on an unfiltered set. A 100K-row dataset with 20 percent low-quality rows is worse than an 80K-row filtered set. Four properties to score before training:

| Property | What to check | Metric |
| --- | --- | --- |
| Diversity | Semantic spread across topics and task types | Embedding-cluster coverage |
| Correctness | Response is right against a reference or judge | Factual accuracy, faithfulness |
| Instruction adherence | Response follows the instruction | Task adherence judge |
| Safety | No harmful or off-policy content | Safety judge, classifier filters |
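
One rough way to score the diversity row with embeddings and clustering, assuming sentence-transformers and scikit-learn; the model, cluster count, and entropy-based score are illustrative choices, not a fixed standard:

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_spread(texts, n_clusters=50):
    # Normalized entropy of cluster sizes: near 1.0 means rows spread evenly
    # across topics, near 0.0 means they collapse into a few clusters.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    counts = np.bincount(labels, minlength=n_clusters)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log(probs)).sum() / np.log(n_clusters))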

The 2026 practice has two passes. First, sample 5 to 10 percent of the dataset and score it to estimate whether full-row filtering is worth running. If estimated quality is high, ship the dataset. If a meaningful slice scores below threshold, run a full per-row judge pass and discard the rows below threshold. For a 100K-row dataset, a full-row pass is 100K judge calls at GPT-5 or Claude Opus 4.7 prices, costing on the order of tens to low hundreds of dollars per filter pass.

Future AGI’s ai-evaluation library (Apache 2.0) ships named evaluators for these properties as one-line calls.

from fi.evals import evaluate

# Score each synthetic row for instruction adherence
for row in synthetic_rows:
    result = evaluate(
        "task_adherence",
        output=row["response"],
        input=row["instruction"],
    )
    row["adherence_score"] = result.score

# Keep rows scoring above the threshold
filtered = [r for r in synthetic_rows if r["adherence_score"] >= 0.7]

For faithfulness on RAG-style synthetic rows, the call shape is the same with the retrieved chunks supplied as context.

Tools and libraries that matter for synthetic data in 2026

The synthetic-data tool landscape consolidated in 2025-2026 around a few common patterns. The table below covers tools used in production fine-tuning workflows.

| Tool / library | What it does | Where it fits |
| --- | --- | --- |
| Hugging Face TRL | DPO, IPO, KTO, SFT trainers; reference DPO recipes | Fine-tune step |
| Unsloth | Fast SFT / DPO on consumer GPUs with LoRA / QLoRA | Fine-tune step |
| Axolotl | YAML-configured training across SFT and DPO | Fine-tune step |
| Llama Factory | UI and config-driven fine-tuning across many open bases | Fine-tune step |
| Hugging Face Datasets | Storage, streaming, and curation of synthetic datasets | Data curation |
| Distilabel (Argilla) | DPO and SFT synthetic data pipelines | Data generation |
| Self-Instruct repo (yizhongw) | Original Self-Instruct recipe | Reference workflow |
| OpenAI Evals / lm-evaluation-harness | Benchmark the fine-tuned student | Post-train eval |
| Future AGI ai-evaluation | Faithfulness, task adherence, safety judges as one-line evaluate() | Dataset quality filter |
| Future AGI traceAI (Apache 2.0) | OpenInference spans around generate / judge calls | Pipeline observability |

The fine-tuning step itself is not Future AGI’s niche. The places Future AGI fits are the dataset quality filter, the evaluation of the fine-tuned student, and the runtime observability of the deployed model.

A six-step process for synthetic data fine-tuning in 2026

  1. Define the fine-tune target. Pick SFT, DPO, IPO, KTO, or agent tuning. The choice determines the row schema and the loss function.
  2. Write the seed. 150 to 200 human-written examples covering the task types and edge cases.
  3. Pick the teacher. GPT-5, Claude Opus 4.7, Gemini 3.x for instruction and preference data; Llama 4.x or Mixtral 8x22B for permissive-license needs; Qwen or DeepSeek-V3 for code and math.
  4. Generate and validate. Run the workflow that matches the target. Validate schema (tool-call JSON, response format) at this step.
  5. Filter with a judge. Score each row for instruction adherence, faithfulness, safety, and correctness. Discard the bottom 10 to 20 percent.
  6. Train and evaluate. Run the fine-tune. Hold out 5 to 10 percent of the filtered dataset for an offline regression suite. Re-score with the same evaluators on the trained student.
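
A minimal sketch of the step-6 holdout split; the filename is a placeholder for the filtered JSONL produced in step 5:

import json
import random

with open("train_filtered.jsonl") as f:        # placeholder path: output of step 5
    rows = [json.loads(line) for line in f]

random.seed(0)
random.shuffle(rows)
holdout_size = max(1, len(rows) // 10)         # hold out ~10 percent for the regression suite
holdout, train = rows[:holdout_size], rows[holdout_size:]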

The cycle is iterative. After the first fine-tune, re-score the student’s outputs, find the failure modes, generate targeted synthetic data for those modes, and re-train.

Risks and limitations

  • Teacher bias inheritance. A student trained on one teacher inherits its quirks. Mix teachers when possible.
  • Mode collapse. Pure synthetic training without a real seed tends to collapse on the long tail of real user inputs.
  • Licensing. Frontier model terms-of-service often restrict using outputs to train a competing model. Read the terms before shipping a fine-tune.
  • Quality drift. A judge prompt that worked on the first 10K rows may not catch failure modes in rows 50K to 100K. Re-validate the judge periodically.
  • Distribution mismatch. Synthetic queries from a teacher rarely match real user queries exactly. Plan for a real-user fine-tune pass after the synthetic warm-up.

How Future AGI fits in the synthetic fine-tuning stack

Fine-tuning itself runs in Hugging Face TRL, Unsloth, Axolotl, Llama Factory, or the OpenAI fine-tuning API. Future AGI is not a fine-tuning framework. The places Future AGI fits are upstream and downstream of the training step:

  • Upstream: dataset quality filter. The ai-evaluation library (Apache 2.0) scores synthetic rows for instruction adherence, faithfulness, safety, and task adherence with named one-line evaluate() calls. The cost is a judge call per row; the win is discarding the noisy bottom of the dataset before it pollutes the fine-tune.
  • Upstream: dataset generation observability. traceAI (Apache 2.0) wraps the teacher generation pipeline in OpenInference spans so each generated row is a trace and each judge call is a child span. Failed-judge filtering becomes a span query, not a log scrape.
  • Downstream: student evaluation. The same evaluator templates score the fine-tuned student on a held-out set. Faithfulness on RAG questions, task adherence on agent rollouts, hallucination on free-form generation.
  • Downstream: simulation. fi.simulate.TestRunner runs scripted agent interactions against the fine-tuned student to catch behavioral regressions before deployment.

For runtime traffic from the deployed fine-tune, the Agent Command Center at /platform/monitor/command-center routes calls through the same evaluators as inline guardrails, gating responses that fail a faithfulness or safety check before they reach the user.

Use cases by domain

Healthcare. Synthetic patient-encounter dialogues, de-identified clinical notes, and (query, chunk, answer) triples over a curated medical corpus. The judge rubric includes safety and a no-fabricated-citations rule.

Customer support. Synthetic ticket and conversation pairs covering the long tail of product issues. Function-calling traces for agents that look up orders, run refunds, and update accounts.

Legal. Synthetic clause analysis and Q&A over generated contracts. The judge rubric includes jurisdiction tagging and a no-hallucinated-citations rule.

Finance. Synthetic transaction patterns, fraud-flag explanations, and analyst-summary pairs. The judge rubric weights factual accuracy heavily.

Code. Synthetic (problem, solution) pairs and tool-using traces over real APIs. The judge step often includes a runtime checker that executes the code.

Summary

Synthetic data is the default scaling primitive for LLM fine-tuning in 2026. The workflows that matter are Self-Instruct and Evol-Instruct for SFT, Constitutional AI for safety, DPO and IPO for preference alignment, function-calling traces for agents, RAG QA for retrieval-grounded tuning, and distillation for smaller reasoning models. The recipe is consistent: small real seed, frontier teacher, structured generation, judge filter, JSONL output into TRL or Unsloth. The single most important step is the judge filter; an unfiltered synthetic dataset is worse than a smaller filtered one. Future AGI’s ai-evaluation library (Apache 2.0) supplies the named evaluators for that filter and for the downstream student evaluation, and traceAI (Apache 2.0) covers the pipeline observability.

Frequently asked questions

What is synthetic data for LLM fine-tuning in 2026?
Synthetic data for LLM fine-tuning is model-generated training data that replaces or augments scarce, expensive, or privacy-sensitive human-labeled data. In 2026 the workflows that matter are Self-Instruct and Evol-Instruct for instruction tuning, Constitutional AI for safety and refusal, DPO and IPO preference pairs for alignment, function-calling traces for tool-using agents, and synthetic RAG QA for retrieval evaluation. The output of every workflow is a JSON Lines file that loads directly into the major fine-tuning frameworks: Hugging Face TRL, Unsloth, Axolotl, Llama Factory, and the OpenAI fine-tuning API.
How is synthetic data different from data augmentation?
Data augmentation transforms existing examples: paraphrase, back-translate, swap a synonym, add noise. Synthetic data creates new examples from a seed prompt or a teacher model with no original example to anchor on. Augmentation preserves the underlying labeled distribution; synthetic generation expands or replaces it. In 2026 fine-tuning workflows often combine both: a small seed of human-labeled examples, an augmentation pass for surface variation, and a synthetic pass with a larger teacher model for coverage of new topics and edge cases.
Which teacher model should I use to generate synthetic data?
The 2026 default for instruction and reasoning data is a frontier model: GPT-5 (gpt-5-2025-08-07), Claude Opus 4.7, or Gemini 3.x. For preference data the same models double as judges. For open-source-only pipelines, Llama 4.x and Mixtral 8x22B work for instruction generation; Qwen and DeepSeek-V3 are common for code and math. The licensing rule is unchanged: read the provider's terms-of-service before using outputs to train a competing model. OpenAI's terms restrict training competing models from outputs; many open-source teachers permit it under their license.
What is Self-Instruct and how does it work in 2026?
Self-Instruct is a synthetic instruction-tuning workflow that starts with a small set of seed tasks (often 150 to 200 human-written), uses a teacher LLM to expand them into thousands of new instructions, generates responses for each, filters for quality, and writes the result to a JSONL fine-tuning file. The 2026 variant uses a stronger teacher (GPT-5 or Claude Opus 4.7), a diversity-promoting prompt that asks for novel task types, and a quality judge that re-scores generated responses before they enter the training set. Evol-Instruct is the popular extension that takes a seed instruction and evolves it along five complexity axes.
How do I generate DPO or IPO preference data with synthetic outputs?
Generate two or more responses for each prompt with the same or different models at different temperatures, then have a judge model pick the preferred response. Each row in the DPO dataset is (prompt, chosen, rejected). For IPO and KTO the row structure is similar but the loss function differs. The judge should be a different model or a stronger one than the student to avoid mode collapse, and the rubric should be explicit: helpfulness, safety, faithfulness to context, conciseness. Future AGI's CustomLLMJudge from fi.evals.metrics is one way to formalize the rubric in code.
How do I evaluate synthetic dataset quality before fine-tuning?
Score four properties before training: diversity (semantic spread across topics and task types), correctness (response is right against a reference or a judge), instruction adherence (response follows the instruction), and safety (no harmful or off-policy content). The 2026 practice is to run a judge model over a random 5 to 10 percent sample of the synthetic dataset and discard any row scoring below a threshold. Future AGI's ai-evaluation library (Apache 2.0) ships named evaluators for instruction adherence, faithfulness, and safety as one-line evaluate() calls. The cost is roughly one judge call per row; the win is filtering out the noisy bottom 10 to 20 percent before it pollutes the fine-tune.
Should I use synthetic data instead of real data?
No. The 2026 consensus is to use synthetic data as the bulk and a smaller real-data seed for grounding. A typical recipe is 200 to 2000 human-written seed examples, 10 to 100 times that volume of synthetic generations, and a quality-filtered final dataset. Pure synthetic fine-tuning works for narrow tasks but tends to collapse on the long tail of real user behavior. Pure real data is the ideal but rarely affordable at the scale modern fine-tuning needs.
What changed between 2025 and 2026 in synthetic data for fine-tuning?
Three things. First, teacher models got cheaper and better. A GPT-5 or Claude Opus 4.7 judge pass is now affordable on a 100K-row dataset. Second, the workflow shifted from instruction tuning to preference tuning: DPO, IPO, and KTO are now the default alignment step, and synthetic pipelines now produce (prompt, chosen, rejected) triples instead of just (prompt, response) pairs. Third, agent fine-tuning emerged as a separate workflow: synthetic function-calling and multi-turn tool-use traces are now generated explicitly for tool-using agents, scored by a task-adherence judge, and shipped into TRL or Unsloth fine-tunes.