Synthetic Data for LLM Fine-Tuning in 2026: Self-Instruct, Constitutional AI, DPO Data, and Function-Calling Traces
Generate synthetic data to fine-tune LLMs in 2026. Self-Instruct, Constitutional AI, DPO/IPO traces, function calling, and how to evaluate dataset quality.
A team needed 50,000 high-quality customer-support conversations to fine-tune an internal agent. The privacy team blocked the use of real tickets. The labeling budget covered 800 examples. A 2024-era team would have shipped a small model that generalized poorly. A 2026 team writes 200 seed conversations, expands them with Self-Instruct against GPT-5, generates DPO preference pairs with a Claude Opus 4.7 judge, runs a faithfulness and instruction-adherence pass over the synthetic rows, and trains a Llama 4.x base on 80,000 quality-filtered rows. The synthetic dataset cost less than the labeling budget would have, covered more topics, and the final model beats the small real-data baseline. This is the 2026 picture of synthetic data for fine-tuning: model-generated training rows, judge-filtered, paired with a small real seed, shipped through Hugging Face TRL or Unsloth into the open-source base of your choice.
TL;DR: synthetic data for LLM fine-tuning in one table
| Workflow | Output | Fine-tune target |
|---|---|---|
| Self-Instruct / Evol-Instruct | (instruction, response) pairs | SFT instruction tuning |
| Constitutional AI | (prompt, safe response) | Safety and refusal tuning |
| DPO / IPO / KTO data | (prompt, chosen, rejected) triples | Preference and alignment |
| Function-calling traces | Tool-call sequences with arguments | Agent and tool-use tuning |
| Synthetic RAG QA | (query, retrieved chunks, answer) | RAG quality, retrieval eval |
| Distillation traces | Teacher reasoning chains | Smaller-model reasoning |
If you only read one row: synthetic data in 2026 is not generic text. It is specifically shaped for the fine-tuning recipe you are running, from SFT to DPO to agent tuning.
What synthetic data for fine-tuning is, precisely
Synthetic data for fine-tuning is model-generated training rows shaped for a specific fine-tuning recipe. It is not random text from a teacher; it is structured to match the loss function and the schema of the fine-tune target.
- For SFT (supervised fine-tuning), the row is (instruction, response).
- For DPO and IPO, the row is (prompt, chosen, rejected).
- For tool-using agent training, the row is a list of messages with assistant tool calls and tool responses.
- For RAG and retrieval fine-tuning, the row is (query, retrieved chunks, gold answer).
The pipeline is the same shape across recipes: seed prompts, teacher generation, quality filter, format conversion. The seed is small. The teacher is a frontier model. The filter is a judge model with a rubric. The output is JSONL ready for Hugging Face TRL, Unsloth, Axolotl, Llama Factory, or the OpenAI fine-tuning API.
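The four stages can be sketched as a single function. This is a minimal sketch, not a production pipeline: `call_teacher` and `call_judge` are hypothetical stand-ins for real teacher and judge model clients, stubbed here so the example runs offline.

```python
import json

def generate_rows(seeds, call_teacher, call_judge, threshold=0.7):
    """Seed prompts -> teacher generation -> judge filter -> JSONL rows."""
    rows = []
    for seed in seeds:
        response = call_teacher(seed)        # teacher generation
        score = call_judge(seed, response)   # rubric-based quality score
        if score >= threshold:               # quality filter
            rows.append({"messages": [
                {"role": "user", "content": seed},
                {"role": "assistant", "content": response},
            ]})
    # Format conversion: one JSON object per line, ready for a trainer.
    return "\n".join(json.dumps(r) for r in rows)

# Stub teacher and judge keep the sketch runnable without API calls.
jsonl = generate_rows(
    ["Summarize RFC 2616 in one sentence."],
    call_teacher=lambda s: "HTTP/1.1 defines request/response semantics ...",
    call_judge=lambda s, r: 0.9,
)
```

Swapping the stubs for real clients is the only change needed to move from sketch to pipeline; the row schema stays fixed.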
Why synthetic data, not just real data
Real labeled data is the ground truth, but it is rarely affordable at the scale modern fine-tuning needs. A 70B-parameter base needs 10,000 to 100,000 task-specific examples to specialize without forgetting. Human labeling at that scale costs tens to hundreds of thousands of dollars per task. Synthetic data closes the gap: a small real seed anchors the distribution, a teacher model expands it by 10 to 100 times, a judge filters the bottom 10 to 20 percent, and the final dataset is shippable.
The trade-off is mode collapse and teacher-bias inheritance. A student trained purely on synthetic data from one teacher tends to mimic the teacher’s quirks: format, refusal style, length distribution. The 2026 mitigation is to mix teachers, include a real-data seed, and run a diversity check on the generated set.
The 2026 synthetic data workflows
Each workflow below is a recipe: input seed, generation prompt, filter, output schema. Pick the one that matches your fine-tune target.
1. Self-Instruct and Evol-Instruct: instruction tuning at scale
Self-Instruct, introduced in Wang et al. 2022, starts from a small seed of human-written instructions and uses an LLM to expand them. Evol-Instruct (the WizardLM paper, Xu et al. 2023) extends Self-Instruct by evolving each seed along complexity axes: deepening, concretization, reasoning, breadth, and complication.
The 2026 variant:
- Write 150 to 200 seed instructions across the task types you care about.
- Prompt a teacher (GPT-5 or Claude Opus 4.7) to generate 5 to 10 new instructions per seed, with explicit diversity instructions.
- Generate a response for each new instruction.
- Run a quality judge over the (instruction, response) pair. Discard rows below threshold.
- Optionally: run Evol-Instruct evolution on a subset to lift complexity.
- Export to JSONL for SFT.
Output schema is one row per instruction-response pair:
```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
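The Evol-Instruct evolution step reduces to prompt templates over the complexity axes. The sketch below assumes a hypothetical `call_teacher` client (stubbed so it runs offline); the axis wordings paraphrase the WizardLM idea and are not the paper's exact prompts.

```python
# Evolution prompts, one per complexity axis (paraphrased, not verbatim).
EVOL_OPS = {
    "deepen": "Rewrite the instruction to require deeper domain knowledge.",
    "concretize": "Replace general concepts with specific, concrete ones.",
    "reason": "Rewrite so answering requires explicit multi-step reasoning.",
    "breadth": "Write a new instruction on a related but different topic.",
}

def evol_prompt(instruction, op):
    """Build one evolution prompt for a single complexity axis."""
    return (
        f"{EVOL_OPS[op]}\n\n"
        f"Original instruction:\n{instruction}\n\n"
        "Evolved instruction:"
    )

def evolve(instruction, call_teacher, ops=("deepen", "reason")):
    # One teacher call per evolution axis; call_teacher is hypothetical.
    return [call_teacher(evol_prompt(instruction, op)) for op in ops]

evolved = evolve(
    "Explain binary search.",
    call_teacher=lambda p: "Explain binary search on a rotated sorted array.",
)
```

Each evolved instruction then goes back through the response-generation and judge steps like any other row.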
2. Constitutional AI data: safety and refusal tuning
Constitutional AI, introduced by Anthropic in Bai et al. 2022, generates safety-tuning data by having the model critique and revise its own outputs against a written constitution. The 2026 use of the technique is narrower: generate (prompt, safer response) pairs for the specific harmful or borderline prompts your product faces, not for general safety.
Recipe:
- Collect or generate a set of borderline prompts (jailbreak attempts, prompt injections, unsafe-but-ambiguous user asks).
- Generate an initial response with the student model.
- Prompt the teacher to critique the response against a written constitution.
- Prompt the teacher to revise the response based on the critique.
- Use the (prompt, revised response) as training data, or use (prompt, initial response, revised response) as a DPO triple where the revised is chosen.
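The critique-and-revise loop above can be sketched in a few lines. This is an illustration, not Anthropic's implementation: `student` and `teacher` are hypothetical model clients, stubbed here so the example runs offline.

```python
def constitutional_pair(prompt, student, teacher, constitution):
    """Critique-and-revise loop; returns both an SFT pair and a DPO triple."""
    initial = student(prompt)
    critique = teacher(
        f"Constitution:\n{constitution}\n\nResponse:\n{initial}\n\n"
        "Critique the response against the constitution."
    )
    revised = teacher(
        f"Response:\n{initial}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response to address the critique."
    )
    sft_row = {"prompt": prompt, "response": revised}
    dpo_row = {"prompt": prompt, "chosen": revised, "rejected": initial}
    return sft_row, dpo_row

# Stub clients keep the sketch runnable; a real run uses model calls.
sft, dpo = constitutional_pair(
    "How do I pick a lock?",
    student=lambda p: "Here is how ...",
    teacher=lambda p: (
        "I can't help with bypassing locks you don't own; a locksmith can."
        if "Rewrite" in p
        else "The response ignores the safety principle."
    ),
    constitution="Do not assist with property crimes.",
)
```

Emitting both row shapes from one loop is the practical win: the same generation budget feeds either an SFT run or a DPO run.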
3. DPO, IPO, and KTO preference data
Direct Preference Optimization (Rafailov et al. 2023), Identity Preference Optimization, and Kahneman-Tversky Optimization are the 2026 default alignment losses for open-source fine-tunes. All three need preference triples or pairs.
Recipe:
- For each prompt, generate 2 or more responses. Vary the model, the temperature, or the system prompt to create candidates.
- Prompt a judge model with a rubric: helpfulness, safety, faithfulness, conciseness.
- The judge picks the chosen response and the rejected response.
- Export as (prompt, chosen, rejected) JSONL.
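The candidate-and-judge step reduces to a sort. In this sketch, `candidates` and `judge` are hypothetical stand-ins (a real run varies model, temperature, or system prompt for candidates and uses a rubric-prompted judge); the stubs keep it runnable offline.

```python
import json

def dpo_triples(prompts, candidates, judge):
    """Build (prompt, chosen, rejected) JSONL rows from scored candidates."""
    rows = []
    for prompt in prompts:
        # candidates(prompt) returns 2+ responses; judge returns [0, 1].
        scored = sorted(candidates(prompt), key=lambda r: judge(prompt, r))
        rows.append({"prompt": prompt,
                     "chosen": scored[-1],    # highest-scoring response
                     "rejected": scored[0]})  # lowest-scoring response
    return "\n".join(json.dumps(r) for r in rows)

# Stub candidates and judge so the sketch runs without API calls.
jsonl = dpo_triples(
    ["What is a race condition?"],
    candidates=lambda p: ["idk", "A race condition occurs when ..."],
    judge=lambda p, r: len(r) / 100,  # placeholder score, not a real rubric
)
```

With more than two candidates, the same sort also supports best-vs-worst, best-vs-random, or all-pairs triple construction.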
The judge rubric is the most-tuned artifact in the recipe. Future AGI’s CustomLLMJudge from fi.evals.metrics is one way to lock the rubric in code.
```python
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

preference_judge = CustomLLMJudge(
    name="dpo_preference_judge",
    grading_criteria=(
        "Compare response_a and response_b for the given prompt. "
        "Pick the one that is more helpful, safe, faithful to any "
        "supplied context, and concise. Output 'a' or 'b'."
    ),
    llm_provider=LiteLLMProvider(model="claude-opus-4-7"),
)
```
4. Function-calling and tool-use traces for agent fine-tuning
Agent fine-tunes need structured traces, not text pairs. A row is a sequence of messages: user query, assistant tool call, tool response, assistant tool call, tool response, assistant final answer.
Recipe:
- Define the tool catalog: function names, JSON Schemas for arguments.
- Write 50 to 100 seed user queries that exercise the tools.
- Prompt a teacher (GPT-5 or Claude Opus 4.7) with the tool catalog to generate full tool-using trajectories.
- Validate that every tool call has well-formed arguments against the schema.
- Run a task-adherence judge: did the trajectory actually answer the user?
- Export as the messages-with-tool-calls JSONL format that TRL and Unsloth accept.
The schema validation step matters. A row with a malformed tool call teaches the student to emit malformed tool calls.
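A minimal validation pass can be hand-rolled; the sketch below checks required keys and primitive types against a simplified schema dict. Production code would use a full JSON Schema validator (for example the `jsonschema` package); the `get_order` catalog here is a made-up example.

```python
# Map JSON Schema primitive type names to Python types.
TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def valid_call(call, catalog):
    """True if the tool call names a known tool and its arguments fit."""
    schema = catalog.get(call["name"])
    if schema is None:                           # unknown tool name
        return False
    args = call["arguments"]
    for key in schema.get("required", []):       # required keys present?
        if key not in args:
            return False
    for key, spec in schema.get("properties", {}).items():
        if key in args and not isinstance(args[key], TYPES[spec["type"]]):
            return False                         # wrong argument type
    return True

catalog = {"get_order": {
    "properties": {"order_id": {"type": "string"}},
    "required": ["order_id"],
}}
good = valid_call({"name": "get_order", "arguments": {"order_id": "A-1"}}, catalog)
bad = valid_call({"name": "get_order", "arguments": {"order_id": 42}}, catalog)
```

Rows with any failing call are dropped or regenerated before the judge pass; there is no point spending judge calls on structurally broken traces.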
5. Synthetic RAG QA for retrieval and grounded generation
RAG-tuning needs (query, retrieved chunks, gold answer) triples. The chunks should be real chunks from the production corpus; the queries and answers can be synthetic.
Recipe:
- Sample a chunk from the corpus.
- Prompt the teacher: given this chunk, generate a user query whose answer is in the chunk.
- Prompt the teacher: given the chunk and the query, generate the gold answer.
- Add 1 to 3 distractor chunks to the retrieved set so the student learns to ignore irrelevant context.
- Run a faithfulness judge on the (query, chunks, answer) row to confirm the answer is supported.
This dataset shape doubles as a retrieval eval set: hold out 5 to 10 percent of the rows and use them to score retrieval recall, faithfulness, and context adherence.
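Assembling one row from the recipe above is mostly bookkeeping around the two teacher calls. In this sketch `make_query` and `make_answer` are hypothetical teacher stand-ins, stubbed so the example runs offline; the chunks are placeholder strings where production code would use real corpus chunks.

```python
import random

def rag_row(corpus, make_query, make_answer, n_distractors=2, rng=None):
    """Build one (query, retrieved chunks, gold answer) row."""
    rng = rng or random.Random(0)
    gold = rng.choice(corpus)                   # real chunk the answer lives in
    query = make_query(gold)
    distractors = rng.sample([c for c in corpus if c != gold], n_distractors)
    chunks = [gold] + distractors
    rng.shuffle(chunks)                         # don't always put gold first
    return {"query": query, "chunks": chunks,
            "answer": make_answer(gold, query)}

corpus = ["Chunk about refunds.", "Chunk about shipping.", "Chunk about returns."]
row = rag_row(
    corpus,
    make_query=lambda c: f"Question answerable from: {c}",
    make_answer=lambda c, q: f"Answer grounded in: {c}",
)
```

Shuffling the gold chunk's position matters: a student trained with gold always first learns position, not grounding.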
6. Distillation traces: smaller-model reasoning from a frontier teacher
Distillation generates (input, full reasoning trace, output) triples from a frontier teacher and trains a smaller student to match. The 2025-2026 wave of small-but-strong reasoning models (DeepSeek-R1 distillations, Qwen distillations, Llama distillations) was built this way.
Recipe:
- Collect reasoning-heavy prompts: math, code, multi-step plans.
- Generate a chain-of-thought trace with the teacher.
- Optionally filter on whether the final answer is correct against a checker.
- Fine-tune a smaller base on (input, trace, output) triples.
The trade-off is teacher-distribution dependence: the student inherits the teacher’s reasoning style and its failure modes.
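The optional correctness filter is the highest-leverage step in the recipe, and for checkable domains it needs no judge at all. The sketch below filters traces with a programmatic checker; the `"What is A + B?"` prompt shape is a toy assumption standing in for a real math or code checker.

```python
def keep_trace(trace, checker):
    """Keep a teacher trace only if its final answer passes the checker."""
    return checker(trace["input"], trace["answer"])

def arithmetic_checker(prompt, answer):
    # Toy checker for prompts shaped like "What is A + B?" (an assumption);
    # a real pipeline would use a unit-test runner or symbolic checker.
    a, b = [int(t) for t in
            prompt.removeprefix("What is ").removesuffix("?").split(" + ")]
    return int(answer) == a + b

traces = [
    {"input": "What is 2 + 3?", "reasoning": "2 plus 3 is 5.", "answer": "5"},
    {"input": "What is 2 + 3?", "reasoning": "2 plus 3 is 6.", "answer": "6"},
]
filtered = [t for t in traces if keep_trace(t, arithmetic_checker)]
```

Note the filter checks only the final answer; a trace can reach a correct answer through flawed reasoning, which is one way teacher failure modes survive the filter.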
How to evaluate synthetic dataset quality before fine-tuning
The single most expensive mistake in synthetic data work is fine-tuning on an unfiltered set. A 100K-row dataset with 20 percent low-quality rows is worse than an 80K-row filtered set. Four properties to score before training:
| Property | What to check | Metric |
|---|---|---|
| Diversity | Semantic spread across topics and task types | Embedding-cluster coverage |
| Correctness | Response is right against a reference or judge | Factual accuracy, faithfulness |
| Instruction adherence | Response follows the instruction | Task adherence judge |
| Safety | No harmful or off-policy content | Safety judge, classifier filters |
The 2026 practice has two passes. First, sample 5 to 10 percent of the dataset and score it to estimate whether full-row filtering is worth running. If estimated quality is high, ship the dataset. If a meaningful slice scores below threshold, run a full per-row judge pass and discard the rows below threshold. For a 100K-row dataset, a full-row pass is 100K judge calls at GPT-5 or Claude Opus 4.7 prices, costing on the order of tens to low hundreds of dollars per filter pass.
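The two-pass logic can be sketched as a single function. `score` is a hypothetical per-row judge call (stubbed here with a stored quality field so the example runs offline); the sample is a head slice to keep the sketch deterministic, where production code would sample randomly.

```python
def estimate_then_filter(rows, score, sample_frac=0.05, threshold=0.7,
                         bad_slice=0.1):
    """Pass 1: estimate quality on a sample. Pass 2: full filter if needed."""
    n = max(1, int(len(rows) * sample_frac))
    sample = rows[:n]   # head slice for determinism; sample randomly in prod
    bad_rate = sum(score(r) < threshold for r in sample) / n
    if bad_rate <= bad_slice:
        return rows                                   # estimated clean: ship
    return [r for r in rows if score(r) >= threshold] # full per-row judge pass

# Toy dataset: 20 percent of rows are deliberately low quality.
rows = [{"id": i, "quality": 0.9 if i % 5 else 0.2} for i in range(1000)]
kept = estimate_then_filter(rows, score=lambda r: r["quality"])
```

The economics follow directly: the sample pass costs 5 percent of a full pass, and it only triggers the expensive full pass when the estimated bad slice exceeds the tolerance.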
Future AGI’s ai-evaluation library (Apache 2.0) ships named evaluators for these properties as one-line calls.
```python
from fi.evals import evaluate

# Score each synthetic row for instruction adherence
for row in synthetic_rows:
    result = evaluate(
        "task_adherence",
        output=row["response"],
        input=row["instruction"],
    )
    row["adherence_score"] = result.score

# Keep rows scoring above the threshold
filtered = [r for r in synthetic_rows if r["adherence_score"] >= 0.7]
```
For faithfulness on RAG-style synthetic rows, the call shape is the same with the retrieved chunks supplied as context.
Tools and libraries that matter for synthetic data in 2026
The synthetic-data tool landscape consolidated in 2025-2026 around a few common patterns. The table below covers tools used in production fine-tuning workflows.
| Tool / library | What it does | Where it fits |
|---|---|---|
| Hugging Face TRL | DPO, IPO, KTO, SFT trainers; reference DPO recipes | Fine-tune step |
| Unsloth | Fast SFT / DPO on consumer GPUs with LoRA / QLoRA | Fine-tune step |
| Axolotl | YAML-configured training across SFT and DPO | Fine-tune step |
| Llama Factory | UI and config-driven fine-tuning across many open bases | Fine-tune step |
| Hugging Face Datasets | Storage, streaming, and curation of synthetic datasets | Data curation |
| Distilabel (Argilla) | DPO and SFT synthetic data pipelines | Data generation |
| Self-Instruct repo (yizhongw) | Original Self-Instruct recipe | Reference workflow |
| OpenAI Evals / lm-evaluation-harness | Benchmark the fine-tuned student | Post-train eval |
| Future AGI ai-evaluation | Faithfulness, task adherence, safety judges as one-line evaluate() | Dataset quality filter |
| Future AGI traceAI (Apache 2.0) | OpenInference spans around generate / judge calls | Pipeline observability |
The fine-tuning step itself is not Future AGI’s niche. The places Future AGI fits are the dataset quality filter, the evaluation of the fine-tuned student, and the runtime observability of the deployed model.
A six-step process for synthetic data fine-tuning in 2026
- Define the fine-tune target. Pick SFT, DPO, IPO, KTO, or agent tuning. The choice determines the row schema and the loss function.
- Write the seed. 150 to 200 human-written examples covering the task types and edge cases.
- Pick the teacher. GPT-5, Claude Opus 4.7, Gemini 3.x for instruction and preference data; Llama 4.x or Mixtral 8x22B for permissive-license needs; Qwen or DeepSeek-V3 for code and math.
- Generate and validate. Run the workflow that matches the target. Validate schema (tool-call JSON, response format) at this step.
- Filter with a judge. Score each row for instruction adherence, faithfulness, safety, and correctness. Discard the bottom 10 to 20 percent.
- Train and evaluate. Run the fine-tune. Hold out 5 to 10 percent of the filtered dataset for an offline regression suite. Re-score with the same evaluators on the trained student.
The cycle is iterative. After the first fine-tune, re-score the student’s outputs, find the failure modes, generate targeted synthetic data for those modes, and re-train.
Risks and limitations
- Teacher bias inheritance. A student trained on one teacher inherits its quirks. Mix teachers when possible.
- Mode collapse. Pure synthetic training without a real seed tends to collapse on the long tail of real user inputs.
- Licensing. Frontier model terms-of-service often restrict using outputs to train a competing model. Read the terms before shipping a fine-tune.
- Quality drift. A judge prompt that worked on the first 10K rows may not catch failure modes in rows 50K to 100K. Re-validate the judge periodically.
- Distribution mismatch. Synthetic queries from a teacher rarely match real user queries exactly. Plan for a real-user fine-tune pass after the synthetic warm-up.
How Future AGI fits in the synthetic fine-tuning stack
Fine-tuning itself runs in Hugging Face TRL, Unsloth, Axolotl, Llama Factory, or the OpenAI fine-tuning API. Future AGI is not a fine-tuning framework. The places Future AGI fits are upstream and downstream of the training step:
- Upstream: dataset quality filter. The ai-evaluation library (Apache 2.0) scores synthetic rows for instruction adherence, faithfulness, safety, and task adherence with named one-line evaluate() calls. The cost is a judge call per row; the win is discarding the noisy bottom of the dataset before it pollutes the fine-tune.
- Upstream: dataset generation observability. traceAI (Apache 2.0) wraps the teacher generation pipeline in OpenInference spans so each generated row is a trace and each judge call is a child span. Failed-judge filtering becomes a span query, not a log scrape.
- Downstream: student evaluation. The same evaluator templates score the fine-tuned student on a held-out set. Faithfulness on RAG questions, task adherence on agent rollouts, hallucination on free-form generation.
- Downstream: simulation. fi.simulate.TestRunner runs scripted agent interactions against the fine-tuned student to catch behavioral regressions before deployment.
For runtime traffic from the deployed fine-tune, the Agent Command Center at /platform/monitor/command-center routes calls through the same evaluators as inline guardrails, gating responses that fail a faithfulness or safety check before they reach the user.
Use cases by domain
Healthcare. Synthetic patient-encounter dialogues, de-identified clinical notes, and (query, chunk, answer) triples over a curated medical corpus. The judge rubric includes safety and a no-fabricated-citations rule.
Customer support. Synthetic ticket and conversation pairs covering the long tail of product issues. Function-calling traces for agents that look up orders, run refunds, and update accounts.
Legal. Synthetic clause analysis and Q&A over generated contracts. The judge rubric includes jurisdiction tagging and a no-hallucinated-citations rule.
Finance. Synthetic transaction patterns, fraud-flag explanations, and analyst-summary pairs. The judge rubric weights factual accuracy heavily.
Code. Synthetic (problem, solution) pairs and tool-using traces over real APIs. The judge step often includes a runtime checker that executes the code.
Summary
Synthetic data is the default scaling primitive for LLM fine-tuning in 2026. The workflows that matter are Self-Instruct and Evol-Instruct for SFT, Constitutional AI for safety, DPO and IPO for preference alignment, function-calling traces for agents, RAG QA for retrieval-grounded tuning, and distillation for smaller reasoning models. The recipe is consistent: small real seed, frontier teacher, structured generation, judge filter, JSONL output into TRL or Unsloth. The single most important step is the judge filter; an unfiltered synthetic dataset is worse than a smaller filtered one. Future AGI’s ai-evaluation library (Apache 2.0) supplies the named evaluators for that filter and for the downstream student evaluation, and traceAI (Apache 2.0) covers the pipeline observability.
Frequently asked questions
What is synthetic data for LLM fine-tuning in 2026?
How is synthetic data different from data augmentation?
Which teacher model should I use to generate synthetic data?
What is Self-Instruct and how does it work in 2026?
How do I generate DPO or IPO preference data with synthetic outputs?
How do I evaluate synthetic dataset quality before fine-tuning?
Should I use synthetic data instead of real data?
What changed between 2025 and 2026 in synthetic data for fine-tuning?