How to Generate Synthetic Data Using LLMs (2026)
Generate 10x what you need and keep the 10% that survives the filter. Persona, taxonomy, and hostile-evolution patterns plus the rejection pipeline most posts skip.
Most posts on synthetic data stop at “ask the teacher model to generate examples.” Six months later the team has a fine-tune that overfit to GPT-5’s voice, an eval set that scores high on a benchmark the model is now stylistically aligned to, and a regression suite that catches none of the failures real users hit. The teacher model is the easy half. The hostile filter that throws out the 30 percent of raw generation that is wrong, near-duplicate, or stylistically collapsed is where the synthetic-data project succeeds or fails.
The opinion this post earns: generate 10x more than you need; keep the 10 percent that survives the filter. Persona, taxonomy, and hostile-evolution are the three patterns worth investing in. The filter pipeline (dedupe, hostile judge, human spot-check) is what separates a regression suite that compounds from a synthetic set that ships bias the team never noticed.
TL;DR: three patterns and a filter that does the actual work
| Pattern | What it produces | When to use |
|---|---|---|
| Persona-prompted | Realistic intent variations | Chatbot, support-bot, conversational eval sets |
| Taxonomy-stratified | Quota-balanced coverage of input space | Classification, intent detection, RAG eval sets |
| Hostile-evolution | Adversarial and edge-case inputs | Red-team sets, robustness training, jailbreak suites |
Then the filter: deduplicate on embeddings, reject anything a hostile judge marks failing, human-spot-check 5 percent of survivors, calibrate the surviving distribution against a real reference set. Most teams skip three of the four steps and ship anyway.
When synthetic data works (and when it does not)
Synthetic data is a complement to real data, never a replacement. The cases where it earns its keep:
- Cold start. A new product has no production traffic. Synthetic is the only way to assemble an eval set before launch.
- Rare classes. The hardest 10 percent of cases drive most eval signal, and that 10 percent is rare by definition. Synthetic concentrates on the tail.
- Regulated data. Medical records, financial documents, and PII cannot be used directly even when you have them. Synthetic data that preserves structure without preserving identity is the legally compliant proxy.
- Adversarial coverage. Real production traffic does not naturally surface jailbreaks and prompt injections at the volume needed to test defenses. You manufacture them.
The cases where it fails:
- Distribution mismatch. The synthetic distribution does not match production. The fine-tune trains on a world it never sees.
- Mode collapse. The generator’s stylistic preferences dominate. The synthetic set ends up with roughly 30 percent of the latent diversity of the real reference, and the eval becomes a test of how well the production model matches the teacher’s voice.
- Judge contamination. Generator and judge come from the same model family. The judge approves exactly the failure modes the generator produced.
- No filter. Raw generation ships unfiltered. Roughly 30 percent of those examples are factually wrong, near-duplicates, or off-distribution; the team trains on noise.
The four failure modes share a root cause: treating synthetic generation as a one-shot prompt rather than a generate-filter-calibrate pipeline. The next sections walk the pipeline.
Pattern 1: persona-prompted generation
Seed the generator with a synthetic user persona (demographics, intent, sophistication, mood, dialect) and ask it to produce inputs in that persona’s voice. The output is a stream of intent variations that reflect how different users actually phrase the same underlying request. Persona-conditioned prompting is a common way to scale beyond what the source data supports; Tencent AI Lab’s Persona Hub work is one public, large-scale example.
PERSONA_TEMPLATE = """
You are simulating: {persona_name}, a {age}-year-old {profession} in {region}.
Communication style: {style}. Tech literacy: {literacy}. Emotional state: {mood}.
Generate three realistic questions this person would ask a {product} support agent
about {topic}. Use vocabulary, sentence structure, and idiom that matches the persona.
Do not generalize. Do not write 'as a [profession], I would ask...'; write the question.
"""
Strengths. Produces intent variation pure paraphrase cannot. A “frustrated Gen Z user on mobile” and a “methodical Gen X user on a desktop” phrase the same support request in radically different ways; the model needs to handle both.
Weaknesses. Personas drift toward stereotypes. The LLM’s idea of “frustrated user” is a caricature; its idea of “technical user” is everyone who writes in monospace. Curated persona sets need human review or they encode the teacher model’s prejudices.
Mitigation. Build personas from real user research (support transcripts, interview notes, segmented analytics) rather than the LLM’s defaults. Reject persona outputs that read as caricature in the hostile-evaluator pass. Sample-label 50 generated inputs per persona and gate on plausibility before promoting that persona to a production batch. The Future AGI platform’s synthetic data agent ships a ten-dimension taxonomy spanning demographic, behavioral, linguistic, temporal, and emotional axes for exactly this reason; persona breadth is the lever that prevents collapse.
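As a rough sketch of how the template above might be filled from researched personas: the Persona record and render_prompt helper below are hypothetical names for illustration, not part of any SDK mentioned in this post, and the example persona is invented.

```python
from dataclasses import dataclass, asdict

# Same template as above, repeated so the sketch runs on its own.
PERSONA_TEMPLATE = """
You are simulating: {persona_name}, a {age}-year-old {profession} in {region}.
Communication style: {style}. Tech literacy: {literacy}. Emotional state: {mood}.
Generate three realistic questions this person would ask a {product} support agent
about {topic}. Use vocabulary, sentence structure, and idiom that matches the persona.
Do not generalize. Do not write 'as a [profession], I would ask...'; write the question.
"""


@dataclass(frozen=True)
class Persona:
    """Hypothetical record; the fields mirror the template slots."""
    persona_name: str
    age: int
    profession: str
    region: str
    style: str
    literacy: str
    mood: str


def render_prompt(persona: Persona, product: str, topic: str) -> str:
    """Fill every slot; str.format raises KeyError if a slot has no value."""
    fields = asdict(persona)
    return PERSONA_TEMPLATE.format(**fields, product=product, topic=topic).strip()


# Illustrative persona; in practice these come from user research.
maria = Persona("Maria", 54, "bookkeeper", "Ohio", "terse, polite", "low", "anxious")
prompt = render_prompt(maria, product="payroll app", topic="a failed direct deposit")
```

Keeping personas as typed records rather than free text makes it easier to review them by hand and to trace every generated input back to the persona that produced it.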
Pattern 2: taxonomy-stratified generation
Slice the input space into a multi-dimensional grid (intent x complexity x sentiment x dialect x edge-case-flag) and generate a quota per cell. The grid is the worldview: each axis encodes a dimension you care about, and the cell coverage is the contract. Unconstrained “generate 1000 support queries” tends to over-sample the easy quadrant; stratified generation forces the generator to fill the hard cells.
import itertools

TAXONOMY = {
    "intent": ["refund", "shipping", "account", "technical", "billing"],
    "complexity": ["simple", "compound", "ambiguous", "multi-hop"],
    "sentiment": ["calm", "frustrated", "panicked", "skeptical"],
    "dialect": ["formal_us", "casual_us", "indian_english", "non_native"],
}
# Cartesian product = 5 x 4 x 4 x 4 = 320 cells. Generate 5-10 examples per cell.
for cell in itertools.product(*TAXONOMY.values()):
    intent, complexity, sentiment, dialect = cell
    # generate_batch is a placeholder for your own call into the generator model.
    generate_batch(cell, n=8)
Strengths. Coverage you can audit. A taxonomy gap is visible (cell N has 0 examples); a coverage gap in unstructured generation is invisible until production. The Self-Instruct work (Wang et al.) took a coarser route to the same goal, growing 175 seed instructions into roughly 52K filtered instructions by prompting for new tasks and discarding near-duplicates.
Weaknesses. Cell explosion. A 5-axis taxonomy with 5 values per axis is 3,125 cells. At 8 examples per cell, that is 25,000 raw generations. The taxonomy needs ruthless pruning before launch; cells whose combinations are implausible (a “panicked formal_us” cell rarely occurs in real support traffic) should be dropped, not filled.
Mitigation. Build the taxonomy from real production traces, not from imagination. The dimensions you stratify on should be the dimensions where real production traffic shows variation; the values within each dimension should be the values real traffic actually contains. After generation, check that the cell-occupancy distribution roughly matches real production frequency; cells with no real-world analog get dropped.
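One way to turn that mitigation into code is to count how often each cell occurs in labeled production traces, drop the cells that never occur, and split the generation budget by observed frequency. This is a minimal sketch under assumptions: plan_quotas and the min_per_cell floor are hypothetical, the taxonomy is cut to two axes for brevity, and the trace labels are invented.

```python
import itertools
from collections import Counter

# Reduced two-axis taxonomy so the example stays readable.
TAXONOMY = {
    "intent": ["refund", "shipping"],
    "sentiment": ["calm", "panicked"],
}


def plan_quotas(real_cells, total, min_per_cell=1):
    """Drop cells with no real-world analog; split `total` by real frequency.

    `real_cells` is one (intent, sentiment) tuple per labeled production trace.
    The floor keeps rare but real cells from rounding down to zero.
    """
    counts = Counter(real_cells)
    all_cells = list(itertools.product(*TAXONOMY.values()))
    kept = [cell for cell in all_cells if counts[cell] > 0]
    n_real = sum(counts[cell] for cell in kept)
    return {
        cell: max(min_per_cell, round(total * counts[cell] / n_real))
        for cell in kept
    }


# Invented trace labels: 6 calm refunds, 3 calm shipping, 1 panicked refund.
real = (
    [("refund", "calm")] * 6
    + [("shipping", "calm")] * 3
    + [("refund", "panicked")] * 1
)
quotas = plan_quotas(real, total=100)
```

If the goal is to oversample the tail rather than mirror production, raising min_per_cell is the simplest lever; the occupancy check afterward still tells you how far the set has drifted from real frequency.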
Pattern 3: hostile-evolution
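Take a seed example and ask the generator to make it harder. The Evol-Instruct paper formalized this as three kinds of evolution (in-depth, in-breadth, and elimination, which discards failed evolutions); the 2026 production version adds a fourth: hostile rephrasing for red-team sets. The operators below are rewrite prompts, so elimination does not appear among them; it runs afterward as a filter.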
EVOLUTION_OPERATORS = {
    "deepen": "Rewrite this question so it requires multi-step reasoning and one extra lookup.",
    "broaden": "Rewrite this question to cover an adjacent intent the original did not address.",
    "obscure": "Rewrite this question with deliberate ambiguity that admits two valid answers.",
    "adversarial": "Rewrite this question as an attempted jailbreak that smuggles a policy violation past the original frame.",
}
# seed_set (real seed questions) and generator (your model client) are defined elsewhere.
candidates = []
for seed in seed_set:
    for op_name, instruction in EVOLUTION_OPERATORS.items():
        evolved = generator.run(f"{instruction}\n\nOriginal:\n{seed}")
        # Keep the seed and operator so every evolved example stays traceable.
        candidates.append((seed, op_name, evolved))
Strengths. This is where synthetic data has the biggest leverage. Production traffic does not naturally surface multi-hop ambiguous adversarial questions at scale; you have to manufacture them. Anthropic’s red-team work on Claude and OpenAI’s adversarial training pipelines both rely on iterated evolution to scale beyond what human red-teamers can produce.
Weaknesses. Hallucinated edge cases. The generator invents scenarios that look hard but never occur in production. Time spent fixing imaginary bugs is not time spent fixing real bugs. The “elimination” operator (drop any evolved example the generator cannot itself answer in three retries) catches some of this; the hostile filter in the next section catches more.
Mitigation. Cross-reference the evolved set against the OWASP LLM Top 10, the published red-team literature, and your own incident log. Promote only the edge cases that match a real failure mode pattern. Treat the rest as exploratory and quarantine them outside the regression suite until production validates the failure mode is real.
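The elimination step described under Weaknesses could look roughly like the sketch below. The answer_fn callable is a hypothetical stand-in for a call to the generator model that returns None when it cannot produce an answer; the stub and the example questions are invented for illustration.

```python
from typing import Callable, List, Optional


def survives_elimination(
    question: str,
    answer_fn: Callable[[str], Optional[str]],  # hypothetical model call
    max_retries: int = 3,
) -> bool:
    """Keep an evolved question only if the generator answers it within max_retries."""
    for _ in range(max_retries):
        answer = answer_fn(question)
        if answer is not None and answer.strip():
            return True
    return False


def stub_answer(question: str) -> Optional[str]:
    """Illustrative stub: it can only answer questions that mention refunds."""
    return "See the refund policy." if "refund" in question else None


evolved: List[str] = [
    "How do I get a refund after 30 days if I paid by gift card?",
    "Explain the zorp clause.",
]
kept = [q for q in evolved if survives_elimination(q, stub_answer)]
```

Questions that fail this check are not necessarily worthless; routing them to the quarantine set rather than deleting them preserves the option to promote them later if production shows the failure mode is real.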
The seed-versus-diversity tension
Every synthetic-data run has a seed. The seed is a tiny set of high-quality real examples that anchor the generator. The seed is also the gravity well that pulls the generator toward a narrow region of the input space.
Too little seed: the generator drifts. The synthetic distribution wanders away from production and the eval becomes a benchmark on an imagined world.
Too much seed: the generator collapses. The synthetic distribution mirrors the seed’s idiosyncrasies (the same dialects, the same complexity profile, the same handful of intents) and the diversity gain is illusory.
The 2026 working pattern: start with 50-200 high-quality real examples as seed, ask the generator to produce 5-10x that with explicit instructions to vary along axes the seed under-represents, then measure diversity against the real reference distribution. Three metrics:
- Pairwise cosine distance. Embed every example with a consistent model (text-embedding-3-large, multilingual-e5-large). The synthetic distribution’s pairwise distance histogram should be at least as wide as the real reference. Tighter means mode collapse.
- Unigram and bigram entropy. Tokenize, count, compute entropy. Synthetic sets that collapsed onto the generator’s voice score 15-30 percent lower entropy than the real reference.
- Cluster count under HDBSCAN. Cluster the embeddings with min_cluster_size=5. Collapsed sets produce 30-50 percent fewer clusters than the real distribution.
Any one failing is reason to regenerate with a different seed strategy or a multi-model mix. The Future AGI synthetic data agent runs all three as gates inside its diversity-evaluation phase; the goal is to flag the collapse before the set ships.
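Two of the three metrics need nothing beyond the standard library. The sketch below is a simplification under stated assumptions: it tokenizes on whitespace, summarizes the pairwise-distance histogram by its mean, and takes embeddings as plain lists of floats from whatever model you use. The function names and the toy sentences are illustrative; the HDBSCAN cluster count is left out because it needs a clustering library.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_entropy(texts, n=1):
    """Shannon entropy in bits of the whitespace-token n-gram distribution."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_pairwise_distance(vectors):
    """Mean cosine distance over all pairs; needs at least two vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy reference and a deliberately collapsed synthetic set.
real = ["where is my order", "cancel my plan now", "refund the double charge"]
synthetic = ["where is my order", "where is my refund", "where is my order"]
entropy_ratio = ngram_entropy(synthetic) / ngram_entropy(real)
```

An entropy_ratio well below 1.0, or a synthetic mean pairwise distance below the real one, is the signal to regenerate. Comparing full histograms instead of means is closer to the check described above and catches collapse that a mean can hide.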
The filter pipeline: the work most posts skip
Generate 10x more than you need. Filter aggressively. Keep roughly 10 percent.
The pipeline has four stages, every one of them load-bearing:
1. Embedding-based deduplication. Compute embeddings for every candidate. Reject any candidate whose nearest neighbor in the existing accepted set is above cosine similarity 0.92 (tune per domain). String dedup catches verbatim duplicates and misses paraphrases; embedding dedup catches both. For a 10K candidate set this typically rejects 15-25 percent.
2. Hostile-evaluator LLM judge. A different model family from the generator. The judge scores each candidate against the task definition: faithfulness, intent clarity, plausibility, novelty against the seed. Reject anything below threshold (we use 0.7 on a 0-1 scale). For raw generations this typically rejects an additional 40-60 percent. The Future AGI ai-evaluation SDK ships CustomLLMJudge for exactly this filter; the grading_criteria field is a free-text rubric that turns into a Jinja2-templated prompt and a LiteLLM provider call so you can swap judge models without rewriting the rubric.
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

filter_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "synthetic_filter_hostile",
        "model": "gpt-5",  # Different family from the generator (Claude Sonnet 4.5)
        "grading_criteria": (
            "Reject this generated example if any of: (a) the question is implausible "
            "for a real customer to ask, (b) the question is stylistically collapsed onto "
            "a single voice, (c) the question duplicates an obvious pattern in the seed, "
            "(d) the expected answer is trivially derivable from the question text. "
            "Score 1.0 = keep, 0.0 = reject. Output a single number and one-line reason."
        ),
    },
)
3. Human spot-check. Sample 5 percent of survivors at random and label by hand. Mark plausibility and label correctness. Reject the entire batch if plausibility drops below 0.85. The 5 percent sample is cheap; the alternative is shipping bias the team did not notice.
4. Distribution calibration. Compare the surviving synthetic distribution against the real reference on unigram, bigram, embedding centroid, and cluster count. The synthetic distribution should sit within 1 standard deviation of the real on every metric. Material drift is a regenerate-with-different-seed signal.
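Stages 1 and 3 are simple enough to sketch without any SDK. The code below is an illustration under assumptions: dedupe is a greedy quadratic pass that would need an approximate-nearest-neighbor index at real scale, the 2-d vectors stand in for real embeddings, and the helper names are hypothetical. The 0.92, 5 percent, and 0.85 values are the ones quoted in the list above.

```python
import math
import random


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def dedupe(candidates, threshold=0.92):
    """Greedy pass: accept a candidate unless it is too close to one already accepted.

    `candidates` is a list of (text, embedding) pairs from one embedding model.
    """
    accepted = []
    for text, emb in candidates:
        if all(cosine_similarity(emb, kept) <= threshold for _, kept in accepted):
            accepted.append((text, emb))
    return [text for text, _ in accepted]


def spot_check_sample(survivors, rate=0.05, seed=0):
    """Random sample for hand labeling; always returns at least one item."""
    k = max(1, round(len(survivors) * rate))
    return random.Random(seed).sample(survivors, k)


def batch_passes(plausible_flags, min_plausibility=0.85):
    """True if the share of hand-labeled plausible examples clears the bar."""
    return sum(plausible_flags) / len(plausible_flags) >= min_plausibility


# Toy 2-d vectors standing in for real embeddings.
batch = [
    ("Where is my order?", (1.0, 0.0)),
    ("Where's my order at?", (0.99, 0.05)),  # paraphrase of the first
    ("Cancel my subscription", (0.0, 1.0)),
]
unique = dedupe(batch)
```

Because the pass is greedy, the order of candidates decides which member of a near-duplicate pair survives; sorting by judge score first is one way to keep the stronger example.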
The 10x-generate / 10%-keep ratio is the production discipline. Most teams generate 1x and keep 100 percent because the filter feels expensive. The filter pays for itself the first time it catches a 30-percent-wrong batch before it contaminates the eval set.
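For sizing a run, the arithmetic is worth writing down. The helpers below are hypothetical and only restate the numbers in this section: a 10 percent keep rate, and the per-stage rejection ranges quoted for deduplication (15-25 percent) and the judge (an additional 40-60 percent).

```python
import math
from fractions import Fraction


def raw_needed(target_survivors: int, keep_rate: float = 0.10) -> int:
    """Raw generations required to end with `target_survivors` at a given keep rate."""
    # Fraction avoids float rounding pushing the ceiling up by one.
    return math.ceil(Fraction(target_survivors) / Fraction(str(keep_rate)))


def survival_after(reject_rates) -> float:
    """Fraction surviving a chain of stages, each rejecting a share of what reaches it."""
    frac = 1.0
    for rate in reject_rates:
        frac *= 1.0 - rate
    return frac


raw_for_eval_set = raw_needed(500)           # 500 survivors at a 10 percent keep rate
strict_end = survival_after([0.25, 0.60])    # upper-end dedupe and judge rejection
lenient_end = survival_after([0.15, 0.40])   # lower-end dedupe and judge rejection
```

At the quoted rates the two per-candidate stages alone leave roughly 30 to 51 percent of raw generations. The rest of the reduction toward 10 percent would have to come from stricter thresholds, batch-level rejection at the spot-check, or calibration-driven regeneration, none of which this section puts a rate on, so treat 10 percent as a planning target rather than a derived figure.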
Calibration: synthetic versus real
The acid test is utility. Train or fine-tune on synthetic; measure performance on a held-out real set. If synthetic helps, it is good synthetic. If synthetic hurts, the set is overfitting to generator artifacts.
Three calibration checks before the synthetic set goes into a training run:
- Distribution match. Compute the centroid distance between synthetic and real embeddings. Above 0.15 cosine distance is a regenerate signal; the synthetic set is in a different region of the input space than production.
- Label calibration. If labels are LLM-generated, sample 5 percent and re-label with humans. Inter-rater reliability below kappa 0.6 means the LLM-labelled set is encoding bias the team is not tracking.
- Held-out utility. Train a small classifier on synthetic-only, then measure performance on a real-only held-out set. If synthetic-trained accuracy on real is materially lower than synthetic-trained accuracy on synthetic, the generator hallucinated a world the model now optimizes for.
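The first two checks reduce to a few lines of arithmetic. This is a minimal sketch with hypothetical helper names and invented toy data; the 0.15 and 0.6 cut-offs are the ones given in the list above.

```python
import math
from collections import Counter


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cohens_kappa(labels_a, labels_b):
    """Agreement between two label lists, corrected for chance agreement.

    Undefined (division by zero) when both raters use a single identical label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Toy embeddings: the synthetic set sits on one axis, the real set spans two.
synthetic_emb = [(1.0, 0.0), (1.0, 0.0)]
real_emb = [(1.0, 0.0), (0.0, 1.0)]
drift = cosine_distance(centroid(synthetic_emb), centroid(real_emb))

# Toy labels: LLM labels versus human re-labels on the same four examples.
llm_labels = ["refund", "refund", "billing", "billing"]
human_labels = ["refund", "refund", "billing", "refund"]
kappa = cohens_kappa(llm_labels, human_labels)

needs_regeneration = drift > 0.15
labels_untrustworthy = kappa < 0.6
```

In this toy case both flags fire: the centroid distance is about 0.29 and kappa is 0.5. Centroid distance is a coarse check; two sets can share a centroid and still differ in spread, which is why the diversity metrics earlier in the pipeline matter as well.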
The Self-Instruct paper, whose recipe Alpaca followed for its 52K instructions, filtered generations for novelty before they entered the task pool; the LIMA paper pushed curation further and reported competitive results from 1K carefully curated examples. Curation discipline is the lever that turns 10K raw generations into a 1K set worth a $50K fine-tune run.
Three use cases, three filter intensities
Eval set. Highest filter intensity. 200-500 survivors per route, hostile-evaluator threshold 0.8 or higher, human spot-check 10 percent. The eval set has to be right or the CI gate becomes theatre. The synthetic-test-data approach covers the gate-grade construction in more depth.
Fine-tune set. Mid filter intensity. 5K-50K survivors, hostile-evaluator threshold 0.7, human spot-check 1-2 percent. Volume matters more than for an eval set because the gradient averages noise. But unfiltered fine-tune sets ship stylistic bias the production model inherits and never sheds. The synthetic-data-for-fine-tuning post covers volume-vs-quality tradeoffs at the fine-tune scale.
Red-team set. Filter intensity inverts. Generate aggressively with hostile-evolution operators, but the “reject” criterion in the filter is “this is not a realistic attack pattern” rather than “this is too hard.” You want hard. The 8 sub-10ms Scanners in the ai-evaluation SDK (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) catch the deterministic adversarial patterns; the LLM judge catches the rest. The red-teaming step-by-step guide covers the operator design.
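The three intensities can be kept as explicit configuration so that nobody applies the fine-tune settings to an eval set by accident. FilterProfile is a hypothetical record, not an SDK type. The eval and fine-tune values come from the paragraphs above (using the upper end of the 1-2 percent range); the red-team threshold and spot-check rate are assumptions, since the text only says the reject criterion changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterProfile:
    """Hypothetical per-use-case filter settings."""
    judge_threshold: float   # minimum judge score to keep a candidate
    spot_check_rate: float   # share of survivors labeled by hand
    reject_criterion: str    # what the judge rubric is asked to reject


PROFILES = {
    "eval": FilterProfile(0.8, 0.10, "implausible, collapsed, or mislabeled"),
    "fine_tune": FilterProfile(0.7, 0.02, "implausible, collapsed, or mislabeled"),
    # Assumed numbers: the source gives no threshold or sample rate for red-team sets.
    "red_team": FilterProfile(0.7, 0.05, "not a realistic attack pattern"),
}


def keep(score: float, use_case: str) -> bool:
    """Apply the judge threshold for the given use case."""
    return score >= PROFILES[use_case].judge_threshold


borderline = 0.75
decisions = {name: keep(borderline, name) for name in PROFILES}
```

The same borderline candidate is kept for fine-tuning and rejected for the eval set, which is the intended asymmetry.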
The hidden risk: judge contamination
If the generator and the judge come from the same model family, the eval is measuring stylistic alignment to the teacher rather than task quality. This is the failure mode that ruins fine-tuning evaluations and almost no public guide names it.
The mechanism: GPT-5 generates synthetic training data. The team fine-tunes the production model on that data. The fine-tune now produces outputs in GPT-5’s dialect. The team then evaluates with a GPT-5-based judge. The judge scores the fine-tuned model higher than the un-fine-tuned baseline, but the lift is stylistic alignment to GPT-5, not task improvement. The team ships, real users notice nothing improved, and the synthetic-data project looks like a win on the dashboard and a wash in production.
Three disciplines that prevent contamination:
- Cross-family judge. Generator and judge come from different model families. Claude generates, GPT or Gemini judges, or rotate per batch.
- Human-labeled hold-out. A 100-200 case set labeled by humans, sampled from real users, that the judge has never seen. The judge’s score on this hold-out is the calibration anchor; if it drifts, the judge is no longer trustworthy.
- Independent eval pipeline. The fine-tune evaluation runs against a real-user-sourced eval set and a human-labeled hold-out, not against synthetic-data-derived examples. The synthetic set teaches the model; it does not grade the model.
Most synthetic-data project post-mortems trace failure to one of these three. The fix is cheap; the discipline is the work.
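The cross-family rule is easy to enforce in code rather than by convention. A minimal sketch, assuming a hand-maintained registry; the model identifiers below are placeholders, and pick_judge is a hypothetical helper.

```python
# Hypothetical registry: map each model id you use to its vendor family.
MODEL_FAMILY = {
    "claude-sonnet-4.5": "anthropic",
    "gpt-5": "openai",
    "gemini-pro": "google",
}


def pick_judge(generator: str, batch_index: int) -> str:
    """Rotate per batch across judges whose family differs from the generator's."""
    generator_family = MODEL_FAMILY[generator]
    eligible = sorted(
        model for model, family in MODEL_FAMILY.items() if family != generator_family
    )
    if not eligible:
        raise ValueError("no cross-family judge available")
    return eligible[batch_index % len(eligible)]


judges = [pick_judge("claude-sonnet-4.5", i) for i in range(4)]
```

Failing loudly when no cross-family judge exists is deliberate: silently falling back to a same-family judge reintroduces exactly the contamination this section describes.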
How Future AGI fits a synthetic-data workflow
The Future AGI Platform ships a synthetic data agent in the EE tier that runs the ten-dimension taxonomy, parallel plan execution, and validation-and-repair loop the production version of this post describes. The pipeline phases (input validation → planning → batch generation → quality validation → diversity evaluation → distribution correction) match the three patterns and four-stage filter described above. Schema-only, reference-data, and knowledge-base modes cover the three seed strategies.
The eval stack package handles the filter and the calibration:
- ai-evaluation SDK (Apache 2.0). Entry point: from fi.evals import Evaluator, Protect. CustomLLMJudge is the hostile-evaluator filter (a different model family from the generator via LiteLLM swap). 72 EvalTemplate classes for downstream task scoring (Groundedness, ContextAdherence, FactualAccuracy, Completeness, Toxicity, PromptInjection, TaskCompletion, EvaluateFunctionCalling). 8 sub-10ms Scanners for adversarial-set safety filtering. 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B; 4 API). Four distributed runners (Celery, Ray, Temporal, Kubernetes) for grading 100K+ candidates without melting a single worker. AutoEval pipelines build the rubric from a natural-language brief.
- Future AGI Platform. The in-product agent authors unlimited custom evaluators from natural-language description; describe a rubric, the agent writes the grading prompt plus reference examples, you grade the synthetic set with it. Self-improving evaluators retune from thumbs feedback so the filter ages with the product rather than decaying. Classifier-backed scoring runs at lower per-eval cost than Galileo Luna-2, which is what makes a 100K-candidate filter pass financially viable rather than a quarterly conversation.
- Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse-stored embeddings identifies which failure modes the synthetic set should next target. A Claude Sonnet 4.5 Judge agent writes the immediate_fix, and the failure modes feed back into the synthetic-data brief, so the next generation pass closes the gap the last pass left open.
- traceAI (Apache 2.0). 50+ AI surfaces across Python (46 packages) / TypeScript (39) / Java (24) / C#. Useful for tracking which synthetic inputs the production model actually fails on once shipped, so the next generation cycle targets the live failure modes rather than imagined ones.
The eval stack is the filter; the synthetic data agent is the generator; the closed loop between them sharpens the set quarter over quarter rather than letting it decay.
To put this into practice: pip install ai-evaluation, build the persona/taxonomy/evolution generator from the patterns above, point CustomLLMJudge at a cross-family judge model, and run the four-stage filter on every batch. The 10 percent that survives is the regression suite you ship.
Related reading
- The Definitive Guide to Synthetic Data Generation (2026)
- Synthetic Test Data for LLM Evaluation (2026)
- Synthetic Data for Fine-Tuning LLMs
- Build an LLM Evaluation Framework From Scratch (2026)
- The 2026 LLM Evaluation Playbook
- Red Teaming LLMs: A Step-by-Step Guide (2026)
- LLM Eval Edge Cases & Adversarial Inputs (2026)