How to Generate Synthetic Data Using LLMs (2026)

Generate 10x what you need and keep the 10% that survives the filter. Persona, taxonomy, and hostile-evolution patterns plus the rejection pipeline most posts skip.


Most posts on synthetic data stop at “ask the teacher model to generate examples.” Six months later the team has a fine-tune that overfit to GPT-5’s voice, an eval set that scores high on a benchmark the model is now stylistically aligned to, and a regression suite that catches none of the failures real users hit. The teacher model is the easy half. The hostile filter that throws out the 30 percent of raw generation that is wrong, near-duplicate, or stylistically collapsed is where the synthetic-data project succeeds or fails.

The opinion this post earns: generate 10x more than you need; keep the 10 percent that survives the filter. Persona, taxonomy, and hostile-evolution are the three patterns worth investing in. The filter pipeline (dedupe, hostile judge, human spot-check) is what separates a regression suite that compounds from a synthetic set that ships bias the team never noticed.

TL;DR: three patterns and a filter that does the actual work

| Pattern | What it produces | When to use |
|---|---|---|
| Persona-prompted | Realistic intent variations | Chatbot, support-bot, conversational eval sets |
| Taxonomy-stratified | Quota-balanced coverage of input space | Classification, intent detection, RAG eval sets |
| Hostile-evolution | Adversarial and edge-case inputs | Red-team sets, robustness training, jailbreak suites |

Then the filter: deduplicate on embeddings, reject anything a hostile judge marks failing, human-spot-check 5 percent of survivors, calibrate the surviving distribution against a real reference set. Most teams skip three of the four steps and ship anyway.

When synthetic data works (and when it does not)

Synthetic data is a complement to real data, never a replacement. The cases where it earns its keep:

  • Cold start. A new product has no production traffic. Synthetic is the only way to assemble an eval set before launch.
  • Rare classes. The hardest 10 percent of cases drive most eval signal, and that 10 percent is rare by definition. Synthetic concentrates on the tail.
  • Regulated data. Medical records, financial documents, and PII cannot be used directly even when you have them. Synthetic that preserves structure without preserving identity is the legal-compliant proxy.
  • Adversarial coverage. Real production traffic does not naturally surface jailbreaks and prompt injections at the volume needed to test defenses. You manufacture them.

The cases where it fails:

  • Distribution mismatch. The synthetic distribution does not match production. The fine-tune trains on a world it never sees.
  • Mode collapse. The generator’s stylistic preferences dominate. The synthetic set can end up with as little as 30 percent of the real reference’s latent diversity, and the eval becomes a test of how well the production model matches the teacher’s voice.
  • Judge contamination. Generator and judge come from the same model family. The judge approves exactly the failure modes the generator produced.
  • No filter. Raw generation ships unfiltered. Roughly 30 percent of those examples are factually wrong, near-duplicates, or off-distribution; the team trains on noise.

The four failure modes share a root cause: treating synthetic generation as a one-shot prompt rather than a generate-filter-calibrate pipeline. The next sections walk the pipeline.

Pattern 1: persona-prompted generation

Seed the generator with a synthetic user persona (demographics, intent, sophistication, mood, dialect) and ask it to produce inputs in that persona’s voice. The output is a stream of intent variations that reflect how different users actually phrase the same underlying request. Large-scale persona-driven synthesis efforts such as Tencent AI Lab’s Persona Hub use the same idea to scale beyond what the source data supports.

PERSONA_TEMPLATE = """
You are simulating: {persona_name}, a {age}-year-old {profession} in {region}.
Communication style: {style}. Tech literacy: {literacy}. Emotional state: {mood}.

Generate three realistic questions this person would ask a {product} support agent
about {topic}. Use vocabulary, sentence structure, and idiom that matches the persona.
Do not generalize. Do not write 'as a [profession], I would ask...'; write the question.
"""

Strengths. Produces intent variation pure paraphrase cannot. A “frustrated Gen Z user on mobile” and a “methodical Gen X user on a desktop” phrase the same support request in radically different ways; the model needs to handle both.

Weaknesses. Personas drift toward stereotypes. The LLM’s idea of “frustrated user” is a caricature; its idea of “technical user” is everyone who writes in monospace. Curated persona sets need human review or they encode the teacher model’s prejudices.

Mitigation. Build personas from real user research (support transcripts, interview notes, segmented analytics) rather than the LLM’s defaults. Reject persona outputs that read as caricature in the hostile-evaluator pass. Sample-label 50 generated inputs per persona and gate on plausibility before promoting that persona to a production batch. The Future AGI platform’s synthetic data agent ships a ten-dimension taxonomy spanning demographic, behavioral, linguistic, temporal, and emotional axes for exactly this reason; persona breadth is the lever that prevents collapse.
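
A minimal sketch of how persona records could be turned into generator prompts, assuming the personas live in plain dictionaries. The persona values, the `build_prompts` helper, and the product and topic strings are all hypothetical; the template is a shortened copy of the one above.

```python
import random

# Hypothetical persona records; in practice these come from user research,
# not from the LLM's defaults.
PERSONAS = [
    {"persona_name": "Dana", "age": 24, "profession": "barista", "region": "Ohio",
     "style": "terse, lowercase", "literacy": "medium", "mood": "frustrated"},
    {"persona_name": "Ravi", "age": 51, "profession": "accountant", "region": "Pune",
     "style": "methodical, full sentences", "literacy": "high", "mood": "calm"},
]

PROMPT = (
    "You are simulating: {persona_name}, a {age}-year-old {profession} in {region}.\n"
    "Communication style: {style}. Tech literacy: {literacy}. Emotional state: {mood}.\n"
    "Generate three realistic questions this person would ask a {product} support agent "
    "about {topic}."
)


def build_prompts(personas, product, topics, per_persona=2, seed=0):
    """Render one prompt per (persona, sampled topic) pair."""
    rng = random.Random(seed)  # fixed seed so a batch can be reproduced
    prompts = []
    for persona in personas:
        for topic in rng.sample(topics, k=min(per_persona, len(topics))):
            prompts.append(PROMPT.format(product=product, topic=topic, **persona))
    return prompts


prompts = build_prompts(PERSONAS, "billing app", ["refunds", "login", "invoices"])
```

Keeping personas as data rather than as prose inside the prompt makes it straightforward to review them by hand and to retire the ones that fail the plausibility gate.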

Pattern 2: taxonomy-stratified generation

Slice the input space into a multi-dimensional grid (intent x complexity x sentiment x dialect x edge-case-flag) and generate a quota per cell. The grid is the worldview: each axis encodes a dimension you care about, and the cell coverage is the contract. Unconstrained “generate 1000 support queries” tends to over-sample the easy quadrant; stratified generation forces the generator to fill the hard cells.

import itertools

TAXONOMY = {
    "intent": ["refund", "shipping", "account", "technical", "billing"],
    "complexity": ["simple", "compound", "ambiguous", "multi-hop"],
    "sentiment": ["calm", "frustrated", "panicked", "skeptical"],
    "dialect": ["formal_us", "casual_us", "indian_english", "non_native"],
}

# Cartesian product = 5 x 4 x 4 x 4 = 320 cells. Generate 5-10 examples per cell.
# generate_batch is your own wrapper around the generator call (not shown here).
for cell in itertools.product(*TAXONOMY.values()):
    intent, complexity, sentiment, dialect = cell  # order follows TAXONOMY's keys
    generate_batch(cell, n=8)

Strengths. Coverage you can audit. A taxonomy gap is visible (cell N has 0 examples); a coverage gap in unstructured generation is invisible until production. The Self-Instruct work scaled 175 seed instructions to roughly 52K filtered survivors with a much coarser form of structure (seed-conditioned prompts plus overlap filtering); an explicit grid makes the same kind of coverage auditable cell by cell.

Weaknesses. Cell explosion. A 5-axis taxonomy with 5 values per axis is 3,125 cells. At 8 examples per cell, that is 25,000 raw generations. The taxonomy needs ruthless pruning before launch; cells whose combinations are implausible (a “panicked formal_us” cell rarely occurs in real support traffic) should be dropped, not filled.

Mitigation. Build the taxonomy from real production traces, not from imagination. The dimensions you stratify on should be the dimensions where real production traffic shows variation; the values within each dimension should be the values real traffic actually contains. After generation, check that the cell-occupancy distribution roughly matches real production frequency; cells with no real-world analog get dropped.
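
The occupancy check described above can be sketched as a comparison of two cell-frequency tables. This is an illustration under assumptions: cells are tuples, production traffic has already been labeled with the same taxonomy, and the `occupancy_gaps` name and the 5-point tolerance are made up for the example.

```python
from collections import Counter


def occupancy_gaps(generated_cells, production_cells, tolerance=0.05):
    """Compare cell shares between the synthetic set and production traffic.

    Returns (drop, under_filled):
      drop         - cells that were generated but never occur in production
      under_filled - production cells whose synthetic share trails the
                     production share by more than `tolerance`
    """
    gen = Counter(generated_cells)
    prod = Counter(production_cells)
    gen_total = sum(gen.values()) or 1
    prod_total = sum(prod.values()) or 1

    drop = sorted(cell for cell in gen if cell not in prod)
    under_filled = sorted(
        cell for cell in prod
        if prod[cell] / prod_total - gen.get(cell, 0) / gen_total > tolerance
    )
    return drop, under_filled


# Toy data: (intent, sentiment) cells only, to keep the example small.
production = ([("refund", "calm")] * 6
              + [("refund", "frustrated")] * 3
              + [("billing", "calm")] * 1)
generated = ([("refund", "calm")] * 4
             + [("billing", "calm")] * 4
             + [("billing", "panicked")] * 2)

drop, under_filled = occupancy_gaps(generated, production)
```

Here `drop` flags the cell with no real-world analog, and `under_filled` lists the cells the next generation pass should target.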

Pattern 3: hostile-evolution

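Take a seed example and ask the generator to make it harder. The Evol-Instruct paper (the method behind WizardLM) formalized this as in-depth evolving (add constraints, deepen, require more reasoning steps), in-breadth evolving (branch to an adjacent instruction), and an elimination step that discards failed evolutions; the 2026 production version adds hostile rephrasing for red-team sets.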

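# Each operator is an instruction that gets prepended to the seed example.
EVOLUTION_OPERATORS = {
    "deepen":      "Rewrite this question so it requires multi-step reasoning and one extra lookup.",
    "broaden":     "Rewrite this question to cover an adjacent intent the original did not address.",
    "obscure":     "Rewrite this question with deliberate ambiguity that admits two valid answers.",
    "adversarial": "Rewrite this question as an attempted jailbreak that smuggles a policy violation past the original frame.",
}

# seed_set is the list of real seed examples; generator is whatever client wraps
# the teacher model (assumed here to expose a run(prompt) method).
candidates = []
for seed in seed_set:
    for op_name, instruction in EVOLUTION_OPERATORS.items():
        evolved = generator.run(f"{instruction}\n\nOriginal:\n{seed}")
        # Keep provenance so a rejected candidate can be traced to its seed and operator.
        candidates.append((seed, op_name, evolved))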

Strengths. This is where synthetic data has the biggest leverage. Production traffic does not naturally surface multi-hop ambiguous adversarial questions at scale; you have to manufacture them. Anthropic’s red-team work on Claude and OpenAI’s adversarial training pipelines both rely on iterated evolution to scale beyond what human red-teamers can produce.

Weaknesses. Hallucinated edge cases. The generator invents scenarios that look hard but never occur in production. Time spent fixing imaginary bugs is not time spent fixing real bugs. The “elimination” operator (drop any evolved example the generator cannot itself answer in three retries) catches some of this; the hostile filter in the next section catches more.

Mitigation. Cross-reference the evolved set against the OWASP LLM Top 10, the published red-team literature, and your own incident log. Promote only the edge cases that match a real failure mode pattern. Treat the rest as exploratory and quarantine them outside the regression suite until production validates the failure mode is real.
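
The elimination step mentioned above (drop any evolved example the generator cannot answer in three retries) could look roughly like this. `answer_fn` is a hypothetical callable standing in for the generator call; it is assumed to return an answer string, or `None` when the model fails or refuses. The stub exists only to exercise the retry path.

```python
def eliminate_unanswerable(candidates, answer_fn, max_retries=3):
    """Keep only evolved questions the generator can answer within max_retries."""
    kept = []
    for question in candidates:
        for _ in range(max_retries):
            answer = answer_fn(question)
            if answer:  # any non-empty answer counts as success
                kept.append((question, answer))
                break
    return kept


# Stub generator: always fails on "unanswerable" questions, and succeeds on
# the second attempt for everything else.
attempts = {}


def stub_answer(question):
    attempts[question] = attempts.get(question, 0) + 1
    if "unanswerable" in question:
        return None
    return "stub answer" if attempts[question] >= 2 else None


kept = eliminate_unanswerable(["q1", "an unanswerable q"], stub_answer)
```

An answer existing is a weak signal on its own, so survivors of this step still go through the hostile judge.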

The seed-versus-diversity tension

Every synthetic-data run has a seed. The seed is a tiny set of high-quality real examples that anchor the generator. The seed is also the gravity well that pulls the generator toward a narrow region of the input space.

Too little seed: the generator drifts. The synthetic distribution wanders away from production and the eval becomes a benchmark on an imagined world.

Too much seed: the generator collapses. The synthetic distribution mirrors the seed’s idiosyncrasies (the same dialects, the same complexity profile, the same handful of intents) and the diversity gain is illusory.

The 2026 working pattern: start with 50-200 high-quality real examples as seed, ask the generator to produce 5-10x that with explicit instructions to vary along axes the seed under-represents, then measure diversity against the real reference distribution. Three metrics:

  1. Pairwise cosine distance. Embed every example with a consistent model (text-embedding-3-large, multilingual-e5-large). The synthetic distribution’s pairwise distance histogram should be at least as wide as the real reference. Tighter means mode collapse.
  2. Unigram and bigram entropy. Tokenize, count, compute entropy. Synthetic sets that collapsed onto the generator’s voice score 15-30 percent lower entropy than the real reference.
  3. Cluster count under HDBSCAN. Cluster the embeddings with min_cluster_size=5. Collapsed sets produce 30-50 percent fewer clusters than the real distribution.

Any one failing is reason to regenerate with a different seed strategy or a multi-model mix. The Future AGI synthetic data agent runs all three as gates inside its diversity-evaluation phase; the goal is to flag the collapse before the set ships.
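
Two of the three metrics can be sketched with the standard library alone. This is a simplified illustration, not a reference implementation: it tokenizes on whitespace, reports the mean pairwise distance rather than the full histogram, and leaves out the HDBSCAN cluster count because that needs a third-party clustering library. The function names are made up for the example.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_entropy(texts, n=1):
    """Shannon entropy (bits) of the n-gram distribution over all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # whitespace tokenization, for brevity
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_pairwise_cosine_distance(vectors):
    """Average of (1 - cosine similarity) over all pairs; assumes non-zero vectors."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)


collapsed = ["where is my refund"] * 4
varied = ["where is my refund", "my card was charged twice",
          "cannot log in today", "package never arrived"]

collapsed_entropy = ngram_entropy(collapsed)
varied_entropy = ngram_entropy(varied)
```

Run the same functions on the real reference set and on the synthetic set, then compare; a synthetic set that scores clearly lower on either metric is a candidate for regeneration.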

The filter pipeline: the work most posts skip

Generate 10x more than you need. Filter aggressively. Keep roughly 10 percent.

The pipeline has four stages, every one of them load-bearing:

1. Embedding-based deduplication. Compute embeddings for every candidate. Reject any candidate whose nearest neighbor in the existing accepted set is above cosine similarity 0.92 (tune per domain). String dedup catches verbatim duplicates and misses paraphrases; embedding dedup catches both. For a 10K candidate set this typically rejects 15-25 percent.

2. Hostile-evaluator LLM judge. A different model family from the generator. The judge scores each candidate against the task definition: faithfulness, intent clarity, plausibility, novelty against the seed. Reject anything below threshold (we use 0.7 on a 0-1 scale). For raw generations this typically rejects an additional 40-60 percent. The Future AGI ai-evaluation SDK ships CustomLLMJudge for exactly this filter; the grading_criteria field is a free-text rubric that turns into a Jinja2-templated prompt and a LiteLLM provider call so you can swap judge models without rewriting the rubric.

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

filter_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "synthetic_filter_hostile",
        "model": "gpt-5",  # Different family from the generator (Claude Sonnet 4.5)
        "grading_criteria": (
            "Reject this generated example if any of: (a) the question is implausible "
            "for a real customer to ask, (b) the question is stylistically collapsed onto "
            "a single voice, (c) the question duplicates an obvious pattern in the seed, "
            "(d) the expected answer is trivially derivable from the question text. "
            "Score 1.0 = keep, 0.0 = reject. Output a single number and one-line reason."
        ),
    },
)

3. Human spot-check. Sample 5 percent of survivors at random and label by hand. Mark plausibility and label correctness. Reject the entire batch if plausibility drops below 0.85. The 5 percent sample is cheap; the alternative is shipping bias the team did not notice.

4. Distribution calibration. Compare the surviving synthetic distribution against the real reference on unigram, bigram, embedding centroid, and cluster count. The synthetic distribution should sit within 1 standard deviation of the real reference on every metric, where the standard deviation is estimated by resampling the real set (for example, bootstrap subsamples the same size as the synthetic set). Material drift is a regenerate-with-different-seed signal.
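
Stage 1 of the pipeline above, greedy embedding deduplication, can be sketched as follows. It assumes embeddings are already computed and non-zero, and it uses a brute-force scan that is fine for illustration but quadratic in the number of candidates; at 10K or more candidates an approximate-nearest-neighbor index is the usual substitute. The `dedupe` name and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def dedupe(candidates, threshold=0.92):
    """Greedy dedup over (text, embedding) pairs.

    A candidate is accepted only if no already-accepted item has cosine
    similarity above `threshold`; order of arrival decides which copy wins.
    """
    accepted = []
    for text, vec in candidates:
        if all(cosine(vec, kept_vec) <= threshold for _, kept_vec in accepted):
            accepted.append((text, vec))
    return accepted


# Toy 2-d "embeddings": b is a near-paraphrase of a, c is unrelated.
toy = [("a", [1.0, 0.0]), ("b", [0.99, 0.05]), ("c", [0.0, 1.0])]
survivors = dedupe(toy)
```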

The 10x-generate / 10%-keep ratio is the production discipline. Most teams generate 1x and keep 100 percent because the filter feels expensive. The filter pays for itself the first time it catches a 30-percent-wrong batch before it contaminates the eval set.
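
The human spot-check stage reduces to a random sample and a batch-level gate. A minimal sketch, assuming reviewers return one boolean per sampled item (True means plausible); the helper names are hypothetical and the 5 percent and 0.85 defaults are the figures quoted in stage 3.

```python
import random


def spot_check_sample(survivors, fraction=0.05, seed=0):
    """Draw a reproducible random sample of survivors for human review."""
    if not survivors:
        return []
    k = max(1, round(len(survivors) * fraction))  # always review at least one item
    return random.Random(seed).sample(survivors, k)


def batch_passes(plausible_labels, min_plausible=0.85):
    """True if the share of human-approved items meets the gate."""
    return sum(plausible_labels) / len(plausible_labels) >= min_plausible


sample = spot_check_sample(list(range(1000)))
```

Fixing the sampling seed per batch lets a second reviewer audit exactly the same items.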

Calibration: synthetic versus real

The acid test is utility. Train or fine-tune on synthetic; measure performance on a held-out real set. If synthetic helps, it is good synthetic. If synthetic hurts, the set is overfitting to generator artifacts.

Three calibration checks before the synthetic set goes into a training run:

  • Distribution match. Compute the centroid distance between synthetic and real embeddings. Above 0.15 cosine distance is a regenerate signal; the synthetic set is in a different region of the input space than production.
  • Label calibration. If labels are LLM-generated, sample 5 percent and re-label with humans. Cohen’s kappa below 0.6 between the LLM labels and the human labels means the LLM-labelled set is encoding bias the team is not tracking.
  • Held-out utility. Train a small classifier on synthetic-only, then measure performance on a real-only held-out set. If synthetic-trained accuracy on real is materially lower than synthetic-trained accuracy on synthetic, the generator hallucinated a world the model now optimizes for.
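
The first two checks above are short enough to sketch without dependencies. This is an illustration under assumptions: embeddings are plain lists of floats, labels are hashable values, and the function names are made up for the example. The held-out utility check is omitted because it depends on the classifier you train.

```python
import math
from collections import Counter


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


synthetic_vecs = [[1.0, 0.0], [0.9, 0.1]]
real_vecs = [[0.0, 1.0], [0.1, 0.9]]
drift = cosine_distance(centroid(synthetic_vecs), centroid(real_vecs))
needs_regeneration = drift > 0.15  # threshold quoted in the first check
```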

Self-Instruct, whose pipeline Alpaca reused to produce its 52K instructions, already filtered generated instructions for overlap with the existing pool; LIMA went further and fine-tuned on 1K hand-curated examples that held up against models trained on much larger sets. Curation discipline is the lever that turns 10K raw generations into a 1K set worth a $50K fine-tune run.

Three use cases, three filter intensities

Eval set. Highest filter intensity. 200-500 survivors per route, hostile-evaluator threshold 0.8 or higher, human spot-check 10 percent. The eval set has to be right or the CI gate becomes theatre. The synthetic-test-data approach covers the gate-grade construction in more depth.

Fine-tune set. Mid filter intensity. 5K-50K survivors, hostile-evaluator threshold 0.7, human spot-check 1-2 percent. Volume matters more than for an eval set because the gradient averages noise. But unfiltered fine-tune sets ship stylistic bias the production model inherits and never sheds. The synthetic-data-for-fine-tuning post covers volume-vs-quality tradeoffs at the fine-tune scale.

Red-team set. Filter intensity inverts. Generate aggressively with hostile-evolution operators, but the “reject” criterion in the filter is “this is not a realistic attack pattern” rather than “this is too hard.” You want hard. The 8 sub-10ms Scanners in the ai-evaluation SDK (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) catch the deterministic adversarial patterns; the LLM judge catches the rest. The red-teaming step-by-step guide covers the operator design.
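
One way to keep the three intensities from drifting apart is to hold them in a single config object. A sketch only: the `FilterProfile` class is hypothetical, the eval and fine-tune numbers are the ones quoted above (with 2 percent picked from the 1-2 percent range), and the red-team numbers are placeholders because no figures are given for that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterProfile:
    judge_threshold: float      # minimum judge score to keep a candidate
    spot_check_fraction: float  # share of survivors sent to human review
    reject_when: str            # what the judge rubric is asked to reject


PROFILES = {
    "eval": FilterProfile(0.8, 0.10, "implausible, ambiguous, or mislabeled"),
    "fine_tune": FilterProfile(0.7, 0.02, "implausible or stylistically collapsed"),
    # Placeholder numbers: chosen for illustration, not taken from the text.
    "red_team": FilterProfile(0.7, 0.05, "not a realistic attack pattern"),
}


def keep(score, use_case):
    """Apply the use-case-specific judge threshold to one candidate score."""
    return score >= PROFILES[use_case].judge_threshold
```

The same judge score can then pass one gate and fail another, which is the point: a 0.75 candidate is acceptable fine-tune material but not eval-set material.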

The hidden risk: judge contamination

If the generator and the judge come from the same model family, the eval is measuring stylistic alignment to the teacher rather than task quality. This is the failure mode that ruins fine-tuning evaluations and almost no public guide names it.

The mechanism: GPT-5 generates synthetic training data. The team fine-tunes the production model on that data. The fine-tune now produces outputs in GPT-5’s dialect. The team then evaluates with a GPT-5-based judge. The judge scores the fine-tuned model higher than the un-fine-tuned baseline, but the lift is stylistic alignment to GPT-5, not task improvement. The team ships, real users notice nothing improved, and the synthetic-data project looks like a win on the dashboard and a wash in production.

Three disciplines that prevent contamination:

  1. Cross-family judge. Generator and judge come from different model families. Claude generates, GPT or Gemini judges, or rotate per batch.
  2. Human-labeled hold-out. A 100-200 case set labeled by humans, sampled from real users, that the judge has never seen. The judge’s score on this hold-out is the calibration anchor; if it drifts, the judge is no longer trustworthy.
  3. Independent eval pipeline. The fine-tune evaluation runs against a real-user-sourced eval set and a human-labeled hold-out, not against synthetic-data-derived examples. The synthetic set teaches the model; it does not grade the model.

Most synthetic-data project post-mortems trace failure to one of these three. The fix is cheap; the discipline is the work.
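
The cross-family rule is easy to enforce in code rather than by convention. A minimal sketch, assuming model names start with a recognizable prefix; the prefix map, the `pick_judge` helper, and the example model strings are hypothetical and would need extending for the models you actually run.

```python
# Hypothetical prefix-to-family map.
FAMILY_PREFIXES = {"gpt": "openai", "claude": "anthropic", "gemini": "google"}


def model_family(model_name):
    """Map a model name to its family, or fail loudly if it is unknown."""
    name = model_name.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if name.startswith(prefix):
            return family
    raise ValueError(f"unknown model family for {model_name!r}")


def pick_judge(generator_model, judge_pool, batch_index):
    """Rotate through judges that do not share a family with the generator."""
    generator_family = model_family(generator_model)
    eligible = [m for m in judge_pool if model_family(m) != generator_family]
    if not eligible:
        raise ValueError("no cross-family judge available")
    return eligible[batch_index % len(eligible)]


pool = ["claude-haiku", "gpt-5", "gemini-pro"]
judges = [pick_judge("claude-sonnet-4.5", pool, i) for i in range(3)]
```

Raising on an unknown model name is deliberate: silently treating an unrecognized model as cross-family is how a same-family judge slips through.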

How Future AGI fits a synthetic-data workflow

The Future AGI Platform ships a synthetic data agent in the EE tier that runs the ten-dimension taxonomy, parallel plan execution, and a validation-and-repair loop; together these productionize the workflow this post describes. The pipeline phases (input validation → planning → batch generation → quality validation → diversity evaluation → distribution correction) map onto the three patterns and four-stage filter described above. Schema-only, reference-data, and knowledge-base modes cover the three seed strategies.

The eval stack package handles the filter and the calibration:

  • ai-evaluation SDK (Apache 2.0). from fi.evals import Evaluator, Protect. CustomLLMJudge is the hostile-evaluator filter (a different model family from the generator via LiteLLM swap). 72 EvalTemplate classes for downstream task scoring (Groundedness, ContextAdherence, FactualAccuracy, Completeness, Toxicity, PromptInjection, TaskCompletion, EvaluateFunctionCalling). 8 sub-10ms Scanners for adversarial-set safety filtering. 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B; 4 API). Four distributed runners (Celery, Ray, Temporal, Kubernetes) for grading 100K+ candidates without melting a single worker. AutoEval pipelines build the rubric from a natural-language brief.
  • Future AGI Platform. The in-product agent authors unlimited custom evaluators from natural-language description; describe a rubric, the agent writes the grading prompt plus reference examples, you grade the synthetic set with it. Self-improving evaluators retune from thumbs feedback so the filter ages with the product rather than decaying. Classifier-backed scoring runs at lower per-eval cost than Galileo Luna-2, which is what makes a 100K-candidate filter pass financially viable rather than a quarterly conversation.
  • Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse-stored embeddings identifies which failure modes the synthetic set should next target. A Claude Sonnet 4.5 Judge agent writes the immediate_fix and the failure modes feed back into the synthetic-data brief, so the next generation pass closes the gap the last pass left open.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python (46 packages) / TypeScript (39) / Java (24) / C#. Useful for tracking which synthetic inputs the production model actually fails on once shipped, so the next generation cycle targets the live failure modes rather than imagined ones.

The eval stack is the filter; the synthetic data agent is the generator; the closed loop between them sharpens the set quarter over quarter rather than letting it decay.

pip install ai-evaluation, build the persona/taxonomy/evolution generator from the patterns above, point CustomLLMJudge at a cross-family judge model, run the four-stage filter on every batch. The 10 percent that survives is the regression suite you ship.

Frequently asked questions

What is the single biggest mistake teams make generating synthetic data with LLMs?
Trusting the first pass. The teacher model returns 10,000 examples that look plausible, the team labels the run a success, and 30 percent of those examples are wrong, near-duplicates, or stylistically collapsed onto a single voice. The synthetic set ships, the fine-tune trains on bias the team never noticed, and the eval scores climb on a benchmark the model is now overfit to. The fix is the discipline that almost no public guide writes down: generate 10x more examples than you need, then filter aggressively with a judge that is not the generator, a deduplicator that operates on embeddings not strings, and a human spot-check on 5 percent of survivors. The output set is roughly 10 percent of the raw generation. That 90 percent rejection rate is the work.
Why not just use the same model to generate and judge?
Because the judge will accept its own failure modes. If GPT-5 generates a question and GPT-5 grades whether the question is faithful, the grader is blind to exactly the artifacts the generator produced: the same vocabulary preferences, the same stereotypical persona renderings, the same shortcut reasoning paths. Inter-rater reliability between same-model judge and human is consistently 15-25 points lower than the cross-model setup on subjective rubrics. Use a different family for the judge (Anthropic generates, OpenAI or Gemini judges) and route the disagreements to a human queue.
What are the three generation patterns that actually work in 2026?
Persona-prompted, taxonomy-stratified, and hostile-evolution. Persona-prompted seeds the generator with a demographically and behaviorally specific user (built from real customer interviews, not the LLM's defaults) and asks it to produce inputs in that user's voice. Taxonomy-stratified slices the input space into a multi-dimensional grid (intent x complexity x sentiment x dialect) and generates a quota per cell, which beats unconstrained prompting on coverage by 30-40 percent in our internal measurements. Hostile-evolution adapts Microsoft's Evol-Instruct: take a seed example, ask the generator to make it harder, ambiguous, multi-hop, or adversarial, then run the survivor through a hostile judge that throws out anything the original task definition no longer covers. Mix all three; using one pattern for everything is how you get mode collapse.
How big a synthetic set is enough?
Depends on the use case. For an eval set, 200-500 high-quality survivors per route beat 5,000 unfiltered. For fine-tuning, the Self-Instruct paper shipped at 52K instructions and Alpaca at 52K; the LIMA paper showed 1K curated examples beat many of those larger sets on instruction-following. The pattern: generate aggressively (10K-100K), filter aggressively (10-20 percent acceptance rate), and ship the surviving curated set. Quality wins by a factor of 10-50 over raw count past the first few hundred examples. The filter is where the win lives.
How do I detect mode collapse before it ships?
Three measurements. (1) Pairwise embedding cosine distance distribution: if the synthetic set is tighter in latent space than the real set for the same intent, the generator collapsed onto a preferred style. (2) Unigram and bigram entropy: synthetic sets with mode collapse score 15-30 percent lower entropy than the real reference set. (3) Cluster count under HDBSCAN at fixed min_cluster_size: collapsed sets produce 30-50 percent fewer clusters than the real distribution. Run all three before signing off; any one failing is reason to regenerate with a different seed strategy or a multi-model mix.
Can synthetic data contaminate the judge in production?
Yes, and this is the failure mode that ruins fine-tuning evaluations. If you fine-tune the production model on synthetic data generated by GPT-5, and you then evaluate that fine-tune with a GPT-5-based judge, the judge will systematically score the fine-tuned model higher because the model now produces outputs in GPT-5's stylistic dialect. The eval is measuring stylistic alignment to the teacher, not task quality. The fix is judge isolation: the judge model family must be different from the generator model family, the human-labeled hold-out set must come from real users not synthetic data, and the calibration check between judge and human runs weekly on traces the judge has never seen. Without this discipline, the synthetic-data win is illusory.
What does Future AGI ship for synthetic-data workflows?
An eval stack first, with the generator side covered by the Platform’s synthetic data agent. The ai-evaluation SDK (Apache 2.0) ships CustomLLMJudge for the hostile-evaluator filter that rejects 40-60 percent of deduplicated candidates, 72 EvalTemplate classes (Groundedness, ContextAdherence, FactualAccuracy, Toxicity, PromptInjection, TaskCompletion, EvaluateFunctionCalling) for downstream scoring, 8 sub-10ms Scanners for adversarial set safety filtering, and four distributed runners (Celery, Ray, Temporal, Kubernetes) for batch grading of 100K+ candidates. The Future AGI Platform layers on an in-product agent that authors unlimited custom evaluators from natural-language description, self-improving evaluators that retune on production feedback, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2 for the high-volume filtering pass. Error Feed sits inside the eval stack to cluster the failures your synthetic set should next target.