Autoresearch LLM Test Generation in 2026: Generate, Judge, Keep the Survivors
Autoresearch LLM test generation as a hostile-judge loop: persona x scenario x adversarial evolution, cross-family judge, keep top 10 percent.
Table of Contents
Most autoresearch test-generation guides stop at “wire a research agent to your corpus and let it write tests.” Six weeks later the team has 5,000 candidates, 30 percent are wrong or near-duplicates, the eval set ships anyway, and the next release passes CI without exercising a single real failure mode. Retrieval and synthesis is the easy half. The hostile judge that throws out 70 to 90 percent of candidates is where the test set succeeds or fails.
Autoresearch LLM test generation in 2026 is a generate-judge-keep loop. The opinion this post earns: generate 10x what you need; keep the 10 percent that survives a cross-family judge. Persona, scenario, and adversarial-evolution are the three axes that fill coverage. The filter pipeline (cross-family judge, mode-collapse detection, distribution calibration) separates a regression suite that compounds from a synthetic set that ships bias nobody noticed.
TL;DR: the three-axis loop and the filter that does the real work
| Axis | What it produces | Where it falls short |
|---|---|---|
| Persona-prompted | Realistic phrasing variation across user types | Drifts to caricature without real user research |
| Scenario-driven | Multi-turn trajectories, tool calls, recovery paths | Misses adversarial tail unless paired with evolution |
| Adversarial-evolution | Multi-hop, ambiguous, hostile-rephrased edge cases | Hallucinates edge cases that never occur in production |
After generation: deduplicate on embeddings, reject anything a cross-family judge marks failing, calibrate the survivor distribution against a real reference, and human-spot-check 5 percent of survivors. Most teams skip three of the four steps and ship anyway.
Why most autogenerated test sets fail
Two failure modes show up in almost every post-mortem.
The first is no real filter. The team wires an autoresearch agent to a corpus, generates 5,000 candidates, runs a light “is this well-formed” check, and ships. Roughly 30 percent are factually wrong, near-duplicates, or off-distribution. The eval set passes CI on every release because the items are too noisy to catch real regressions. Production failures still surprise the team because the set was never exercising the right surface area.
The second is mode collapse. The generator’s stylistic preferences dominate. The synthetic set has 30 to 50 percent fewer HDBSCAN clusters than a real reference distribution and pairwise embedding cosine distance is materially tighter than production. The eval becomes a test of how well the model matches the teacher’s voice rather than task quality. Six months later the team fine-tunes on the same set, the model now produces outputs in the teacher’s dialect, and the eval scores climb on a benchmark the model is now stylistically aligned to.
Both failures share a root cause: treating autoresearch as a one-shot prompt rather than a generate-judge-calibrate pipeline.
The three-axis generation pattern
Three axes, run together. One axis alone collapses on the others.
Axis 1: persona-prompted
Seed the generator with a synthetic user persona (demographics, intent, sophistication, mood, dialect) and ask it to produce inputs in that persona’s voice. The output is a stream of intent variations that reflect how different users actually phrase the same underlying request.
PERSONA_TEMPLATE = """
You are simulating: {persona_name}, a {age}-year-old {profession} in {region}.
Communication style: {style}. Tech literacy: {literacy}. Emotional state: {mood}.
Generate three realistic questions this person would ask a {product} support agent
about {topic}. Use vocabulary, sentence structure, and idiom that matches the persona.
Do not generalize. Write the question, not 'as a [profession], I would ask...'.
"""
What works. Personas built from real user research (support transcripts, interview notes, segmented analytics) produce intent variation pure paraphrase cannot. A “frustrated Gen Z user on mobile” and a “methodical Gen X user on a desktop” phrase the same request in radically different ways; the production model has to handle both.
Where it falls short. Personas built from the LLM’s defaults drift to caricature. The model’s idea of “frustrated user” is a stereotype; its idea of “technical user” is everyone who writes in monospace. Build the persona library from real user research first, sample-label 50 generated inputs per persona, and reject the personas that produce caricature before promoting them to a production batch.
Axis 2: scenario-driven
Single-prompt tests do not exercise agents that branch, loop, and call tools. Scenario-driven generation produces multi-turn conversation scripts with expectations per turn, then runs the agent against them.
The Future AGI simulate-sdk uses Persona plus Scenario as its primitive: Scenario(description, goals, turns, ...) carries the multi-turn flow and assertions; TestRunner(agent_wrapper, personas, scenarios) runs the agent against every persona-scenario combination and emits a TestReport. Wrappers for OpenAI, LangChain, Gemini, and Anthropic make the underlying stack swappable.
from fi.simulate import Persona, Scenario, TestRunner, OpenAIAgentWrapper
personas = [
Persona(name="frustrated_customer", traits={"tone": "impatient"}),
Persona(name="technical_user", traits={"knowledge_level": "expert"}),
]
scenarios = [
Scenario(description="Order status inquiry",
goals=["provide order #", "give ETA", "handle missing tracking"]),
Scenario(description="Refund escalation",
goals=["understand reason", "check policy", "route to human if needed"]),
]
runner = TestRunner(agent_wrapper=OpenAIAgentWrapper(agent_def),
personas=personas, scenarios=scenarios)
report = runner.run()
What works. Trajectory-shaped tests catch failures single-prompt tests miss: a tool call that returns the wrong shape, a retrieval step that drops context across turns, a refusal calibration that softens after pressure. The persona x scenario product gives a coverage matrix you can audit cell by cell.
Where it falls short. Hand-written scenarios cover the routes you already know about. Use ScenarioGenerator(llm, num_scenarios) to auto-generate diverse scripts from a seed description, then run survivors through the hostile judge before promoting them.
Axis 3: adversarial-evolution
Take a seed example, ask the generator to make it harder, then judge whether the evolved version still belongs to the original task. The Evol-Instruct paper formalized three operators; the 2026 production version adds a fourth for red-team coverage.
EVOLUTION_OPERATORS = {
"deepen": "Rewrite this question so it requires multi-step reasoning and one extra lookup.",
"broaden": "Rewrite this question to cover an adjacent intent the original did not address.",
"obscure": "Rewrite this question with deliberate ambiguity that admits two valid answers.",
"adversarial": "Rewrite this question as an attempted jailbreak that smuggles a policy violation past the original frame.",
}
for seed in seed_set:
for op_name, instruction in EVOLUTION_OPERATORS.items():
evolved = generator.run(f"{instruction}\n\nOriginal:\n{seed}")
candidates.append((seed, op_name, evolved))
What works. Production traffic does not naturally surface multi-hop ambiguous adversarial questions at scale. You manufacture them. Iterated evolution is the only way to get adversarial edge-case coverage at volume without burning out human red-teamers.
Where it falls short. Hallucinated edge cases. The generator invents scenarios that look hard but never occur in production. Cross-reference the evolved set against the OWASP LLM Top 10, the published red-team literature, and your own incident log. Promote the edge cases that match a real failure-mode pattern; quarantine the rest until production validates them.
The hostile-judge filter
Generate 10x what you need. Filter aggressively. Keep roughly 10 percent. Four stages, every one of them load-bearing.
1. Embedding-based deduplication. Compute embeddings for every candidate. Reject any candidate whose nearest neighbor in the accepted set is above cosine similarity 0.92 (tune per domain). String dedup catches verbatim duplicates and misses paraphrases; embedding dedup catches both. For a 10K candidate set this typically rejects 15 to 25 percent.
2. Cross-family LLM judge. The load-bearing step. Pin a judge from a different model family than the generator: Claude generates, GPT or Gemini judges, or rotate per batch. Inter-rater reliability between same-family judge and human runs 15 to 25 points lower than the cross-family setup on subjective rubrics; the cost of the cross-family judge is the price of a filter you can trust. The Future AGI ai-evaluation SDK ships CustomLLMJudge for this filter; the grading_criteria field is a free-text rubric that turns into a templated prompt and a LiteLLM provider call so you can swap judge models without rewriting the rubric.
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
filter_judge = CustomLLMJudge(
provider=LiteLLMProvider(),
config={
"name": "autoresearch_filter_hostile",
"model": "gpt-5", # different family from the generator (Claude Sonnet 4.5)
"grading_criteria": (
"Reject this generated test if any of: (a) the question is implausible for a "
"real user to ask, (b) the expected answer is not recoverable from the cited "
"source passage, (c) the question stylistically duplicates an obvious pattern "
"in the seed, (d) the expected answer is trivially derivable from the question "
"text. Score 1.0 = keep, 0.0 = reject. Output a single number and one-line reason."
),
},
)
3. Mode-collapse detection. Three measurements, run as a pre-ship gate. Pairwise embedding cosine distance distribution against a real reference; unigram and bigram entropy; HDBSCAN cluster count at min_cluster_size=5. Any one materially below the real distribution is a regenerate signal, not a green light.
4. Human spot-check. Sample 5 percent of survivors at random and label by hand. Reject the batch if plausibility drops below 0.85. The 5 percent sample is cheap; the alternative is shipping bias nobody noticed.
For raw three-axis generation this pipeline typically rejects 70 to 90 percent of candidates. Most teams flinch the first time they see the rejection rate. The filter pays for itself the first time it catches a 30-percent-wrong batch before it contaminates the eval set.

Coverage discipline: taxonomy-stratified, not random
Even with three axes and a hostile judge, unconstrained generation oversamples the easy quadrant. Stratify the input space into a multi-dimensional grid and assign a quota per cell.
TAXONOMY = {
"intent": ["refund", "shipping", "account", "technical", "billing"],
"complexity": ["simple", "compound", "ambiguous", "multi-hop"],
"sentiment": ["calm", "frustrated", "panicked", "skeptical"],
"dialect": ["formal_us", "casual_us", "indian_english", "non_native"],
}
# Cartesian product = 5 x 4 x 4 x 4 = 320 cells. Generate 5-10 examples per cell.
for cell in itertools.product(*TAXONOMY.values()):
generate_batch(cell, n=8)
A taxonomy gap is visible (cell N has zero examples); a coverage gap in unstructured generation is invisible until production. Build the taxonomy from real production traces, not imagination: the dimensions should match where real traffic varies, and the values within each dimension should match what real traffic actually contains. Cells with no real-world analog (a “panicked formal_academic” cell in support traffic) get dropped, not filled.
Stratified survival surfaces the cluster-level failures aggregate scores hide. A set that scores 78 percent overall but 42 percent on the high-risk cluster is a different story from one that scores 78 percent uniformly. The CI gate should fail the second.
Calibration: synthetic versus real
The acid test is utility on real data. Three checks before survivors go into a CI gate or a fine-tune run.
- Distribution match. Centroid distance between synthetic and real embeddings. Above 0.15 cosine distance is a regenerate signal; the survivor set is in a different region of the input space than production.
- Label calibration. If labels are LLM-generated, sample 5 percent and re-label with humans. Inter-rater reliability below kappa 0.6 means the labels encode bias the team is not tracking.
- Held-out utility. If the set gates fine-tunes, train a small classifier on synthetic-only and measure performance on a real-only held-out set. If synthetic-trained accuracy on real is materially lower than on synthetic, the generator hallucinated a world the model now optimizes for.
The Self-Instruct paper that bootstrapped Alpaca’s 52K instructions ran all three; LIMA, which beat many of those larger sets with 1K curated examples, ran them more aggressively. Curation discipline is the lever that turns 10K raw generations into a 1K survivor set worth a CI gate.
Three use cases, three filter intensities
The same three-axis loop ships in three shapes. The filter intensity is what changes.
Eval set. Highest filter intensity. 200 to 500 survivors per route, hostile-judge threshold 0.8 or higher, human spot-check 10 percent. The set has to be right or the CI gate becomes theatre. Citation tracking is mandatory: every test carries a pointer to the source passage that produced it.
Fine-tune set. Mid filter intensity. 5K to 50K survivors, hostile-judge threshold 0.7, human spot-check 1 to 2 percent. Volume matters more because the gradient averages noise. But unfiltered fine-tune sets ship stylistic bias the production model inherits and never sheds; the filter prevents the teacher’s dialect from becoming the student’s identity.
Red-team set. Filter intensity inverts. Run the adversarial-evolution operators aggressively. The “reject” criterion becomes “this is not a realistic attack pattern” rather than “this is too hard.” You want hard. Cross-reference against the OWASP LLM Top 10 and your own incident log. Anything that does not match a real failure-mode pattern stays in a quarantine slice until production validates it.
Same generator, same judge. What changes is the rubric and the volume target.
Common mistakes when wiring an autoresearch loop
- No cross-family judge. Generator and judge from the same family. The judge approves the generator’s failure modes.
- No mode-collapse gate. No entropy or cluster-count check before ship. Collapse only becomes visible months after the eval set is in production.
- Aggregate scores only. No stratification. A 78 percent score that hides 42 percent on the high-risk slice ships green.
- No citation tracking. Tests without pointers to source passages cannot be re-validated when the corpus updates.
- No held-out validation slice. The eval set is overfit to its own generation procedure.
- Treating autoresearch as a hand-label replacement. The high-stakes 10 percent still need human curation; autoresearch is volume, not judgement.
- Running once, never again. Production drift stales the set within months. The loop is recurring, not one-shot.
Production wiring: how to ship this in CI
- Periodic regeneration. Nightly or weekly job pulls fresh source documents and emits a candidate batch from the three-axis loop.
- Filter pass. Embedding dedup, cross-family judge, mode-collapse gate, human spot-check sample. Rejection rate logged per batch.
- Promotion gate. A human reviewer or a calibrated auto-approver promotes survivors to production.
- Versioning. Every set version pinned to its source manifest and the judge model that approved it.
- Stratification report. Per-cluster scores against the survivor taxonomy, not just the aggregate.
- Trace integration. Span-attached eval scores feed back into the tracing stack so the failure cohort surfaces in the same observability plane.
See synthetic test data for LLM evaluation for the broader synthetic-data discipline this fits inside, and how to generate synthetic data with LLMs for the seed-versus-diversity tension at the generator level.
Where Future AGI fits
The Future AGI eval stack runs three moving parts of this loop on a single Apache 2.0 plane.
- simulate-sdk for the Persona plus Scenario axis.
TestRunner(agent_wrapper, personas, scenarios)runs the agent against every persona-scenario product and emits aTestReportwith pass rate, failing transcripts, and trajectory traces.ScenarioGenerator(llm, num_scenarios)auto-generates diverse scripts from a seed. Wrappers for OpenAI, LangChain, Gemini, and Anthropic keep the underlying agent swappable when the judge scores outputs from a different family than the generator. - ai-evaluation SDK (Apache 2.0) for the hostile-judge filter and downstream rubric scoring.
CustomLLMJudgeplusLiteLLMProvideris the cross-family filter; swap the model field to swap families without rewriting the rubric. 50+ prebuiltEvalTemplateclasses (Groundedness,ContextAdherence,FactualAccuracy,Completeness,PromptInjection,TaskCompletion,EvaluateFunctionCalling) handle post-survivor scoring, and error localization names which input field caused a failure. - traceAI for the closed loop back to production. Span-attached eval scores export to the same plane production traffic flows through. When real users hit a failure the synthetic set missed, the next generation pass targets the live failure rather than an imagined one.
- Future AGI Platform when the in-product agent matters. The platform layers in-product evaluator authoring (describe a rubric, the agent writes the grading prompt) and self-improving evaluators that retune from thumbs feedback so the filter ages with the product instead of decaying.
Honest tradeoff: if you already have an autoresearch scaffold in DSPy or Open Deep Research and only need the cross-family judge, pip install ai-evaluation and point CustomLLMJudge at your pipeline. If you are building the loop from scratch and want persona-scenario plus the filter on one plane, simulate-sdk plus ai-evaluation is the lower-friction path.
Sources
- Open Deep Research repo
- GPT Researcher repo
- DeepEval Synthesizer docs
- DSPy docs
- Tavily research API
- Sainz et al. — NLP Evaluation in Trouble (arXiv 2310.18018)
- Training Data Leakage in Multiple-Choice Benchmarks (arXiv 2505.24263)
- Future AGI ai-evaluation
- Future AGI evaluate
Related reading
Frequently asked questions
What is autoresearch LLM test generation in one sentence?
Why use a different model family for the judge instead of the same one that generated the test?
Three axes: persona, scenario, adversarial-evolution. Why all three?
How big should an autoresearch-generated eval set be?
How do I detect mode collapse in an autoresearch-generated set before it ships?
What is the contamination risk with autoresearch on private corpora?
Where does Future AGI fit in an autoresearch test-generation stack?
How to generate synthetic test data for LLM evals: contexts, evolutions, personas, contamination checks, and the OSS tools that do it well in 2026.
Best LLM dataset management tools in 2026: eval-coupled (Future AGI, Braintrust, LangSmith), annotation-first (Argilla), and generic ML (W&B, HF) compared.
Best LLMs May 2026: compare GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and DeepSeek V4 across coding, agents, multimodal, cost, and open weights.