Automated Prompt Improvement in 2026: Six Optimizers, Six Different Jobs

Automated prompt improvement in 2026 with six named optimizers (ProTeGi, GEPA, PromptWizard, MetaPrompt, BayesianSearch, RandomSearch) wired into CI.

Manual prompt tuning is craft. Automated prompt improvement is engineering. The six named optimizers shipping in 2026 (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard) are six different jobs, not six versions of the same job, and the difference between a useful optimization pass and a benchmark-hack is picking the right one for your search space, eval signal, and budget.

This post walks through each optimizer, where each fits, how to define the eval that bounds them, and how to wire the loop into CI without shipping a prompt that only beats its own training set. All six are implemented in fi.opt.optimizers (the agent-opt module inside Future AGI’s stack); the names below match the import paths.

When manual prompt tuning hits a wall

You know the wall when you see it. The prompt scores 71% on your eval set after two weeks of edits. The next hand-edit moves the score by 0.4 points. Three engineers each have a different theory about how the system message should be reworded. None of the theories are wrong, none are right, and the team is out of structured ways to make progress.

Three things changed between 2023 and 2026 that turned this wall from a place where you give up into a place where an optimizer takes over. Distilled judge models (small evaluators fine-tuned on calibration data) made each evaluation cheap enough to run thousands of judge calls in a single pass. Reflective search procedures like GEPA cut the number of rollouts needed for a meaningful gain by an order of magnitude. And the prompt-as-structured-object idea (templates with named slots, few-shot blocks as first-class members) gave the optimizer a search space narrower than “any string”.

The result: a prompt optimization pass that used to cost a meaningful fraction of a fine-tuning run now costs less than the engineer-hours it replaces. That’s the economic shift. The technical shift is that the six algorithms below cover meaningfully different jobs, and treating them as interchangeable is the most common mistake.

The six optimizers: what they do and when to use each

Every algorithm below has the same shape: take an initial prompt, generate candidates, score each candidate against a dataset using an Evaluator, keep the winner. The difference is how candidates are generated and what the search procedure assumes about the reward signal.
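The shared shape can be sketched in a few lines. This is a minimal illustration, not the agent-opt API: `optimize`, `toy_propose`, and `toy_score` are hypothetical stand-ins for the search procedure, a teacher model, and an Evaluator run over a dataset.

```python
from typing import Callable, Sequence, Tuple

def optimize(
    seed_prompt: str,
    propose: Callable[[str], Sequence[str]],  # hypothetical candidate generator
    score: Callable[[str], float],            # hypothetical evaluator over a dataset
    rounds: int = 3,
) -> Tuple[str, float]:
    """Generate candidates, score each one, keep the winner."""
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        for candidate in propose(best_prompt):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score

# Toy stand-ins: the "evaluator" rewards prompts that pin down the output format.
def toy_propose(prompt: str) -> list:
    return [prompt + " Answer in JSON.", prompt + " Be brief."]

def toy_score(prompt: str) -> float:
    return 1.0 if "JSON" in prompt else 0.5

best, best_score = optimize("Classify the ticket.", toy_propose, toy_score, rounds=1)
```

Each optimizer below effectively swaps in a different `propose` and a different rule for which candidates survive between rounds.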

1. RandomSearch: the baseline that pays for itself

RandomSearchOptimizer asks a teacher model (defaults to a strong general-purpose LLM) to produce N variations of the seed prompt, evaluates each on the dataset, and keeps the highest scorer. The signature is small: num_variations, teacher_model, eval_template, and an optional EarlyStoppingConfig.

Why start here. RandomSearch answers the question you actually need answered before reaching for anything more expensive: is the wording of the prompt the bottleneck? If RandomSearch moves the score by 3-5 points, the prompt has headroom and a directed optimizer will move it further. If RandomSearch doesn’t move the score at all, the wording isn’t the issue (the eval, the data, or the model is) and no optimizer in this list will save you.

Use when. You’re starting fresh, you want a baseline, or you’ve inherited a prompt and don’t yet know what range of scores is reachable.

2. BayesianSearch: when each evaluation is expensive

BayesianSearchOptimizer wraps Optuna’s TPE sampler over a constrained search space: how many few-shot examples to include (min_examples to max_examples), which examples to pick, and how to format them. The instruction text itself isn’t mutated; the search space is the few-shot block.

Two features matter in production. First, an optional teacher model can infer the right example_template from a sample of your dataset (infer_example_template_via_teacher=True), so you don’t hand-author a format string. Second, the trials are resumable: pass storage (an Optuna storage URL, like a SQLite or Postgres backend) and study_name, and a run that hits its budget can be resumed later with load_if_exists, accumulating trials across days.

Use when. Each evaluation is expensive (frontier judge, long generations, large dataset) and you can’t afford to throw away trials. The Bayesian sampler converges faster than random for small expensive search spaces; the resumability lets you spread cost across CI runs.
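To make the resumability idea concrete, here is a standard-library sketch of a search over the few-shot block whose trials persist between runs. It is deliberately not the BayesianSearchOptimizer or Optuna API: plain random sampling stands in for the TPE sampler, a SQLite table stands in for Optuna storage, and `run_trials` and `toy_objective` are hypothetical names.

```python
import os
import random
import sqlite3
import tempfile

def run_trials(db_path, study_name, n_trials, examples, objective,
               min_examples=1, max_examples=3, seed=0):
    """Add n_trials to a persistent log; return (best_picks, best_score, total_trials)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS trials (study TEXT, picks TEXT, score REAL)")
    done = conn.execute(
        "SELECT COUNT(*) FROM trials WHERE study = ?", (study_name,)
    ).fetchone()[0]
    rng = random.Random(seed + done)  # a resumed run draws fresh samples
    for _ in range(n_trials):
        k = rng.randint(min_examples, max_examples)          # how many examples
        picks = sorted(rng.sample(range(len(examples)), k))  # which examples
        score = objective([examples[i] for i in picks])
        conn.execute(
            "INSERT INTO trials VALUES (?, ?, ?)",
            (study_name, ",".join(map(str, picks)), score),
        )
    conn.commit()
    total = conn.execute(
        "SELECT COUNT(*) FROM trials WHERE study = ?", (study_name,)
    ).fetchone()[0]
    best_picks, best_score = conn.execute(
        "SELECT picks, score FROM trials WHERE study = ? ORDER BY score DESC LIMIT 1",
        (study_name,),
    ).fetchone()
    conn.close()
    return best_picks, best_score, total

def toy_objective(shots):
    # Toy evaluator: fraction of chosen few-shot examples that are "helpful".
    helpful = {"refund", "billing"}
    return len(helpful & set(shots)) / len(shots)

db = os.path.join(tempfile.mkdtemp(), "trials.db")
examples = ["refund", "login", "billing", "shipping"]
first = run_trials(db, "fewshot", 5, examples, toy_objective)   # first CI run
second = run_trials(db, "fewshot", 5, examples, toy_objective)  # resumed later
```

The second call sees the five stored trials and adds five more, so the best score can only stay the same or improve across invocations. A model-based sampler would additionally use the stored history to choose the next trial.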

3. MetaPrompt: the senior engineer in a loop

MetaPromptOptimizer is the most human-shaped of the six. Each round, a teacher LLM reads the current prompt, the previous failed attempts, and the annotated results (which examples failed and how badly), then produces two things: a hypothesis (“the prompt is too vague about output format”) and a fully rewritten improved_prompt. The optimizer evaluates the new prompt; if it scores higher, it becomes the current best.

The hypothesis field is the part that earns its keep. Every iteration produces a debuggable artifact: a sentence explaining what the optimizer changed and why. That’s usable in code review and it’s auditable when an output regresses.

Use when. You want one strong rewrite per round rather than a population, on smaller datasets (the default eval_subset_size is 40) and shorter optimization runs (the default is 5 rounds). MetaPrompt also fits the case where the prompt is short and the failures are conceptual (wrong reasoning approach, wrong output schema) rather than fine-grained.
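A rough sketch of the hypothesis-then-rewrite loop, assuming a teacher that returns both fields. The `Rewrite` class, `toy_teacher`, and `toy_evaluate` are hypothetical; a real teacher would be an LLM call and a real evaluator would score a dataset subset.

```python
from dataclasses import dataclass

@dataclass
class Rewrite:
    hypothesis: str        # one sentence: what is wrong with the current prompt
    improved_prompt: str   # full rewrite that acts on the hypothesis

def toy_teacher(prompt, failures):
    # Hypothetical teacher: reads failure notes and proposes one rewrite.
    if any("format" in note for note in failures):
        return Rewrite("The prompt is too vague about output format.",
                       prompt + " Reply with one label only.")
    return Rewrite("No clear failure pattern.", prompt)

def toy_evaluate(prompt):
    """Return (score, failure notes) for a prompt."""
    if "one label only" in prompt:
        return 0.9, []
    return 0.6, ["format: model returned a paragraph"]

def meta_prompt_loop(seed, rounds=5):
    best = seed
    best_score, failures = toy_evaluate(seed)
    audit_log = []
    for round_no in range(1, rounds + 1):
        rewrite = toy_teacher(best, failures)
        score, new_failures = toy_evaluate(rewrite.improved_prompt)
        accepted = score > best_score
        # The log is the debuggable artifact: one hypothesis per round.
        audit_log.append({"round": round_no, "hypothesis": rewrite.hypothesis,
                          "score": score, "accepted": accepted})
        if accepted:
            best, best_score, failures = rewrite.improved_prompt, score, new_failures
    return best, best_score, audit_log

best, best_score, audit_log = meta_prompt_loop("Classify the ticket.", rounds=2)
```

The per-round log is what makes a run reviewable: each entry pairs a stated hypothesis with whether acting on it actually raised the score.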

4. ProTeGi: text as gradients

ProTeGi is the algorithm that made “text as gradients” stick. Each round has three stages: (1) sample failing examples from the current best prompt; (2) the teacher LLM writes num_gradients distinct critiques explaining why the prompt failed on those examples; (3) for each critique, the teacher produces prompts_per_gradient improved variants. A beam of size beam_size keeps the top candidates across rounds; paraphrase augmentation prevents the beam from collapsing to near-duplicates.

The default config (num_gradients=4, errors_per_gradient=4, prompts_per_gradient=1, beam_size=4) generates roughly 4-8 candidates per round. With 3 rounds and a 32-item eval subset, that’s a few hundred eval calls. Modest.

Use when. Failures cluster into nameable critiques. ProTeGi shines on classification-shaped tasks (intent, sentiment, structured extraction) where “the prompt failed because it didn’t distinguish X from Y” is a writable sentence. It struggles on long-form open-ended generation where critiques are vague.
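The critique-expand-prune cycle can be sketched as a small beam search. This is an illustration under simplifying assumptions, not the ProTeGi implementation in agent-opt: failure sampling and paraphrase augmentation are omitted, and `protegi_round`, `toy_critique`, `toy_apply`, and `toy_score` are hypothetical.

```python
LABEL_HINTS = [
    "Distinguish refund from billing.",
    "Return exactly one label.",
    "Treat greetings as 'other'.",
    "Ignore email signatures.",
]

def toy_critique(prompt, num_gradients):
    # Hypothetical teacher: each "gradient" names one instruction the prompt lacks.
    return [hint for hint in LABEL_HINTS if hint not in prompt][:num_gradients]

def toy_apply(prompt, gradient):
    # Hypothetical teacher edit: act on the critique by adding the instruction.
    return f"{prompt} {gradient}"

def toy_score(prompt):
    return sum(hint in prompt for hint in LABEL_HINTS) / len(LABEL_HINTS)

def protegi_round(beam, critique, apply_critique, score,
                  num_gradients=4, prompts_per_gradient=1, beam_size=4):
    """Expand every prompt in the beam via critiques, then keep the top beam_size."""
    candidates = list(beam)
    for prompt in beam:
        for gradient in critique(prompt, num_gradients):
            for _ in range(prompts_per_gradient):
                candidates.append(apply_critique(prompt, gradient))
    unique = list(dict.fromkeys(candidates))  # drop exact duplicates, keep order
    return sorted(unique, key=score, reverse=True)[:beam_size]

beam = ["Classify the ticket."]
for _ in range(2):
    beam = protegi_round(beam, toy_critique, toy_apply, toy_score, beam_size=2)
```

After two rounds every surviving prompt has absorbed two critiques. The beam is what lets a second-best edit from round one stay alive long enough to combine with a later one.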

5. GEPA: evolutionary search with reflection

GEPAOptimizer wraps the external gepa library and exposes a small surface: reflection_model, generator_model, and max_metric_calls. Inside, GEPA runs an evolutionary loop with reflection: a population of prompts is maintained, the reflection model proposes mutations and crossovers based on per-rollout traces, and a Pareto frontier keeps diverse high performers across multiple objectives.

The published GEPA paper (arXiv 2507.19457) reports it outperforming MIPROv2 by more than 10% on the studied benchmarks and using up to 35x fewer rollouts than GRPO. The “fewer rollouts” claim is what matters in production: it’s the difference between a $300 optimization pass and a $30 one.

Use when. The dataset is small to medium (50-500 items), per-rollout signals are informative (you can extract why a rollout succeeded or failed), and you want a population of winners rather than a single rewrite. GEPA is the right default for multi-prompt pipelines where you’re optimizing several prompts that interact.
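The Pareto-frontier step is the part that keeps a population diverse, and it is simple to sketch in isolation. The code below shows only that selection rule, not the gepa library or its reflection step; the candidate names and scores are made up for illustration.

```python
def dominates(a, b):
    """a dominates b if it is at least as good on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scored):
    """Keep every candidate that no other candidate dominates."""
    return {
        name: scores
        for name, scores in scored.items()
        if not any(
            dominates(other, scores)
            for other_name, other in scored.items()
            if other_name != name
        )
    }

# Hypothetical scores per prompt: (accuracy, schema compliance)
candidates = {
    "terse": (0.70, 0.95),
    "verbose": (0.82, 0.80),
    "balanced": (0.80, 0.90),
    "weak": (0.65, 0.78),
}
front = pareto_front(candidates)
```

Here "weak" is dropped because "terse" beats it on both objectives, while the other three survive because each is best at some trade-off. A single-score ranking would have kept only one of them.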

6. PromptWizard: multi-stage instruction refinement

PromptWizardOptimizer (the Microsoft framework, adapted in fi.opt.optimizers.promptwizard) runs a three-stage loop per round. Mutate: generate num_variations paraphrases by mixing in different “thinking styles” (chain-of-thought, step-by-step, role-play, etc.) over mutate_rounds. Critique: a teacher LLM writes a detailed critique of why the current best fails on a subset. Refine: produce steps_per_sample improved versions conditioned on the critique. Run for refine_iterations outer iterations.

Use when. The instruction text itself is the thing you want to optimize (not few-shot examples, not output schema) and you want stylistic diversity in the candidate pool. PromptWizard is the most opinionated about how a prompt should be rewritten (thinking-style mixing is its trademark), which is helpful when you want exploration but a hindrance when your task requires a specific tone.
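One round of mutate, critique, refine can be sketched as follows. This is a toy under stated assumptions, not the PromptWizard code: the thinking-style strings, `toy_failures`, and `toy_score` are hypothetical, and a real critique and refine step would each be a teacher LLM call.

```python
THINKING_STYLES = [
    "Think step by step.",
    "Act as a domain expert.",
    "Reason about edge cases first.",
]

def mutate(prompt, num_variations):
    # Mix a different thinking style into each variation.
    return [f"{prompt} {style}" for style in THINKING_STYLES[:num_variations]]

def toy_failures(prompt):
    return [] if "JSON" in prompt else ["output was not valid JSON"]

def critique(prompt, failures):
    # Hypothetical teacher critique: summarize why the prompt fails.
    return "; ".join(failures)

def refine(prompt, critique_text):
    # Hypothetical teacher refinement conditioned on the critique.
    if "JSON" in critique_text:
        return prompt + " Respond in valid JSON."
    return prompt

def toy_score(prompt):
    return (5 + 2 * ("step by step" in prompt) + 3 * ("JSON" in prompt)) / 10

def promptwizard_round(prompt, score, num_variations=3):
    pool = [prompt] + mutate(prompt, num_variations)  # mutate
    best = max(pool, key=score)
    note = critique(best, toy_failures(best))         # critique
    refined = refine(best, note)                      # refine
    return max([best, refined], key=score)

result = promptwizard_round("Extract the order ID.", toy_score)
```

The mutation stage explores style, and the critique stage then fixes a concrete failure in whichever style won. That ordering is why the candidate pool stays stylistically diverse.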

Defining the eval: the part that bounds everything

The optimizer is a search procedure against a metric. If the metric is wrong, the optimizer finds prompts that exploit the wrongness. That’s not a hypothetical failure mode; it’s the failure mode.

Three things to get right before running any optimizer.

Anchor the rubric in failure modes you can name. “Output quality” isn’t a rubric. “Refusal-rate parity across protected categories, schema compliance on tool calls, hallucination check against the retrieved chunks, latency under 800ms” is a rubric. Each named dimension becomes an evaluator that emits a score; the optimizer’s score_aggregator combines them. Future AGI’s ai-evaluation ships 50+ pre-built evaluators (LLM-as-judge plus heuristic) and 20+ local metrics; pick the ones that match your failure surface rather than wiring custom judges from scratch.

Split your dataset before you start. A training slice (60-70%) the optimizer sees, a validation slice (20-30%) it never sees, and a held-out test slice (10%) reserved for final verification. Score every candidate on training, pick the winner by validation, report on test. The slice you skip is the one whose overfitting will hurt you.
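A minimal sketch of the three-way split, assuming a 70/20/10 ratio from the ranges above and a fixed seed so the slices are reproducible across runs. `split_dataset` is a hypothetical helper, not part of agent-opt.

```python
import random

def split_dataset(items, train=0.7, val=0.2, seed=13):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],                 # the optimizer sees this slice
        shuffled[n_train:n_train + n_val],  # picks the winner, never optimized on
        shuffled[n_train + n_val:],         # reserved for final verification
    )

train_set, val_set, test_set = split_dataset(range(200))
```

Splitting once, before any optimizer runs, matters more than the exact ratio. Re-splitting between runs lets validation items leak into a later training slice.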

Add a length penalty. Free-form optimizers (ProTeGi, GEPA, PromptWizard, MetaPrompt) drift toward verbosity because most LLM judges are length-biased. A simple penalty (the score multiplied by a factor that stays at 1 up to a token-count threshold and decays smoothly above it) keeps the optimizer honest. This is one line in your score_aggregator.
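One possible shape for such a penalty, sketched with a logistic decay. The threshold, scale, and function names are assumptions for illustration, and the aggregator signature here is hypothetical rather than the one agent-opt expects.

```python
import math

def length_penalty(token_count, threshold=400, scale=100.0):
    """1.0 at or below the threshold, decaying smoothly toward 0 above it."""
    excess = max(0, token_count - threshold)
    return 2.0 / (1.0 + math.exp(excess / scale))

def aggregate(scores, prompt_tokens):
    """Mean evaluator score, discounted when the prompt runs long."""
    mean = sum(scores) / len(scores)
    return mean * length_penalty(prompt_tokens)

short_prompt = aggregate([0.8, 0.9], prompt_tokens=300)
long_prompt = aggregate([0.8, 0.9], prompt_tokens=900)
```

With these settings a prompt 100 tokens over the threshold keeps a little over half its score, so the optimizer has to earn extra length with a real gain in quality.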

Search-space design: template-mode vs free-form

Two regimes, both valid.

Template-mode constrains the search space. The prompt has fixed structure (system instruction plus few-shot slots plus output schema) and the optimizer only fills the slots. BayesianSearchOptimizer is the canonical template-mode optimizer in agent-opt: the search space is which examples to include and how many, formatting is templated, the instruction text doesn’t change. Template-mode is cheaper, more stable, and easier to defend in production review.

Free-form lets the optimizer rewrite any text. ProTeGi, GEPA, PromptWizard, and MetaPrompt are free-form. Free-form has more headroom (the optimizer can find structural changes that template-mode can’t) but more failure modes (verbosity drift, contamination of eval items into the prompt body, judge-gaming).

The pattern that works: start template-mode for structured tasks (classification, extraction, schema-bound generation), reach for free-form only when the seed prompt itself is misshapen. The biggest gains usually come from template-mode optimizers with a good few-shot block, not from free-form rewrites of a clean instruction.

Compute budget management

Three knobs.

Early stopping. EarlyStoppingConfig accepts four orthogonal stop conditions: patience (stop after N iterations without improvement), min_score_threshold (stop when score crosses a target), min_delta (minimum improvement to reset patience), and max_evaluations (a hard cap on dataset evaluations across the run). Wire max_evaluations to your dollar budget; wire patience to your “the optimizer has plateaued” tolerance. Both flow into the same EarlyStoppingChecker regardless of which optimizer you use.
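How the four settings interact is easier to see in code. The sketch below reuses the field names from the text but is a hypothetical reimplementation (`StopRules`, `StopChecker`), not the library's EarlyStoppingConfig or EarlyStoppingChecker.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StopRules:
    patience: Optional[int] = None
    min_score_threshold: Optional[float] = None
    min_delta: float = 0.0
    max_evaluations: Optional[int] = None

class StopChecker:
    def __init__(self, rules):
        self.rules = rules
        self.best = float("-inf")
        self.stale_rounds = 0
        self.evaluations = 0

    def update(self, score, evaluations_used):
        """Record one round; return the reason to stop, or None to continue."""
        self.evaluations += evaluations_used
        if score > self.best + self.rules.min_delta:
            self.best = score
            self.stale_rounds = 0   # a real improvement resets patience
        else:
            self.stale_rounds += 1
        rules = self.rules
        if rules.min_score_threshold is not None and self.best >= rules.min_score_threshold:
            return "min_score_threshold"
        if rules.max_evaluations is not None and self.evaluations >= rules.max_evaluations:
            return "max_evaluations"
        if rules.patience is not None and self.stale_rounds >= rules.patience:
            return "patience"
        return None

checker = StopChecker(StopRules(patience=2, min_delta=0.01, max_evaluations=500))
reason = None
for round_score in [0.70, 0.74, 0.745, 0.744, 0.80]:
    reason = checker.update(round_score, evaluations_used=32)
    if reason:
        break
```

In this run the gain from 0.74 to 0.745 is below min_delta, so it counts as a stale round, and the loop stops on patience after four rounds and 128 evaluations without ever reaching the fifth score.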

Sampling. Most optimizers expose an eval_subset_size (32 for ProTeGi default, 40 for MetaPrompt, 25 for PromptWizard). Each candidate is evaluated on the subset, not the full dataset; the final winner gets re-scored on the full set. Subset size of 30-50 is usually enough signal to rank candidates correctly.

Resumability. Only BayesianSearchOptimizer natively supports trial-level resumability via Optuna’s storage and study_name. If your runs need to span multiple CI invocations or your eval calls are expensive enough that a crashed run is costly, BayesianSearch is the only optimizer in the list with native checkpointing.

Wiring the optimizer into CI

The production loop has two cadences, not one.

Offline optimization pass. Runs when you ship a new prompt or swap models. The optimizer runs against the training slice. Validation slice picks the winner. The winner goes behind a feature flag. An A/B against the previous version on production traffic decides promotion. Mining of new failure cohorts (low eval scores, user-flagged outputs, escalations) feeds back into the next round’s training slice. This is the cycle that matters operationally; the optimizer is only as good as the data flowing into it.

Regression gate on every PR. Runs on every code change. The production eval suite (the same evaluators the optimizer used) runs against the candidate prompt; if validation score regresses against the previous version’s baseline, the PR fails. The optimizer itself doesn’t run here (too slow, too expensive) but its versioned output is what the gate guards. Tag every production span with prompt.version; when production metrics regress, attribution is automatic. Future AGI’s traceAI auto-instruments these spans across Python, TypeScript, and Java without code changes in the agent loop.
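The gate itself can be very small. A sketch, assuming per-metric scores for the baseline and candidate prompt versions are already computed; the metric names, values, and tolerance are illustrative, and `regression_gate` is a hypothetical helper.

```python
def regression_gate(baseline, candidate, tolerance=0.0):
    """Return the metrics on which the candidate regresses beyond the tolerance."""
    return [
        metric
        for metric, base in baseline.items()
        # A metric missing from the candidate counts as a regression.
        if candidate.get(metric, float("-inf")) < base - tolerance
    ]

baseline = {"schema_compliance": 0.97, "groundedness": 0.88}   # previous prompt.version
candidate = {"schema_compliance": 0.98, "groundedness": 0.84}  # prompt in this PR

failures = regression_gate(baseline, candidate, tolerance=0.01)
exit_code = 1 if failures else 0  # a CI job would pass this to sys.exit
```

Gating per metric rather than on an average is a deliberate choice here: a gain in one evaluator should not be allowed to hide a regression in another.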

The pattern that ships in production: optimizer outputs a versioned prompt, prompt is gated by validation score, deployed behind a flag, A/B’d on real traffic, promoted on production signal, monitored with span-attached eval scores. Six steps. Each one removes a failure mode the previous step couldn’t catch.

Future AGI: where agent-opt fits in the eval stack

The six optimizers above ship in fi.opt.optimizers as part of agent-opt, which is one surface in Future AGI’s evaluation stack. The pieces compose because they share data structures.

ai-evaluation provides the Evaluator the optimizer needs: 50+ pre-built LLM-as-judge templates (Factual Accuracy, Groundedness, Tone, Toxicity, schema checks) plus 20+ local heuristic metrics that run sub-second offline. Error localization tells you which field of which input caused a judge to fail, which is exactly what ProTeGi’s gradient stage and MetaPrompt’s hypothesis stage feed on.

traceAI (Apache 2.0) auto-instruments 50+ AI surfaces across four languages. Every span carries the prompt version, the evaluator scores, and the latency breakdown; the production failure cohort that feeds next quarter’s optimization run is built from these spans, not from spreadsheet exports.

The Future AGI Platform layers the platform-only surface on top: self-improving evaluators that refresh against drift, in-product agent authoring, lower per-evaluation cost than Galileo Luna-2, and SOC 2 Type II, HIPAA, GDPR, and CCPA certification per the trust page. When the optimizer ships a winning prompt, the platform versions it, the eval gate guards it, and the trace pipeline attributes any production regression back to the version.

The honest framing: if you want a library-only OSS path, DSPy with GEPA is a solid default. If you want the optimizer to consume failing trajectories from your production traces as training data, emit a versioned prompt, and have the same evaluator score it in CI and in production, the agent-opt + ai-evaluation + traceAI bundle is what you’d otherwise stitch together from four vendors.

Common mistakes

  • Picking ProTeGi or GEPA before running RandomSearch. You don’t know if the wording is the bottleneck. Three teacher-generated paraphrases tell you in 20 minutes.
  • Optimizing without a held-out validation slice. The optimizer will find prompts that beat its training set and lose on production traffic.
  • No length penalty on a free-form optimizer. Prompts grow until the judge prefers verbosity. Cap it in the aggregator.
  • Treating optimization as a one-time pass. Production drift means your winning prompt is stale within a quarter. Mine new failures, re-run.
  • Skipping the human audit. Sample 30-50 outputs from the winner before promoting; if human ratings don’t track the eval score, the evaluator is wrong and another optimizer pass won’t fix it.
  • Co-tuning a multi-prompt pipeline by tuning each prompt in isolation. The interactions are where the gains live. GEPA and ProTeGi can handle pipeline-shaped programs; isolation can’t.
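
For the human audit, one simple check is the rate at which human pass/fail judgments agree with the evaluator's. The sketch below is illustrative only: the sample is tiny, the 0.7 pass threshold is an assumption, and `agreement_rate` is a hypothetical helper.

```python
def agreement_rate(human_pass, eval_scores, eval_threshold=0.7):
    """Fraction of audited outputs where the human verdict matches the evaluator's."""
    matches = sum(
        human == (score >= eval_threshold)
        for human, score in zip(human_pass, eval_scores)
    )
    return matches / len(human_pass)

# Hypothetical audit of six outputs from the winning prompt.
human_pass = [True, True, False, True, False, False]
eval_scores = [0.90, 0.80, 0.85, 0.75, 0.40, 0.72]

rate = agreement_rate(human_pass, eval_scores)
```

Here the evaluator passes two outputs that humans rejected, so agreement is four in six. On a real 30-50 item audit, a rate that low would point at the evaluator rather than the prompt.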

What’s next

The six optimizers cover the practical search-space-and-budget combinations a production team actually has. The shifts to watch through the rest of 2026: more distilled judges (cheaper per-eval, less length-biased), tighter integration between optimization and tracing (the failure cohort becoming the training slice automatically), and platform-level prompt versioning that makes “ship the winning prompt” a one-line operation rather than a deployment.

For the wiring from prompt management to traces, see linking prompt management with tracing. For where prompt versioning fits in the lifecycle, see what is prompt versioning?. For the broader tool landscape, see best prompt engineering tools 2026 and prompt optimization at scale.

Frequently asked questions

What is automated prompt improvement?
Automated prompt improvement is an optimization loop that takes a seed prompt, a dataset of examples, and an evaluator, then returns a rewritten prompt that scores higher against that evaluator. The engineer doesn't manually rewrite the prompt; an optimizer proposes candidates, the evaluator scores each, and a search procedure keeps the winners. In 2026 the practical implementations are a small family of named algorithms (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard) that differ in how they generate candidates and how they spend their compute budget. The quality of the result is bounded by the quality of the evaluator, so the eval has to be defined first and trusted before any of these algorithms are useful.
Which prompt optimizer should I start with?
Start with RandomSearch. A teacher model generates a handful of paraphrases of your seed prompt, you score each on your dataset, and you keep the best. It's the cheapest baseline and it answers a useful question: is the prompt the bottleneck, or is the eval (or the model) the bottleneck? If RandomSearch moves the score, you have headroom and a more directed optimizer (ProTeGi, GEPA, PromptWizard) will likely move it further. If RandomSearch doesn't move the score, the issue is not the wording of the prompt and no optimizer will rescue you. After RandomSearch, pick by signal type: ProTeGi when failures cluster into nameable critiques, GEPA when you can afford a reflective evolutionary search, BayesianSearch when each eval is expensive and you want resumable trials.
How is ProTeGi different from GEPA?
ProTeGi treats text feedback as a gradient. For each failing example, a teacher LLM writes a short critique of why the prompt failed; the optimizer then asks the teacher to apply that critique and produce a new prompt; beam search keeps the top-K candidates. GEPA is genetic and reflective: it maintains a population of candidate prompts, the reflection model proposes mutations and crossovers based on per-rollout traces, and a Pareto frontier preserves diverse high performers across multiple objectives. ProTeGi is the right pick when you have crisp failures with clear critique signal. GEPA is the right pick when the search space is large, the budget is generous, and you want a population rather than a single winner.
How do I keep an automated optimizer from overfitting to my eval set?
Three guardrails. First, split your dataset into a training slice the optimizer sees and a validation slice it never sees; trust only validation scores. Second, add a length penalty to your evaluator so the optimizer can't game length-biased judges by writing 800-word prompts. Third, do a human audit on 30-50 production outputs from the winning prompt before promoting; if the human rating tracks the eval score, you have signal; if not, the evaluator is the problem and another optimizer pass won't fix it. The framing to internalize: the optimizer searches against the metric you give it, so any flaw in the metric becomes a flaw the optimizer learns to exploit.
How do I manage compute budget across an optimization run?
Use EarlyStoppingConfig and resumable Bayesian trials. EarlyStoppingConfig in agent-opt accepts four orthogonal stop conditions: patience (stop after N iterations without improvement), min_score_threshold (stop when score crosses a target), min_delta (minimum improvement to reset patience), and max_evaluations (hard cap on dataset evaluations). Configure max_evaluations to bound dollar cost; configure patience to stop when the optimizer plateaus. For BayesianSearch specifically, set an Optuna storage URL and study_name so a run that hits the budget can resume later with load_if_exists, accumulating trials across days rather than starting over.
Where does prompt optimization fit in CI?
Two cadences. The first is the offline optimization pass that runs when you ship a new prompt or swap models: the optimizer runs against a training slice, validation slice picks the winner, the winner goes behind a feature flag, and an A/B against the previous version decides promotion. The second is the regression gate that runs on every PR: the production eval suite runs against the candidate prompt; if scores regress against the previous version, the PR fails. The optimizer doesn't run on every PR (too slow, too expensive) but its output is versioned, tagged on every span as prompt.version, and the eval gate guards against silent regressions when other code changes.
What's the difference between template-mode and free-form prompt search?
Template-mode constrains the search space. The prompt has fixed structure (system instruction plus N few-shot slots plus a fixed output schema) and the optimizer only fills the slots. BayesianSearch in agent-opt works this way: the search space is which few-shot examples to include and how many, not the wording of the instruction. Free-form search is the opposite: the optimizer can rewrite anything. ProTeGi, GEPA, PromptWizard, MetaPrompt are free-form. Template-mode is cheaper, more stable, and easier to defend in production. Free-form has more headroom but more failure modes (verbosity drift, contamination, judge-gaming). Default to template-mode for structured tasks; reach for free-form when the seed prompt itself is wrong.