DSPy Optimizers Explained in 2026: BootstrapFewShot, MIPROv2, COPRO, and GEPA
A practitioner's comparison of DSPy optimizers: how BootstrapFewShot, MIPROv2, COPRO, and GEPA differ, and a ladder for picking the right one.
Table of Contents
Originally published May 29, 2026.
You wrapped your task in a
dspy.Module, wrote a metric, and got to the line everyone gets stuck on:optimizer = ???. The docs list BootstrapFewShot, BootstrapFewShotWithRandomSearch, MIPROv2, COPRO, and GEPA, each with its own knobs, and none of them says which one is for your problem. So you pick the one from the tutorial you read last, run it for an hour, and hope.
This post settles that choice. We will walk through what each DSPy optimizer actually does, where each one wins, and a ladder for picking the cheapest one that clears your eval. If you want the method-level theory behind these (textual gradients, genetic search, meta-prompting), read our companion piece on automatic prompt optimization; this one is about DSPy’s specific implementations.
What Are DSPy Optimizers?
DSPy optimizers, formerly called teleprompters, are algorithms that tune the prompts and few-shot examples inside a DSPy program against a metric, so you do not hand-write them. You define a metric and a trainset, wrap your task in a module, pick an optimizer, and it searches for instructions and demonstrations that score higher. The four that matter in 2026 are BootstrapFewShot, MIPROv2, COPRO, and GEPA, and they differ in what they tune and how much compute they spend.
The naming changed, which trips people up. DSPy renamed teleprompters to optimizers, so older tutorials and newer docs describe the same classes with different words. The role never changed: compile a program by searching for better prompts and demos against your metric. If you are new to the framework itself, start with what DSPy is and come back for the optimizer choice.
How Does BootstrapFewShot Work?
BootstrapFewShot is the fast baseline. It runs your program over the trainset, keeps the traces where the output passed your metric, and uses those successful runs as bootstrapped few-shot demonstrations in the prompt. No new instructions are written; the lift comes entirely from showing the model good examples of the task being done right.
It is the right first move because it is cheap and it works on small data. If a handful of bootstrapped demos clears your eval, you are done, and you spent minutes rather than hours. The ceiling is also where it stops: when the instruction itself is the problem, better examples will not fix it, and you climb the ladder.
How Does MIPROv2 Work?
MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) optimizes two things at once: the instructions and the few-shot demos. It proposes candidate instructions, generates candidate demonstration sets, and then runs a Bayesian search over which combination of instruction and demos scores best across your trainset.
The design point is the joint search. MIPROv2 writes instructions that complement the examples it selects, instead of treating them as independent, which produces a more cohesive prompt than tuning either alone. It rewards data: with a couple hundred examples it has enough signal for the Bayesian search to find a strong combination. With twenty examples, the search is mostly noise, and BootstrapFewShot is the better spend.
How Does COPRO Work?
COPRO (Collaborative Prompt Optimization) optimizes instructions only, and it does so by coordinate ascent. It generates candidate instructions, scores them on the metric, keeps the best, generates refinements of that, scores again, and hill-climbs until the score stops improving.
COPRO sits beside MIPROv2 rather than above it. Reach for COPRO when the demonstrations are fixed or irrelevant and the instruction is the lever, for example a task where you cannot show examples for privacy reasons but you can describe the rules. When demos matter, MIPROv2’s joint search usually wins because COPRO never touches them.
How Does GEPA Work?
GEPA (Genetic-Pareto) is the reflective evolutionary optimizer, accepted as an ICLR 2026 oral. Instead of bootstrapping demos or searching instruction combinations, it reads the program’s execution trajectory, reflects in natural language on what worked and what failed, and proposes prompts that address the gaps. It evolves a population on a Pareto front across objectives, and it can use domain-specific textual feedback as part of the signal.
Two properties make GEPA worth the setup. First, the Pareto front keeps prompts that are each best at something, so multi-objective and multi-module pipelines do not get averaged into a bland compromise. Second, the reflection is sample-efficient: the GEPA paper reports roughly a 13 percent aggregate improvement over MIPROv2 and a 12 percent gain on AIME 2025 with around 35 times fewer rollouts on its benchmarks. It is the top of the ladder because it is the most capable and the most setup.
Which DSPy Optimizer Should You Use?
Here is the comparison on the axes that decide the choice.
| Optimizer | Tunes | Search method | Best for | Data and compute |
|---|---|---|---|---|
| BootstrapFewShot | Few-shot demos | Bootstrapped successful traces | Fast baseline, small data | Low, minutes |
| MIPROv2 | Instructions + demos | Bayesian joint search | Demo-heavy single-metric tasks | 200+ examples, medium |
| COPRO | Instructions only | Coordinate ascent | Instruction-led, no demos | Small to medium |
| GEPA | Whole program | Reflective Pareto evolution | Multi-objective, pipelines | Moderate, rollout-efficient |
The DSPy Optimizer Ladder
The mistake is picking the fanciest optimizer first. The rule is to run the cheapest rung that clears your eval and climb only when it does not. We call this the DSPy Optimizer Ladder.
- Start on BootstrapFewShot. Minutes to run, works on small data. If the baseline passes, ship it.
- Climb to MIPROv2 when you have a couple hundred examples and want instructions and demos tuned together, or step sideways to COPRO when only the instruction is in play.
- Climb to GEPA when you have multiple objectives, a multi-module pipeline, or MIPROv2 has plateaued and you can spend the rollouts.
Most single-prompt tasks never leave the first two rungs.
How Do You Run Optimization Without Wiring DSPy Yourself?
DSPy is a framework: you author modules and signatures, define a metric, and run the optimizer in-process. That control is the point, and it is also the cost. If you want the same algorithm families without managing the plumbing, a platform is the other path.
Future AGI’s agent-opt SDK ships ProTeGi, GEPA, and meta-prompt optimizers behind one interface, scored by the same 50+ evaluators it uses for LLM evaluation. The difference from DSPy is where the optimizer lives: instead of compiling a module in your code, you point an optimizer at a dataset and an evaluator, and the score that picks the winner is the same score you monitor in production.
from fi.opt.optimizers import GEPAOptimizer
from fi.opt.base.evaluator import Evaluator
from fi.opt.datamappers.basic_mapper import BasicDataMapper
evaluator = Evaluator(metric=your_metric) # a Future AGI metric object
data_mapper = BasicDataMapper(key_map={"output": "output", "expected": "expected"})
optimizer = GEPAOptimizer(
reflection_model="gpt-4o", # reflects on failures, proposes prompts
generator_model="gpt-4o-mini", # runs the candidate prompts
)
result = optimizer.optimize(
evaluator=evaluator,
data_mapper=data_mapper,
dataset=test_cases, # list of {input, expected} dicts
initial_prompts=["You are a helpful assistant."],
max_metric_calls=150, # the rollout budget
)
print(result.best_generator.prompt_template) # the winning prompt
print(result.final_score)
Where Each Layer Falls Short
- DSPy needs a metric and a trainset. Every optimizer maximizes the metric, so a weak metric yields a program that games it. Calibrate the metric first, and if you are scoring a multi-step program, our guide to evaluating DSPy pipelines covers signature-level metrics.
- GEPA and MIPROv2 cost rollouts. They are more compute than BootstrapFewShot. Budget them like a CI job and use the ladder so you only pay for the rung you need.
- The platform trades control for convenience. If you want full module-level and signature-level control of a complex pipeline, DSPy stays the better fit. The platform wins when you want the optimizer and the eval in one stack.
Is DSPy the Right Layer for You?
DSPy is the open-source standard for treating prompts as something you compile rather than write, and its optimizers are the clearest implementations of the major algorithm families. If you are weighing it against other frameworks entirely, see our DSPy alternatives breakdown. The choice between them is not about which is newest; it is about how much data and compute you have and whether you are tuning one prompt or a pipeline. Run the ladder: BootstrapFewShot first, MIPROv2 or COPRO when the instruction and demos need joint work, GEPA when the problem is multi-objective. And if managing the framework is more than you want, agent-opt gives you the same families with the eval already wired in.
Want to optimize a prompt against your eval without compiling a DSPy module? Start with Future AGI’s optimization docs and run a GEPA or ProTeGi pass on the dataset you already have.
Sources
Frequently asked questions
What are DSPy optimizers?
What is the difference between MIPROv2 and GEPA in DSPy?
Which DSPy optimizer should I use first?
Do DSPy optimizers need a labeled dataset?
What replaced teleprompters in DSPy?
How do I run prompt optimization without managing DSPy myself?
Automatic prompt optimization explained: textual gradients (ProTeGi), score trajectories (OPRO), genetic evolution (GEPA), meta-prompting, and how to pick one.
PII and toxicity scanners never see the tool call. Agent runtime guardrails (tool permissions, MCP security, system-prompt protection) catch what they miss.
A generic chatbot answers questions about your data. Falcon AI runs the eval, drills the trace, and files the ticket, with 300+ tools and page context.