Engineering

DSPy Optimizers Explained in 2026: BootstrapFewShot, MIPROv2, COPRO, and GEPA

A practitioner's comparison of DSPy optimizers: how BootstrapFewShot, MIPROv2, COPRO, and GEPA differ, and a ladder for picking the right one.

May 29, 2026

6 min read

dspy dspy-optimizers prompt-optimization gepa miprov2 2026

Table of Contents

Originally published May 29, 2026.

You wrapped your task in a dspy.Module, wrote a metric, and got to the line everyone gets stuck on: optimizer = ???. The docs list BootstrapFewShot, BootstrapFewShotWithRandomSearch, MIPROv2, COPRO, and GEPA, each with its own knobs, and none of them says which one is for your problem. So you pick the one from the tutorial you read last, run it for an hour, and hope.

This post settles that choice. We will walk through what each DSPy optimizer actually does, where each one wins, and a ladder for picking the cheapest one that clears your eval. If you want the method-level theory behind these (textual gradients, genetic search, meta-prompting), read our companion piece on automatic prompt optimization; this one is about DSPy’s specific implementations.

What Are DSPy Optimizers?

DSPy optimizers, formerly called teleprompters, are algorithms that tune the prompts and few-shot examples inside a DSPy program against a metric, so you do not hand-write them. You define a metric and a trainset, wrap your task in a module, pick an optimizer, and it searches for instructions and demonstrations that score higher. The four that matter in 2026 are BootstrapFewShot, MIPROv2, COPRO, and GEPA, and they differ in what they tune and how much compute they spend.

The naming changed, which trips people up. DSPy renamed teleprompters to optimizers, so older tutorials and newer docs describe the same classes with different words. The role never changed: compile a program by searching for better prompts and demos against your metric. If you are new to the framework itself, start with what DSPy is and come back for the optimizer choice.

How Does BootstrapFewShot Work?

BootstrapFewShot is the fast baseline. It runs your program over the trainset, keeps the traces where the output passed your metric, and uses those successful runs as bootstrapped few-shot demonstrations in the prompt. No new instructions are written; the lift comes entirely from showing the model good examples of the task being done right.

It is the right first move because it is cheap and it works on small data. If a handful of bootstrapped demos clears your eval, you are done, and you spent minutes rather than hours. The ceiling is also where it stops: when the instruction itself is the problem, better examples will not fix it, and you climb the ladder.

How Does MIPROv2 Work?

MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) optimizes two things at once: the instructions and the few-shot demos. It proposes candidate instructions, generates candidate demonstration sets, and then runs a Bayesian search over which combination of instruction and demos scores best across your trainset.

The design point is the joint search. MIPROv2 writes instructions that complement the examples it selects, instead of treating them as independent, which produces a more cohesive prompt than tuning either alone. It rewards data: with a couple hundred examples it has enough signal for the Bayesian search to find a strong combination. With twenty examples, the search is mostly noise, and BootstrapFewShot is the better spend.

How Does COPRO Work?

COPRO (Collaborative Prompt Optimization) optimizes instructions only, and it does so by coordinate ascent. It generates candidate instructions, scores them on the metric, keeps the best, generates refinements of that, scores again, and hill-climbs until the score stops improving.

COPRO sits beside MIPROv2 rather than above it. Reach for COPRO when the demonstrations are fixed or irrelevant and the instruction is the lever, for example a task where you cannot show examples for privacy reasons but you can describe the rules. When demos matter, MIPROv2’s joint search usually wins because COPRO never touches them.

How Does GEPA Work?

GEPA (Genetic-Pareto) is the reflective evolutionary optimizer, accepted as an ICLR 2026 oral. Instead of bootstrapping demos or searching instruction combinations, it reads the program’s execution trajectory, reflects in natural language on what worked and what failed, and proposes prompts that address the gaps. It evolves a population on a Pareto front across objectives, and it can use domain-specific textual feedback as part of the signal.

Two properties make GEPA worth the setup. First, the Pareto front keeps prompts that are each best at something, so multi-objective and multi-module pipelines do not get averaged into a bland compromise. Second, the reflection is sample-efficient: the GEPA paper reports roughly a 13 percent aggregate improvement over MIPROv2 and a 12 percent gain on AIME 2025 with around 35 times fewer rollouts on its benchmarks. It is the top of the ladder because it is the most capable and the most setup.

Which DSPy Optimizer Should You Use?

Here is the comparison on the axes that decide the choice.

Optimizer	Tunes	Search method	Best for	Data and compute
BootstrapFewShot	Few-shot demos	Bootstrapped successful traces	Fast baseline, small data	Low, minutes
MIPROv2	Instructions + demos	Bayesian joint search	Demo-heavy single-metric tasks	200+ examples, medium
COPRO	Instructions only	Coordinate ascent	Instruction-led, no demos	Small to medium
GEPA	Whole program	Reflective Pareto evolution	Multi-objective, pipelines	Moderate, rollout-efficient

The DSPy Optimizer Ladder

The mistake is picking the fanciest optimizer first. The rule is to run the cheapest rung that clears your eval and climb only when it does not. We call this the DSPy Optimizer Ladder.

Start on BootstrapFewShot. Minutes to run, works on small data. If the baseline passes, ship it.
Climb to MIPROv2 when you have a couple hundred examples and want instructions and demos tuned together, or step sideways to COPRO when only the instruction is in play.
Climb to GEPA when you have multiple objectives, a multi-module pipeline, or MIPROv2 has plateaued and you can spend the rollouts.

Most single-prompt tasks never leave the first two rungs.

How Do You Run Optimization Without Wiring DSPy Yourself?

DSPy is a framework: you author modules and signatures, define a metric, and run the optimizer in-process. That control is the point, and it is also the cost. If you want the same algorithm families without managing the plumbing, a platform is the other path.

Future AGI’s agent-opt SDK ships ProTeGi, GEPA, and meta-prompt optimizers behind one interface, scored by the same 50+ evaluators it uses for LLM evaluation. The difference from DSPy is where the optimizer lives: instead of compiling a module in your code, you point an optimizer at a dataset and an evaluator, and the score that picks the winner is the same score you monitor in production.

from fi.opt.optimizers import GEPAOptimizer
from fi.opt.base.evaluator import Evaluator
from fi.opt.datamappers.basic_mapper import BasicDataMapper

evaluator = Evaluator(metric=your_metric)              # a Future AGI metric object
data_mapper = BasicDataMapper(key_map={"output": "output", "expected": "expected"})

optimizer = GEPAOptimizer(
    reflection_model="gpt-4o",          # reflects on failures, proposes prompts
    generator_model="gpt-4o-mini",      # runs the candidate prompts
)

result = optimizer.optimize(
    evaluator=evaluator,
    data_mapper=data_mapper,
    dataset=test_cases,                            # list of {input, expected} dicts
    initial_prompts=["You are a helpful assistant."],
    max_metric_calls=150,                          # the rollout budget
)

print(result.best_generator.prompt_template)   # the winning prompt
print(result.final_score)

Where Each Layer Falls Short

DSPy needs a metric and a trainset. Every optimizer maximizes the metric, so a weak metric yields a program that games it. Calibrate the metric first, and if you are scoring a multi-step program, our guide to evaluating DSPy pipelines covers signature-level metrics.
GEPA and MIPROv2 cost rollouts. They are more compute than BootstrapFewShot. Budget them like a CI job and use the ladder so you only pay for the rung you need.
The platform trades control for convenience. If you want full module-level and signature-level control of a complex pipeline, DSPy stays the better fit. The platform wins when you want the optimizer and the eval in one stack.

Is DSPy the Right Layer for You?

DSPy is the open-source standard for treating prompts as something you compile rather than write, and its optimizers are the clearest implementations of the major algorithm families. If you are weighing it against other frameworks entirely, see our DSPy alternatives breakdown. The choice between them is not about which is newest; it is about how much data and compute you have and whether you are tuning one prompt or a pipeline. Run the ladder: BootstrapFewShot first, MIPROv2 or COPRO when the instruction and demos need joint work, GEPA when the problem is multi-objective. And if managing the framework is more than you want, agent-opt gives you the same families with the eval already wired in.

Want to optimize a prompt against your eval without compiling a DSPy module? Start with Future AGI’s optimization docs and run a GEPA or ProTeGi pass on the dataset you already have.

Sources

Frequently asked questions

What are DSPy optimizers?

DSPy optimizers, formerly called teleprompters, are algorithms that tune the prompts and few-shot examples inside a DSPy program against a metric, so you do not hand-write them. You define a metric and a trainset, wrap your task in a DSPy module, pick an optimizer, and it searches for instructions and demonstrations that score higher. The main ones in 2026 are BootstrapFewShot (bootstraps demos), MIPROv2 (joint instruction and demo search), COPRO (instruction-only coordinate ascent), and GEPA (reflective evolutionary search). They differ in what they tune and how much compute they spend.

What is the difference between MIPROv2 and GEPA in DSPy?

MIPROv2 proposes candidate instructions and few-shot demos, then runs a Bayesian search over which combination scores best, and it shines when you have a couple hundred labeled examples. GEPA is reflective and evolutionary: it reads the program's execution trajectory, reflects in natural language on what failed, and evolves prompts on a Pareto front across objectives. The GEPA paper reports roughly a 13 percent aggregate lift over MIPROv2 with far fewer rollouts on its benchmarks. Rule of thumb: MIPROv2 for single-metric demo-heavy tasks with data, GEPA for multi-objective or multi-module pipelines and tight rollout budgets.

Which DSPy optimizer should I use first?

BootstrapFewShot, almost always. It is the fastest to run and gives you a baseline in minutes by bootstrapping few-shot demonstrations from your trainset. If the baseline clears your eval, stop there. Climb to MIPROv2 when you have a couple hundred examples and want instructions and demos tuned together, and to GEPA when you have multiple objectives, a multi-module pipeline, or MIPROv2 has plateaued. That escalation, cheapest rung first, is what we call the DSPy Optimizer Ladder. Picking GEPA on day one for a one-prompt classifier is overkill.

Do DSPy optimizers need a labeled dataset?

They need a metric and a trainset, which is softer than full ground-truth labels. The metric is a function that scores a prediction, and it can be an exact match, a deterministic check, or an LLM-as-a-judge with a rubric when you have no reference answers. BootstrapFewShot needs enough examples to bootstrap demonstrations; MIPROv2 benefits from a couple hundred. The optimizer maximizes whatever the metric rewards, so a weak metric produces a program that games the metric rather than a better one.

What replaced teleprompters in DSPy?

Nothing replaced them; they were renamed. DSPy's optimization classes were originally called teleprompters and are now officially called optimizers, with the same role: tune the prompts and demonstrations in a DSPy program against a metric. If you read older tutorials referencing teleprompters and newer docs referencing optimizers, they are the same family of classes. The current lineup includes BootstrapFewShot, MIPROv2, COPRO, and GEPA.

How do I run prompt optimization without managing DSPy myself?

Use a platform that ships the same algorithm families as a managed service. DSPy is a framework you wire into your code: you author modules and signatures, define a metric, and run the optimizer in-process. If you want the optimization without that plumbing, Future AGI's agent-opt SDK exposes ProTeGi, GEPA, and meta-prompt optimizers behind one interface, scored by the same 50+ evaluators it uses for evals. The trade-off is control: DSPy gives you full module-level control, the platform gives you the optimizer and the eval in one stack.

View all

Engineering

Automatic Prompt Optimization in 2026: How Textual Gradients, Genetic Search, and Meta-Prompts Actually Work

Automatic prompt optimization explained: textual gradients (ProTeGi), score trajectories (OPRO), genetic evolution (GEPA), meta-prompting, and how to pick one.

Rishav Hada · May 29, 2026

10 min

Engineering

Agent Runtime Guardrails in 2026: The Tool-Call Scanners Most Stacks Skip

PII and toxicity scanners never see the tool call. Agent runtime guardrails (tool permissions, MCP security, system-prompt protection) catch what they miss.

Rishav Hada · May 29, 2026

7 min

Engineering

Falcon AI in 2026: The Platform-Native Copilot That Operates Your Eval Stack

A generic chatbot answers questions about your data. Falcon AI runs the eval, drills the trace, and files the ticket, with 300+ tools and page context.

Rishav Hada · May 29, 2026

7 min