What is DSPy? Stanford's Compiled Prompt Framework in 2026
DSPy is a Stanford framework that compiles LLM programs into optimized prompts. Signatures, modules, optimizers, MIPRO, and how it differs from LangChain.
A team has a question-answering system over financial filings. The hand-tuned ChainOfThought prompt with three few-shot examples plateaus on their dev set. No amount of system-prompt editing pushes the metric meaningfully higher. They rewrite the same logic in DSPy: a Signature with context, question -> answer, wrapped in a ChainOfThought module, compiled with MIPROv2 against a labeled training set. The compile pass takes the better part of a working session. The compiled program lands materially above the hand-tuned ceiling on the same dev set, with prompts and few-shot examples the team would not have written by hand. (Published DSPy benchmarks show a similar shape: ReAct went from 24% to 51%, and a documented RAG case from 53% to 61%, in the framework’s own examples.)
This is the DSPy thesis. Hand-prompting tops out. Optimizer-compiled programming pushes past that ceiling. The price is training inputs, a metric, and a compile pass. The benefit is consistent quality lift on workloads where prompting alone cannot reach the bar. This guide covers what DSPy is, how its primitives work, how it compares to alternatives, and when to pick it.
TL;DR: What DSPy is
DSPy is an open-source MIT-licensed Python framework from the Stanford NLP group for programming language models with declarative modules and compiled prompts. The repo at github.com/stanfordnlp/dspy has approximately 34,000 GitHub stars as of mid-2026 and the framework is on the 3.x line. The primitives are Signature (input-output spec), Module (a strategy like Predict, ChainOfThought, or ReAct), Program (composed Modules), Optimizer (the compile algorithm), and Metric (the scoring function). The 2026 optimizer menu includes MIPROv2 (Bayesian search over instructions and demos), GEPA (a newer reflective optimizer), SIMBA, BootstrapFewShot, and BetterTogether. DSPy’s core innovation is that you write modules and let the optimizer write the prompts.
Why DSPy matters in 2026
Three forces pushed compiled prompts from research curiosity to production tool.
First, hand-prompted quality stopped scaling. By 2026 most teams hit a ceiling with manual prompt engineering. Adding more system-prompt instructions saturates. Adding more few-shot examples blows context budgets. The prompts that beat the ceiling were ones humans would not have written; they came from systematic search. DSPy is the leading framework that operationalizes that search.
Second, the compile-once-deploy-many pattern matched production workflows. A compiled DSPy program is a frozen artifact. You compile it once on your training data, save the compiled prompts and few-shot examples to JSON, and run the program at inference time without re-running the optimizer. This separates the expensive optimization work from the cheap inference work.
Third, the 2026 optimizer menu broadened. The original BootstrapFewShot was useful but limited. MIPROv2’s Bayesian search over both instructions and few-shot examples produces measurable lifts on most labeled benchmarks. GEPA brought a reflective-optimization approach with its own dedicated tutorials in DSPy’s docs, and SIMBA/BetterTogether cover additional regimes. Picking the right optimizer is now a real design decision.
The anatomy of a DSPy program
The framework’s primitives compose cleanly.
Signature. A declarative spec for a prompt-able task. Written as a string (“context, question -> answer”) or a class with typed input and output fields. The framework turns the Signature into a prompt automatically; you do not see the prompt unless you ask for it.
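Both forms describe the same task. A minimal sketch of each, using the same QA fields as the walkthrough below:

```python
import dspy

# Inline string form: input fields left of "->", output fields right of it.
qa_inline = dspy.Predict("context, question -> answer")

# Class form of the same task, with typed fields and a docstring instruction.
class QA(dspy.Signature):
    """Answer the question using the context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
```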
Module. A strategy class that wraps a Signature and implements a behavior. dspy.Predict asks the model the question. dspy.ChainOfThought adds a reasoning step before the answer. dspy.ReAct adds a tool-using loop. dspy.ProgramOfThought adds code execution. dspy.MultiChainComparison compares multiple ChainOfThought outputs.
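A quick contrast between the two simplest strategies, assuming an LM has already been configured with dspy.configure. Same signature, two behaviors:

```python
import dspy

# Predict asks for the answer directly.
direct = dspy.Predict("question -> answer")

# ChainOfThought uses the same signature but inserts a reasoning step;
# its Prediction carries both .reasoning and .answer.
cot = dspy.ChainOfThought("question -> answer")

pred = cot(question="What is 17 * 24?")
print(pred.reasoning, pred.answer)
```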
Program. A composed dspy.Module subclass. You define a forward method that calls other modules in sequence or with branching. Programs can be nested arbitrarily; a Program can call another Program.
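A minimal composition sketch; the draft-then-refine split here is illustrative, not a DSPy built-in:

```python
import dspy

# A two-stage program: draft an answer, then refine it. Each stage is its own
# module, and the composed program can itself be called by a larger Program.
class DraftThenRefine(dspy.Module):
    def __init__(self):
        super().__init__()
        self.draft = dspy.ChainOfThought("question -> draft_answer")
        self.refine = dspy.Predict("question, draft_answer -> answer")

    def forward(self, question):
        draft = self.draft(question=question).draft_answer
        return self.refine(question=question, draft_answer=draft)
```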
LM. The provider abstraction. dspy.LM is configured at the global or program level with a LiteLLM-style model identifier (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-5-20250929, ollama_chat/llama3.2:1b) and provider-specific settings. Verify the exact provider/model strings against the current DSPy and LiteLLM docs at publish time.
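A short sketch of a global default plus a scoped override; the model strings are the LiteLLM-style identifiers listed above, so check them against current docs before use:

```python
import dspy

# Global default: every module uses this LM unless overridden.
dspy.configure(lm=dspy.LM("openai/gpt-4o", temperature=0.0))

# Scoped override, e.g. to run one call against a local Ollama model.
with dspy.context(lm=dspy.LM("ollama_chat/llama3.2:1b")):
    result = dspy.Predict("question -> answer")(question="2 + 2?")
```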
Optimizer. The compile algorithm. Takes a program, a metric, and a training set. Returns a new program with frozen prompts and few-shot examples. Common optimizers in 2026: BootstrapFewShot (basic), BootstrapFewShotWithRandomSearch (better), MIPROv2 (Bayesian search over instructions and demos), GEPA (reflective), SIMBA, and BetterTogether (combines prompt search with model fine-tuning).
Metric. A function metric(example, pred) -> bool | float that scores a prediction against a gold example. Required by every optimizer. Can be exact-match, semantic similarity, an LLM-as-judge call, or any custom function.
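Beyond exact match, a metric can call the LM itself. An LLM-as-judge sketch; the judge signature and its field names are illustrative, not a DSPy built-in:

```python
import dspy

# A small judge signature scores each prediction against the gold answer.
judge = dspy.Predict("question, gold_answer, predicted_answer -> correct: bool")

def judged_match(example, pred, trace=None):
    verdict = judge(question=example.question,
                    gold_answer=example.answer,
                    predicted_answer=pred.answer)
    return float(verdict.correct)
```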
Trainset and devset. Labeled dspy.Example lists used by the optimizer to compile and evaluate.
DSPy in 30 lines
```python
import dspy
from dspy.teleprompt import MIPROv2

# 1. Configure the LM.
dspy.configure(lm=dspy.LM("openai/gpt-4o"))

# 2. Define a Signature.
class QASignature(dspy.Signature):
    """Answer the question using the context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# 3. Wrap it in a Module.
class QAProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought(QASignature)

    def forward(self, context, question):
        return self.cot(context=context, question=question)

# 4. Define a metric and a training set.
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

trainset = [dspy.Example(context=c, question=q, answer=a).with_inputs("context", "question")
            for (c, q, a) in load_qa_pairs()]

# 5. Compile with an Optimizer.
optimizer = MIPROv2(metric=exact_match, auto="medium")
compiled_program = optimizer.compile(QAProgram(), trainset=trainset, max_bootstrapped_demos=3)

# 6. Save and load.
compiled_program.save("./compiled_qa.json")
loaded = QAProgram()
loaded.load("./compiled_qa.json")
result = loaded(context="...", question="...")
```
The compile pass searches over instructions and few-shot examples to maximize the metric. The result is a frozen program ready for inference.
How DSPy compares to alternatives
| Framework | Lead with | Best for | License |
|---|---|---|---|
| DSPy | Compiled prompt optimization | Tasks with a metric (and ideally labeled data) to optimize | MIT |
| LangChain | Prompt-as-string composition | Direct control over prompts and chain composition | MIT |
| LlamaIndex | RAG primitives | RAG-heavy applications | MIT |
| OpenAI Agents SDK | Agent loop with tools, handoffs, guardrails, HITL | OpenAI-centric single- or multi-agent workflows | MIT |
| Pydantic AI | Type-safe agents | Validated outputs, multi-provider stacks | MIT |
DSPy’s strength is the compile pass. If you do not have a metric and labeled data, the framework’s main feature does not apply. If you do, DSPy frequently produces lifts that hand-prompting cannot match.
Production patterns with DSPy
Three patterns recur.
Pattern 1: Compile once, deploy the JSON artifact. Compile the program in CI on a labeled training set. Commit the resulting compiled JSON to your repo or artifact store. At inference time, load the JSON and run the program. The compile pass is the expensive offline step; the inference is cheap.
Pattern 2: Module composition with a non-DSPy retriever. Wrap an existing retriever (LlamaIndex, Haystack, custom) as a callable. Use it inside a dspy.Module’s forward method. The DSPy module owns the prompts and few-shot examples; the retriever owns the data layer. This pattern combines the best of DSPy’s compile capabilities with another framework’s retrieval primitives.
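A sketch of the pattern, assuming retrieve is any callable mapping a query string to a list of passage strings:

```python
import dspy

class RAGProgram(dspy.Module):
    def __init__(self, retrieve):
        super().__init__()
        self.retrieve = retrieve  # non-DSPy data layer (LlamaIndex, Haystack, custom)
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # The retriever fetches passages; the compiled module owns the prompting.
        context = "\n\n".join(self.retrieve(question))
        return self.answer(context=context, question=question)
```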
Pattern 3: A/B compile against multiple metrics. Compile the same program with three different metrics (exact match, semantic similarity, LLM-as-judge). Compare the three compiled programs on a held-out set. The metric you optimize for is the metric the program optimizes for; running multiple variants surfaces the metric-quality tradeoff.
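A sketch of the pattern, reusing QAProgram, trainset, and exact_match from the walkthrough above; semantic_f1 and llm_judge stand in for your own metric functions, and devset is your held-out dspy.Example list:

```python
from dspy.evaluate import Evaluate
from dspy.teleprompt import MIPROv2

# Compile one variant per metric, then score all variants on the same devset.
metrics = {"exact": exact_match, "semantic": semantic_f1, "judge": llm_judge}
variants = {name: MIPROv2(metric=m, auto="light").compile(QAProgram(), trainset=trainset)
            for name, m in metrics.items()}

score_on_dev = Evaluate(devset=devset, metric=exact_match, num_threads=8)
for name, program in variants.items():
    print(name, score_on_dev(program))
```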
Common mistakes when adopting DSPy
- Skipping the metric. A weak metric produces a weak compile. Pick a metric that correlates with what you actually want; if your real goal is calibrated semantic correctness, do not optimize for exact match.
- Compiling on too small a trainset for the wrong optimizer. Some flows work with as few as 5-10 inputs (sometimes without labels), but MIPROv2 in particular benefits from materially more data; the DSPy docs typically suggest around 200 examples or more for serious instruction-and-demo search.
- Not separating trainset from devset. Use trainset for the optimizer and devset for the final score. Reusing the same set for both inflates the reported quality (see the split sketch after this list).
- Editing the compiled prompts manually. The whole point is that the framework writes the prompts. Editing them by hand reintroduces hand-prompting drift.
- Recompiling on every deploy. Compile in CI on a stable training set. Commit the JSON. Loading is fast; recompiling on every deploy is expensive and slows iteration.
- Forgetting context budget. MIPROv2 can produce few-shot example sets that push prompts over context limits. Configure max_bootstrapped_demos and max_labeled_demos for tight context budgets.
- Using DSPy where prompt-engineering is enough. A trivial single-shot task with a hand-tuned prompt that already hits the bar does not need DSPy. The framework earns its weight when the bar is higher than hand-prompting can reach.
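To make the split concrete, a minimal sketch; examples stands in for your labeled dspy.Example list, the sizes are illustrative, and optimizer is the MIPROv2 instance from the walkthrough:

```python
from dspy.evaluate import Evaluate

# The optimizer only ever sees `train`; the number you report comes from `dev`.
train, dev = examples[:200], examples[200:260]
compiled = optimizer.compile(QAProgram(), trainset=train)
dev_score = Evaluate(devset=dev, metric=exact_match, num_threads=8)(compiled)
print(dev_score)
```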
How to trace DSPy with FutureAGI
DSPy can be instrumented to emit OpenTelemetry-compatible spans through OpenInference or traceAI. To ship traces to FutureAGI’s observability platform or any other OTel backend with traceAI:
```bash
pip install traceai-dspy
```

```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_dspy import DSPyInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="qa-pipeline",
)
DSPyInstrumentor().instrument(tracer_provider=trace_provider)

# Your DSPy programs and modules now emit span trees once instrumentation is registered.
```
The resulting trace tree typically shows the program forward call at the root, every module invocation as a child span, and each Predict as a deeper LLM span. Whether the compiled prompt and few-shot examples land in the span attributes depends on instrumentation settings; verify what each package emits before relying on it.
How FutureAGI implements DSPy observability and prompt optimization
FutureAGI is the production-grade observability, evaluation, and prompt-optimization platform for DSPy, built around the closed reliability loop that other DSPy stacks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:
- DSPy tracing: traceAI (Apache 2.0) auto-wraps DSPy programs, modules, Predict calls, and ChainOfThought signatures via traceai-dspy for Python; the broader traceAI library covers 35+ frameworks across Python, TypeScript, Java, and C#; compiled prompts and few-shot examples land as span attributes when content tracing is enabled.
- Module-level evals: 50+ first-party metrics (Groundedness, Hallucination Detection, Context Adherence, Task Completion, Faithfulness, Tool Correctness) attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95.
- Prompt optimization: six first-party algorithms (GEPA, PromptWizard, ProTeGi, Bayesian, Meta-Prompt, Random) compose with DSPy compilers and consume failing trajectories as training data, so production regressions feed back into recompiled signatures that the CI gate evaluates against the same threshold.
- Gateway and guardrails: the Agent Command Center fronts 100+ providers with BYOK routing for the model wrappers DSPy modules call, and 18+ runtime guardrails enforce policy on the same plane.
Beyond the four axes, FutureAGI also ships persona-driven simulation that exercises DSPy programs in pre-prod with the same scorer contract used in production. Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II.
Most teams running DSPy in production end up running three or four tools alongside it: one for traces, one for evals, one for prompt optimization, one for the gateway and guardrails. FutureAGI is the recommended pick because tracing, evals, prompt optimization, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching. For more on the tracing model, read What is LLM Tracing?.
Sources
- DSPy GitHub repo
- DSPy documentation
- DSPy paper (arXiv)
- MIPROv2 paper (arXiv)
- Stanford NLP group
- DSPy optimizers guide
- DSPy modules guide
- OpenInference DSPy instrumentation
- traceAI repo
Series cross-link
Related: What is Prompt Engineering?, What is LLM Tracing?, Best Prompt Engineering Tools in 2026, What is LangGraph?
Frequently asked questions
What is DSPy in plain terms?
Who maintains DSPy and what license is it under?
How is DSPy different from LangChain?
What is a DSPy Optimizer?
What is a DSPy Signature?
What is a DSPy Module?
How do you trace DSPy programs?
When should I not use DSPy?