What is DSPy? Stanford's Compiled Prompt Framework in 2026
DSPy is a Stanford framework that compiles LLM programs into optimized prompts. Signatures, modules, optimizers, MIPRO, and how it differs from LangChain.
A team has a question-answering system over financial filings. The hand-tuned ChainOfThought prompt with three few-shot examples plateaus on their dev set. No amount of system-prompt editing pushes the metric meaningfully higher. They rewrite the same logic in DSPy: a Signature with context, question -> answer, wrapped in a ChainOfThought module, compiled with MIPROv2 against a labeled training set. The compile pass takes the better part of a working session. The compiled program lands materially above the hand-tuned ceiling on the same dev set, with prompts and few-shot examples the team would not have written by hand. (Published DSPy benchmarks show a similar shape: ReAct went from 24% to 51%, and a documented RAG case from 53% to 61%, in the framework’s own examples.)
This is the DSPy thesis. Hand-prompting tops out. Optimizer-compiled programming pushes past that ceiling. The price is training inputs, a metric, and a compile pass. The benefit is consistent quality lift on workloads where prompting alone cannot reach the bar. This guide covers what DSPy is, how its primitives work, how it compares to alternatives, and when to pick it.
TL;DR: What DSPy is
DSPy is an open-source MIT-licensed Python framework from the Stanford NLP group for programming language models with declarative modules and compiled prompts. The repo at github.com/stanfordnlp/dspy has approximately 34,000 GitHub stars as of mid-2026 and the framework is on the 3.x line. The primitives are Signature (input-output spec), Module (a strategy like Predict, ChainOfThought, or ReAct), Program (composed Modules), Optimizer (the compile algorithm), and Metric (the scoring function). The 2026 optimizer menu includes MIPROv2 (Bayesian search over instructions and demos), GEPA (a newer reflective optimizer), SIMBA, BootstrapFewShot, and BetterTogether. DSPy’s core innovation is that you write modules and let the optimizer write the prompts.
Why DSPy matters in 2026
Three forces pushed compiled prompts from research curiosity to production tool.
First, hand-prompted quality stopped scaling. By 2026 most teams hit a ceiling with manual prompt engineering. Adding more system-prompt instructions saturates. Adding more few-shot examples blows context budgets. The prompts that beat the ceiling were ones humans would not have written; they came from systematic search. DSPy is the leading framework that operationalizes that search.
Second, the compile-once-deploy-many pattern matched production workflows. A compiled DSPy program is a frozen artifact. You compile it once on your training data, save the compiled prompts and few-shot examples to JSON, and run the program at inference time without re-running the optimizer. This separates the expensive optimization work from the cheap inference work.
Third, the 2026 optimizer menu broadened. The original BootstrapFewShot was useful but limited. MIPROv2’s Bayesian search over both instructions and few-shot examples produces measurable lifts on most labeled benchmarks. GEPA brought a reflective-optimization approach with its own dedicated tutorials in DSPy’s docs, and SIMBA/BetterTogether cover additional regimes. Picking the right optimizer is now a real design decision.
The anatomy of a DSPy program
The framework’s primitives compose cleanly.
Signature. A declarative spec for a prompt-able task. Written as a string (“context, question -> answer”) or a class with typed input and output fields. The framework turns the Signature into a prompt automatically; you do not see the prompt unless you ask for it.
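Both forms describe the same task. A minimal sketch of each, using the same QA fields as the walkthrough below:

```python
import dspy

# Inline string form: input fields left of "->", output fields right of it.
qa_inline = dspy.Predict("context, question -> answer")

# Class form of the same task, with typed fields and a docstring instruction.
class QA(dspy.Signature):
    """Answer the question using the context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
```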
Module. A strategy class that wraps a Signature and implements a behavior. dspy.Predict asks the model the question. dspy.ChainOfThought adds a reasoning step before the answer. dspy.ReAct adds a tool-using loop. dspy.ProgramOfThought adds code execution. dspy.MultiChainComparison compares multiple ChainOfThought outputs.
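A quick contrast between the two simplest strategies, assuming an LM has already been configured with dspy.configure. Same signature, two behaviors:

```python
import dspy

# Predict asks for the answer directly.
direct = dspy.Predict("question -> answer")

# ChainOfThought uses the same signature but inserts a reasoning step;
# its Prediction carries both .reasoning and .answer.
cot = dspy.ChainOfThought("question -> answer")

pred = cot(question="What is 17 * 24?")
print(pred.reasoning, pred.answer)
```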
Program. A composed dspy.Module subclass. You define a forward method that calls other modules in sequence or with branching. Programs can be nested arbitrarily; a Program can call another Program.
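A minimal composition sketch; the draft-then-refine split here is illustrative, not a DSPy built-in:

```python
import dspy

# A two-stage program: draft an answer, then refine it. Each stage is its own
# module, and the composed program can itself be called by a larger Program.
class DraftThenRefine(dspy.Module):
    def __init__(self):
        super().__init__()
        self.draft = dspy.ChainOfThought("question -> draft_answer")
        self.refine = dspy.Predict("question, draft_answer -> answer")

    def forward(self, question):
        draft = self.draft(question=question).draft_answer
        return self.refine(question=question, draft_answer=draft)
```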
LM. The provider abstraction. dspy.LM is configured at the global or program level with a LiteLLM-style model identifier (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-5-20250929, ollama_chat/llama3.2:1b) and provider-specific settings. Verify the exact provider/model strings against the current DSPy and LiteLLM docs at publish time.
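A short sketch of a global default plus a scoped override; the model strings are the LiteLLM-style identifiers listed above, so check them against current docs before use:

```python
import dspy

# Global default: every module uses this LM unless overridden.
dspy.configure(lm=dspy.LM("openai/gpt-4o", temperature=0.0))

# Scoped override, e.g. to run one call against a local Ollama model.
with dspy.context(lm=dspy.LM("ollama_chat/llama3.2:1b")):
    result = dspy.Predict("question -> answer")(question="2 + 2?")
```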
Optimizer. The compile algorithm. Takes a program, a metric, and a training set. Returns a new program with frozen prompts and few-shot examples. Common optimizers in 2026: BootstrapFewShot (basic), BootstrapFewShotWithRandomSearch (better), MIPROv2 (Bayesian search over instructions and demos), GEPA (reflective), SIMBA, and BetterTogether (combines prompt search with model fine-tuning).
Metric. A function metric(example, pred) -> bool | float that scores a prediction against a gold example. Required by every optimizer. Can be exact-match, semantic similarity, an LLM-as-judge call, or any custom function.
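Beyond exact match, a metric can call the LM itself. An LLM-as-judge sketch; the judge signature and its field names are illustrative, not a DSPy built-in:

```python
import dspy

# A small judge signature scores each prediction against the gold answer.
judge = dspy.Predict("question, gold_answer, predicted_answer -> correct: bool")

def judged_match(example, pred, trace=None):
    verdict = judge(question=example.question,
                    gold_answer=example.answer,
                    predicted_answer=pred.answer)
    return float(verdict.correct)
```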
Trainset and devset. Labeled dspy.Example lists used by the optimizer to compile and evaluate.
DSPy in 30 lines
```python
import dspy
from dspy.teleprompt import MIPROv2

# 1. Configure the LM.
dspy.configure(lm=dspy.LM("openai/gpt-4o"))

# 2. Define a Signature.
class QASignature(dspy.Signature):
    """Answer the question using the context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# 3. Wrap it in a Module.
class QAProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought(QASignature)

    def forward(self, context, question):
        return self.cot(context=context, question=question)

# 4. Define a metric and a training set.
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

trainset = [dspy.Example(context=c, question=q, answer=a).with_inputs("context", "question")
            for (c, q, a) in load_qa_pairs()]

# 5. Compile with an Optimizer.
optimizer = MIPROv2(metric=exact_match, auto="medium")
compiled_program = optimizer.compile(QAProgram(), trainset=trainset, max_bootstrapped_demos=3)

# 6. Save and load.
compiled_program.save("./compiled_qa.json")
loaded = QAProgram()
loaded.load("./compiled_qa.json")
result = loaded(context="...", question="...")
```
The compile pass searches over instructions and few-shot examples to maximize the metric. The result is a frozen program ready for inference.
How DSPy compares to alternatives
| Framework | Lead with | Best for | License |
|---|---|---|---|
| DSPy | Compiled prompt optimization | Tasks with a metric (and ideally labeled data) to optimize | MIT |
| LangChain | Prompt-as-string composition | Direct control over prompts and chain composition | MIT |
| LlamaIndex | RAG primitives | RAG-heavy applications | MIT |
| OpenAI Agents SDK | Agent loop with tools, handoffs, guardrails, HITL | OpenAI-centric single- or multi-agent workflows | MIT |
| Pydantic AI | Type-safe agents | Validated outputs, multi-provider stacks | MIT |
DSPy’s strength is the compile pass. If you do not have a metric and labeled data, the framework’s main feature does not apply. If you do, DSPy frequently produces lifts that hand-prompting cannot match.
Production patterns with DSPy
Three patterns recur.
Pattern 1: Compile once, deploy the JSON artifact. Compile the program in CI on a labeled training set. Commit the resulting compiled JSON to your repo or artifact store. At inference time, load the JSON and run the program. The compile pass is the expensive offline step; the inference is cheap.
Pattern 2: Module composition with a non-DSPy retriever. Wrap an existing retriever (LlamaIndex, Haystack, custom) as a callable. Use it inside a dspy.Module’s forward method. The DSPy module owns the prompts and few-shot examples; the retriever owns the data layer. This pattern combines the best of DSPy’s compile capabilities with another framework’s retrieval primitives.
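A sketch of the pattern, assuming retrieve is any callable mapping a query string to a list of passage strings:

```python
import dspy

class RAGProgram(dspy.Module):
    def __init__(self, retrieve):
        super().__init__()
        self.retrieve = retrieve  # non-DSPy data layer (LlamaIndex, Haystack, custom)
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # The retriever fetches passages; the compiled module owns the prompting.
        context = "\n\n".join(self.retrieve(question))
        return self.answer(context=context, question=question)
```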
Pattern 3: A/B compile against multiple metrics. Compile the same program with three different metrics (exact match, semantic similarity, LLM-as-judge). Compare the three compiled programs on a held-out set. The metric you optimize for is the metric the program optimizes for; running multiple variants surfaces the metric-quality tradeoff.
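A sketch of the pattern, reusing QAProgram, trainset, and exact_match from the walkthrough above; semantic_f1 and llm_judge stand in for your own metric functions, and devset is your held-out dspy.Example list:

```python
from dspy.evaluate import Evaluate
from dspy.teleprompt import MIPROv2

# Compile one variant per metric, then score all variants on the same devset.
metrics = {"exact": exact_match, "semantic": semantic_f1, "judge": llm_judge}
variants = {name: MIPROv2(metric=m, auto="light").compile(QAProgram(), trainset=trainset)
            for name, m in metrics.items()}

score_on_dev = Evaluate(devset=devset, metric=exact_match, num_threads=8)
for name, program in variants.items():
    print(name, score_on_dev(program))
```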
Common mistakes when adopting DSPy
- Skipping the metric. A weak metric produces a weak compile. Pick a metric that correlates with what you actually want; if your real goal is calibrated semantic correctness, do not optimize for exact match.
- Compiling on too small a trainset for the wrong optimizer. Some flows work with as few as 5-10 inputs (sometimes without labels), but MIPROv2 in particular benefits from materially more data; the DSPy docs typically suggest around 200 examples or more for serious instruction-and-demo search.
- Not separating trainset from devset. Use trainset for the optimizer and devset for the final score. Reusing the same set for both inflates the reported quality (see the split sketch after this list).
- Editing the compiled prompts manually. The whole point is that the framework writes the prompts. Editing them by hand reintroduces hand-prompting drift.
- Recompiling on every deploy. Compile in CI on a stable training set. Commit the JSON. Loading is fast; recompiling on every deploy is expensive and slows iteration.
- Forgetting context budget. MIPROv2 can produce few-shot example sets that push prompts over context limits. Configure max_bootstrapped_demos and max_labeled_demos for tight context budgets.
- Using DSPy where prompt-engineering is enough. A trivial single-shot task with a hand-tuned prompt that already hits the bar does not need DSPy. The framework earns its weight when the bar is higher than hand-prompting can reach.
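To make the split concrete, a minimal sketch; examples stands in for your labeled dspy.Example list, the sizes are illustrative, and optimizer is the MIPROv2 instance from the walkthrough:

```python
from dspy.evaluate import Evaluate

# The optimizer only ever sees `train`; the number you report comes from `dev`.
train, dev = examples[:200], examples[200:260]
compiled = optimizer.compile(QAProgram(), trainset=train)
dev_score = Evaluate(devset=dev, metric=exact_match, num_threads=8)(compiled)
print(dev_score)
```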
How to trace DSPy with FutureAGI
DSPy can be instrumented to emit OpenTelemetry-compatible spans through OpenInference or traceAI. To ship traces to FutureAGI’s observability platform or any other OTel backend with traceAI:
```bash
pip install traceai-dspy
```

```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_dspy import DSPyInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="qa-pipeline",
)
DSPyInstrumentor().instrument(tracer_provider=trace_provider)

# Your DSPy programs and modules now emit span trees once instrumentation is registered.
```
The resulting trace tree typically shows the program forward call at the root, every module invocation as a child span, and each Predict as a deeper LLM span. Whether the compiled prompt and few-shot examples land in the span attributes depends on instrumentation settings; verify what each package emits before relying on it.
How FutureAGI implements DSPy observability and prompt optimization
FutureAGI is the production-grade observability, evaluation, and prompt-optimization platform for DSPy, built around the closed reliability loop that other DSPy stacks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:
- DSPy tracing: traceAI (Apache 2.0) auto-wraps DSPy programs, modules, Predict calls, and ChainOfThought signatures via traceai-dspy for Python; the broader traceAI library covers 35+ frameworks across Python, TypeScript, Java, and C#; compiled prompts and few-shot examples land as span attributes when content tracing is enabled.
- Module-level evals: 50+ first-party metrics (Groundedness, Hallucination Detection, Context Adherence, Task Completion, Faithfulness, Tool Correctness) attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95.
- Prompt optimization: six first-party algorithms (GEPA, PromptWizard, ProTeGi, Bayesian, Meta-Prompt, Random) compose with DSPy compilers and consume failing trajectories as training data, so production regressions feed back into recompiled signatures that the CI gate evaluates against the same threshold.
- Gateway and guardrails: the Agent Command Center fronts 100+ providers with BYOK routing for the model wrappers DSPy modules call, and 18+ runtime guardrails enforce policy on the same plane.
Beyond the four axes, FutureAGI also ships persona-driven simulation that exercises DSPy programs in pre-prod with the same scorer contract used in production. Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II.
Most teams running DSPy in production end up running three or four tools alongside it: one for traces, one for evals, one for prompt optimization, one for the gateway and guardrails. FutureAGI is the recommended pick because tracing, evals, prompt optimization, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching. For more on the tracing model, read What is LLM Tracing?.
Sources
- DSPy GitHub repo
- DSPy documentation
- DSPy paper (arXiv)
- MIPROv2 paper (arXiv)
- Stanford NLP group
- DSPy optimizers guide
- DSPy modules guide
- OpenInference DSPy instrumentation
- traceAI repo
Series cross-link
Related: What is Prompt Engineering?, What is LLM Tracing?, Best Prompt Engineering Tools in 2026, What is LangGraph?
Frequently asked questions
What is DSPy in plain terms?
Who maintains DSPy and what license is it under?
How is DSPy different from LangChain?
What is a DSPy Optimizer?
What is a DSPy Signature?
What is a DSPy Module?
How do you trace DSPy programs?
When should I not use DSPy?