
Evaluating DSPy Pipelines in 2026: Eval the Signature, Not the Program

Evaluating DSPy pipelines in 2026: why the compile metric isn't your production rubric, and how to eval the Signature instead of the program.

April 3, 2026

Updated May 20, 2026

11 min read

dspy llm-evaluation prompt-optimization agent-evaluation llm-observability 2026


DSPy treats a prompt the way a compiler treats source: something you optimize, not something you hand-edit. The idea is clean and the optimization loop works. The thing that breaks in production is downstream of compilation. The metric DSPy optimizes against is rarely the metric your product needs, and end-to-end scoring hides which Signature is the weak link when the program regresses. This guide is the workflow that ships: eval the Signature, not the program, layer composability metrics on top, and pair DSPy’s optimization loop with an external rubric that production and CI agree on.

TL;DR

A DSPy program needs eval on two levels: per-Signature rubrics for what each module is supposed to do, and composability metrics for the whole pipeline. The pattern that works:

Layer	Tool	What it tells you
Per-Signature rubric	`Evaluator` + `EvalTemplate` (`ai-evaluation`)	Is the Signature doing its job?
Module-level span	`DSPyInstrumentor` (traceAI)	Which Signature emitted what, with cost and latency
Composability metric	`CustomLLMJudge` with module-attribution	Where does cascade correctness break?
Compile-vs-runtime diff	Same Evaluator on compile-set and live traces	Did the compile overfit the cheap metric?
Failure clustering	Error Feed (HDBSCAN + Sonnet 4.5 Judge)	Named clusters with `immediate_fix`
Second-pass optimization	`agent-opt` (six optimizers)	Re-optimize against the production rubric

Skip the per-Signature layer and you debug an end-to-end score that tells you what fell over but not where.

Why DSPy compiles, but doesn’t ship

DSPy’s optimization-as-compilation is elegant. You write Signatures and Modules, pick a teleprompter (MIPRO, BootstrapFewShot, COPRO), hand the compiler a training set and a metric, and the string that hits the LLM in production gets generated. The compiler does the work a prompt engineer used to do by hand.

The catch is that the compiler optimizes against the metric you give it. That metric has to be cheap, because the teleprompter scores thousands of trial prompts in a pass. Cheap metrics are thin: an exact match on the final output, a substring check, a single judge call on the answer string. They’re useful for ranking trial prompts during a search, and they’re a bad proxy for what your product actually needs.

Production rubrics aren’t thin. A RAG pipeline needs groundedness against retrieved context, completeness on multi-part answers, refusal handling on out-of-scope queries, schema compliance on tool calls, and a cost budget per call. None of those compress to a single boolean the teleprompter can score 5,000 times in a compile. When the cheap compile metric and the rich production rubric disagree, the compiled prompt overfits the cheap one. Teams discover this when hold-out scores stay flat and live traffic scores fall.

That’s the metric mismatch problem in one sentence: DSPy optimizes against the metric you can afford to run during compilation, not the rubric your product is judged on in production. The fix isn’t a better cheap metric. The fix is to keep the cheap metric where it belongs (inside the compile loop) and run the rich rubric externally, at the Signature level, on the hold-out and on live traces.
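
A toy illustration of the mismatch, using only the standard library. The substring check mirrors the kind of cheap metric described above; `toy_groundedness` is a hypothetical stand-in for a rich rubric (it is not any library's implementation), and the passage and answer are made up for the example.

```python
def cheap_metric(expected: str, predicted: str) -> float:
    """Substring check: cheap enough to run thousands of times per compile."""
    return float(expected.lower() in predicted.lower())


def toy_groundedness(predicted: str, passages: list[str]) -> float:
    """Hypothetical rich-rubric stand-in: the share of answer sentences
    that appear verbatim in the retrieved context."""
    sentences = [s.strip().lower() for s in predicted.split(".") if s.strip()]
    if not sentences:
        return 0.0
    context = " ".join(passages).lower()
    supported = sum(1 for s in sentences if s in context)
    return supported / len(sentences)


# Made-up example: the answer contains the expected string but also an
# unsupported claim.
passages = ["The Eiffel Tower is in Paris. It opened in 1889."]
answer = "The Eiffel Tower is in Paris. It was built by NASA."

cheap = cheap_metric("Paris", answer)      # passes the compile metric
rich = toy_groundedness(answer, passages)  # only half the sentences are supported
```

The cheap metric gives the answer full marks while the richer check flags the unsupported sentence, which is the disagreement a compiled prompt can overfit.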

The thesis: eval the Signature, not the program

A DSPy Signature is the typed contract for a single module. Input fields, output fields, instruction string, optionally output schema. A program is the composition of multiple Signatures glued by Modules (Predict, ChainOfThought, ReAct, ProgramOfThought). Most DSPy eval tutorials score the program end-to-end. That makes the score easy to compute and useless to debug.

The shift is to score every Signature against a rubric tuned to its job, then layer composability metrics on the whole pipeline. Two scores, not one.

Per-Signature rubrics. A retrieval Signature gets ContextRelevance and ChunkAttribution. A reasoning Signature gets Groundedness and ContextAdherence. A final-answer Signature gets Completeness and AnswerRefusal. A tool-use Signature gets EvaluateFunctionCalling. The rubric matches the contract the Signature declares, not the rubric the program ships with.

Pipeline composability metrics. Per-Signature scores can all pass while the program fails end-to-end. That’s the cascade correctness problem: each module is locally correct, the composition is not. The fix is a CustomLLMJudge rubric that scores module attribution: when the end-to-end answer is wrong, which module’s output was the proximate cause? Run it across a sample of failing traces and the attribution gets ranked, not guessed.

The data layer that makes this practical is OTel spans. traceAI’s DSPyInstrumentor emits a top-level CHAIN span for the program and child LLM spans for every module call, tagged with dspy.module.name and dspy.signature.name. The Signature-level rubric runs against the span’s input and output fields. The composability rubric runs against the span tree.

Step 1: instrument so the Signature is visible

The eval stack reads from OTel spans. If the spans don’t carry Signature identity, per-Signature scoring isn’t possible. The setup is three packages: DSPy itself, the traceAI instrumentor, and the evaluation SDK.

pip install dspy-ai traceAI-dspy ai-evaluation

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_dspy import DSPyInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="dspy-rag-pipeline",
)

DSPyInstrumentor().instrument(tracer_provider=trace_provider)

Every pipeline run now emits a span tree. The top-level span carries fi.span.kind=CHAIN. Each module call is a child span with fi.span.kind=LLM plus dspy.module.name and dspy.signature.name. The same pattern works for the other 50+ AI surfaces traceAI ships (LangChain, LlamaIndex, CrewAI, Haystack, the OpenAI Agents SDK) across Python, TypeScript, and Java. A hybrid stack stays on one OTel trace tree, which matters when DSPy is one module inside a larger graph. The deeper instrumentation walkthrough lives in the AI observability tools breakdown.
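
Once spans carry Signature identity, routing them to the right rubric is a grouping step. The sketch below assumes spans have already been exported as flat dictionaries keyed by the attribute names mentioned above. The dictionary shape and the sample values are assumptions for illustration; a real exporter payload will look different.

```python
from collections import defaultdict

# Hypothetical exported spans (shape assumed for illustration only).
spans = [
    {"fi.span.kind": "CHAIN", "name": "RAGProgram"},
    {"fi.span.kind": "LLM", "dspy.module.name": "Predict",
     "dspy.signature.name": "RetrieveSignature", "output": "passage A"},
    {"fi.span.kind": "LLM", "dspy.module.name": "ChainOfThought",
     "dspy.signature.name": "AnswerSignature", "output": "answer A"},
]


def group_by_signature(spans):
    """Bucket module-level LLM spans by the Signature that produced them."""
    grouped = defaultdict(list)
    for span in spans:
        if span.get("fi.span.kind") != "LLM":
            continue  # skip the top-level CHAIN span
        grouped[span.get("dspy.signature.name", "unknown")].append(span)
    return dict(grouped)


by_signature = group_by_signature(spans)
```

Each bucket can then be scored with the rubric that matches that Signature's contract.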

Step 2: split the data three ways

A DSPy program needs three datasets, not one.

Training set. What the teleprompter compiles against. Sampled from production where possible, hand-curated where not. Usually 50-300 examples.
Hold-out set. Never seen by the teleprompter. Used to score every compile and gate promotion. Usually 50-150 examples, refreshed weekly from production traces.
Live-traffic sample. A 1-5% sample of production traces, scored with the same rubrics the hold-out uses. The drift signal.

The split is the same as in the LLM evaluation playbook, and the refresh cadence matters more than the raw count. A stale hold-out is how production drift hides for a quarter.
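
A minimal sketch of the three-way split, assuming examples and trace IDs are plain Python lists. The function names, the hold-out size, and the 2% sampling rate are illustrative choices within the ranges above, not part of any library.

```python
import random


def split_examples(examples, holdout_size, seed=0):
    """Shuffle once with a fixed seed, then cut into disjoint train and hold-out sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]


def sample_live_traces(trace_ids, rate=0.02, seed=0):
    """Keep roughly `rate` of production traces for rubric scoring."""
    rng = random.Random(seed)
    return [t for t in trace_ids if rng.random() < rate]


# Placeholder data standing in for curated examples and production trace IDs.
examples = [f"example-{i}" for i in range(200)]
train_set, holdout_set = split_examples(examples, holdout_size=60)
live_sample = sample_live_traces([f"trace-{i}" for i in range(10_000)])
```

Keeping the seed fixed makes the split reproducible between compiles; the weekly refresh then means regenerating `examples` from newer traces rather than reshuffling the old ones.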

Step 3: compile and score per Signature

This is the comparison sweep. Compile the program with each teleprompter you’re considering, then score each compile at the Signature level.

import dspy
from dspy.teleprompt import MIPROv2, BootstrapFewShot, COPRO

lm = dspy.LM(model="openai/gpt-4o-mini")
dspy.configure(lm=lm)

class RetrieveSignature(dspy.Signature):
    """Retrieve passages relevant to the question."""
    question = dspy.InputField()
    passages = dspy.OutputField()

class AnswerSignature(dspy.Signature):
    """Answer the question grounded in retrieved passages."""
    passages = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

class RAGProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Predict(RetrieveSignature)
        self.answer = dspy.ChainOfThought(AnswerSignature)

    def forward(self, question):
        ret = self.retrieve(question=question)
        return self.answer(passages=ret.passages, question=question)

def compile_metric(example, pred, trace=None):
    return float(example.answer.lower() in pred.answer.lower())

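base = RAGProgram()
# train_set: a list of dspy.Example objects with `question` and `answer` fields, built elsewhere.
compiled_mipro = MIPROv2(metric=compile_metric, auto="light").compile(base, trainset=train_set)
compiled_bs = BootstrapFewShot(metric=compile_metric, max_bootstrapped_demos=4).compile(base, trainset=train_set)
# COPRO.compile requires an eval_kwargs keyword argument; the value here is illustrative.
compiled_copro = COPRO(metric=compile_metric).compile(base, trainset=train_set, eval_kwargs={"num_threads": 4})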

compile_metric is the cheap thing the teleprompter scores during search. The rich rubric lives outside. Score every compile against the hold-out with templates that match each Signature’s job.

from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, Groundedness, Completeness, AnswerRefusal,
)
from fi.testcases import TestCase

evaluator = Evaluator(
    fi_api_key="<fi-api-key>",
    fi_secret_key="<fi-secret-key>",
)

retrieve_templates = [ContextRelevance()]
answer_templates = [Groundedness(), Completeness(), AnswerRefusal()]

def score_per_signature(program, holdout):
    retrieve_cases, answer_cases = [], []
    for ex in holdout:
        ret = program.retrieve(question=ex.question)
        ans = program.answer(passages=ret.passages, question=ex.question)
        retrieve_cases.append(TestCase(
            input=ex.question, output=ret.passages, expected_output=ex.passages,
        ))
        answer_cases.append(TestCase(
            input=ex.question, output=ans.answer,
            context=ret.passages, expected_output=ex.answer,
        ))
    return {
        "retrieve": evaluator.evaluate(eval_templates=retrieve_templates, inputs=retrieve_cases),
        "answer": evaluator.evaluate(eval_templates=answer_templates, inputs=answer_cases),
    }

scores_mipro = score_per_signature(compiled_mipro, holdout_set)
scores_bs = score_per_signature(compiled_bs, holdout_set)
scores_copro = score_per_signature(compiled_copro, holdout_set)

Now the comparison is real. If compiled_mipro wins on Groundedness but tanks on ContextRelevance, the answer module is doing fine and the retrieve module compiled badly. End-to-end scoring hides that. Per-Signature scoring names it.
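
To make that comparison mechanical, one option is to reduce each compile's results to mean scores per template and look for the weakest cell. The sketch below assumes that reduction has already happened. The nested-dict shape is an assumption (it is not the Evaluator's return type), and the numbers are placeholders, not measurements.

```python
def weakest_signature(scores):
    """Return (signature, template, score) for the lowest mean score."""
    rows = (
        (sig, template, value)
        for sig, templates in scores.items()
        for template, value in templates.items()
    )
    return min(rows, key=lambda row: row[2])


# Placeholder means for illustration only.
mipro_means = {
    "retrieve": {"ContextRelevance": 0.58},
    "answer": {"Groundedness": 0.91, "Completeness": 0.84},
}
bootstrap_means = {
    "retrieve": {"ContextRelevance": 0.81},
    "answer": {"Groundedness": 0.77, "Completeness": 0.80},
}

weakest_mipro = weakest_signature(mipro_means)
weakest_bootstrap = weakest_signature(bootstrap_means)
```

With these placeholder values the weak link differs per compile: retrieval for the first, the answer module for the second.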

For larger sweeps, the four distributed runners (Celery, Ray, Temporal, Kubernetes) parallelize the eval across compiled candidates. The 60+ EvalTemplate classes cover most rubrics teams write by hand; CustomLLMJudge covers the rest with a natural-language rubric and a pinned judge model.

Step 4: pipeline composability

Per-Signature scores can all pass while the program fails end-to-end. The retrieve Signature returns the right passages. The answer Signature reasons faithfully over them. The final answer is still wrong because the composition lost something between the two. That’s cascade correctness, and it needs its own metric.

The pattern: a CustomLLMJudge rubric that reads the full span tree (input question, retrieved passages, intermediate reasoning, final answer) and attributes the failure to a module.

from fi.evals.templates import CustomLLMJudge

attribution_judge = CustomLLMJudge(
    name="module_attribution",
    rubric="""Given the question, retrieved passages, chain-of-thought,
    and final answer, identify which module is the proximate cause of
    any failure: retrieve (wrong passages), reasoning (faithful reasoning
    but wrong inference), or answer (correct inference but malformed
    answer). If the answer is correct, return 'none'.""",
    model="turing-large",
)

Run it on a sample of failing traces and you get a distribution of proximate causes rather than a guess. A 60/30/10 split across retrieve/reasoning/answer tells you to fix retrieval first. A 20/20/60 split tells you the answer Signature is the weak link even though it scores well in isolation. End-to-end scoring would have flagged the failure and stopped there. Per-Signature scoring plus attribution tells you what to recompile. The setup borrows from the agent failure modes playbook.
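
Turning judge outputs into that split is a counting exercise. The sketch below assumes each failing trace has already been reduced to one of the labels the rubric above asks for; the label list is made-up sample data.

```python
from collections import Counter

VALID_LABELS = {"retrieve", "reasoning", "answer", "none"}


def attribution_shares(labels):
    """Rank modules by their share of attributed failures.
    'none' (correct answers) and unrecognized labels are ignored."""
    counts = Counter(
        label for label in labels if label in VALID_LABELS and label != "none"
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return [(module, count / total) for module, count in counts.most_common()]


# Made-up judge outputs for twelve sampled traces.
labels = ["retrieve"] * 6 + ["reasoning"] * 3 + ["answer"] * 1 + ["none"] * 2
ranking = attribution_shares(labels)
```

The first entry of `ranking` is the module to fix first.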

Step 5: cluster failures and feed the loop

Per-Signature scores plus module attribution tell you what failed. They don’t tell you why. The Error Feed handles that step.

Failing traces feed into HDBSCAN soft-clustering, which groups them by semantic similarity. A Sonnet 4.5 Judge reads each cluster and writes an immediate_fix: a concrete rubric tweak or a prompt edit. The fix flows back into the Platform’s self-improving evaluators, which adjust their scoring as more feedback comes in.

Clusters that recur in DSPy programs:

“MIPRO overfit the compile metric.” Hold-out scores high; live-traffic scores low on the same rubric. The fix is to rotate the training set and switch the compile metric to a richer judge.
“BootstrapFewShot demos are too verbose.” Compiled demonstrations eat 60-80% of the context budget. The fix is max_bootstrapped_demos=2 plus a length-aware filter on demo candidates.
“Cascade fails on the retrieve Signature.” Module attribution names retrieve as the proximate cause of 60%+ of end-to-end failures. The fix is upstream of the LLM call. See advanced chunking techniques for RAG for the retrieval-side moves.
“Answer Signature drifts on out-of-scope queries.” AnswerRefusal scores drop on the live-traffic sample but not the hold-out. The hold-out is missing the adversarial distribution. Refresh from sampled production traffic.
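
The length-aware filter mentioned for verbose demos could look something like the sketch below. It assumes demo candidates are plain dictionaries of field values and approximates token counts by whitespace splitting. Both are simplifications, and wiring the filtered demos back into a compile is left out.

```python
def approx_tokens(demo):
    """Rough token count: whitespace-separated words across all fields."""
    return sum(len(str(value).split()) for value in demo.values())


def filter_demos(demos, max_tokens_per_demo=150, max_demos=2):
    """Drop demos over the per-demo budget, then keep the shortest few."""
    short = [d for d in demos if approx_tokens(d) <= max_tokens_per_demo]
    return sorted(short, key=approx_tokens)[:max_demos]


# Made-up demo candidates; the first is far too verbose.
demos = [
    {"question": "q1", "answer": "word " * 400},
    {"question": "q2", "answer": "short answer"},
    {"question": "q3", "answer": "a slightly longer answer here"},
]
kept = filter_demos(demos)
```

The budget values are placeholders; in practice they would be set from the model's context window and the share of it demos are allowed to consume.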

Linear is the live routing integration today. Slack, GitHub, Jira, and PagerDuty are on the roadmap.

DSPy and agent-opt: honest framing

DSPy teleprompters and FAGI agent-opt both search the prompt space against a metric. They overlap on the optimizer step and complement each other on the eval-driven loop. Picking one to ditch isn’t the play.

agent-opt ships six optimizers:

RandomSearchOptimizer for the baseline sweep that tells you if prompt wording is the bottleneck.
BayesianSearchOptimizer on Optuna’s TPE sampler, with teacher-inferred few-shot and resumable studies that span multiple CI invocations.
MetaPromptOptimizer for hypothesis-driven single-rewrite passes.
ProTeGi for text-gradient critique with beam search.
GEPAOptimizer for reflective genetic search with a Pareto frontier.
PromptWizardOptimizer for multi-stage instruction refinement with thinking-style mixing.

EarlyStoppingConfig works across all six.

The honest comparison comes down to three points. DSPy is strongest at the Signature programming model and the teleprompter integration (MIPRO and BootstrapFewShot are deeply tied to that model). agent-opt is strongest when the optimizer needs to score against the same 60+ EvalTemplate rubric the production judge uses, when budget governance matters across a multi-day search, and when the production failure cohort needs to feed back into optimization. Many teams compile with DSPy for the Signature ergonomics, then run an agent-opt second pass against the live rubric so the optimizer and the production judge agree on what better means. The deeper walkthrough is in automated prompt improvement and automated optimization for agents.

Where DSPy stops being the right tool: when the Signatures become incidental and orchestration becomes the work. DSPy is excellent when the bulk of the program is typed Signatures and the teleprompter has real headroom. It gets thin when the program is mostly tool routing, multi-agent coordination, or long-running stateful workflows. The honest move is to keep DSPy for the Signature-shaped parts and swap in LangGraph or CrewAI for orchestration. The eval stack is the same regardless.

Production observability for DSPy programs

Three production patterns show up across DSPy deployments running on the FAGI stack.

One trace tree across a hybrid stack. A DSPy ChainOfThought for the reasoning step, a LangGraph state machine for orchestration, a CrewAI subagent for tool calls. traceAI instruments all three from one OTel provider. The full request lives on one span tree, scored by one rubric suite. The Signature-level rubric runs on the DSPy spans; the orchestration rubric runs on the LangGraph spans; the tool-call rubric runs on the CrewAI spans. One trace, one comparison surface.

Compile in CI, deploy on green. Every PR triggers a compile against the latest training slice. The compile output runs through the Evaluator against the hold-out at the Signature level. The CI gate fails if any per-Signature rubric drops more than two points from the previous compile, or if the composability rubric flags a regression in module attribution. The pattern is the same as the LLM evaluation playbook CI gate, with the compile artifact as the unit instead of the prompt string.
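
The two-point rule can be sketched as a pure function over stored score snapshots. This assumes scores are on a 0-100 scale and stored as nested dictionaries keyed by Signature and rubric; both are assumptions, and the numbers are placeholders.

```python
def find_regressions(previous, current, max_drop=2.0):
    """List every per-Signature rubric that dropped more than max_drop
    points, or that is missing from the current compile's scores."""
    failures = []
    for signature, rubrics in previous.items():
        for rubric, prev_score in rubrics.items():
            cur_score = current.get(signature, {}).get(rubric)
            if cur_score is None or prev_score - cur_score > max_drop:
                failures.append((signature, rubric, prev_score, cur_score))
    return failures


# Placeholder snapshots for illustration only.
previous = {"retrieve": {"ContextRelevance": 81.0}, "answer": {"Groundedness": 90.0}}
current = {"retrieve": {"ContextRelevance": 77.5}, "answer": {"Groundedness": 89.0}}

regressions = find_regressions(previous, current)
gate_passed = not regressions
```

A CI job would exit non-zero when `gate_passed` is false and print `regressions` so the failing Signature is named in the build log.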

Production scoring on sampled live traces. Same rubrics, applied to a 1-5% sample. Scores attach to OTel spans. Drift on any per-Signature rubric triggers an investigation before it triggers a recompile.

Anti-patterns that ship regressions silently

Three patterns show up in DSPy programs that broke in production.

Shipping the first compile. No comparison sweep, no baseline. The teleprompter ran once, the score looked fine, the compile went to prod. The next recompile produced something worse and nobody noticed for a week because the comparison baseline was never captured. Fix: every compile gets a compile_id and a stored Signature-level rubric snapshot.
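
One way to capture that baseline is to derive the compile_id from the inputs that determine a compile and store it next to the scores. The sketch below is a hypothetical scheme using only the standard library; the field names and the 12-character truncation are illustrative choices.

```python
import hashlib
import json


def make_compile_id(teleprompter, trainset_ids, config):
    """Deterministic ID from the teleprompter name, training example IDs,
    and compile config. The same inputs always yield the same ID."""
    payload = json.dumps(
        {"teleprompter": teleprompter, "trainset": sorted(trainset_ids), "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def snapshot(compile_id, scores):
    """Serialize the Signature-level scores alongside the compile ID."""
    return json.dumps({"compile_id": compile_id, "scores": scores}, sort_keys=True)


# Placeholder inputs and scores for illustration only.
compile_id = make_compile_id("MIPROv2", ["ex-2", "ex-1"], {"auto": "light"})
record = snapshot(compile_id, {"retrieve": {"ContextRelevance": 0.8}})
```

Stored records like `record` give the next recompile something concrete to be compared against.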

Single-teleprompter strategy. Compile with MIPRO, ignore BootstrapFewShot and COPRO, ship whatever MIPRO produces. The honest version is to compile with each, score per Signature, and pick on a weighted blend of rubric scores, cost, and latency. Marginal compile cost is small; cost of shipping the wrong teleprompter for the workload is large.
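
A weighted blend can be as simple as the sketch below. The weights, field names, and candidate numbers are all placeholders chosen for illustration; the right weights depend on the workload and on the units used for cost and latency.

```python
def blended_score(candidate, weights):
    """Higher is better: reward rubric quality, penalize cost and latency."""
    return (
        weights["quality"] * candidate["quality"]
        - weights["cost"] * candidate["cost_usd"]
        - weights["latency"] * candidate["latency_s"]
    )


# Placeholder per-compile summaries (not measurements).
candidates = {
    "mipro": {"quality": 0.88, "cost_usd": 0.004, "latency_s": 1.9},
    "bootstrap": {"quality": 0.85, "cost_usd": 0.009, "latency_s": 2.6},
    "copro": {"quality": 0.86, "cost_usd": 0.003, "latency_s": 1.5},
}
weights = {"quality": 1.0, "cost": 10.0, "latency": 0.05}

best = max(candidates, key=lambda name: blended_score(candidates[name], weights))
```

With these placeholder values the highest-quality compile does not win, because its latency penalty outweighs its small quality lead. That is the kind of trade-off a single-teleprompter strategy never surfaces.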

End-to-end scoring as the only signal. A single score tells you the program got worse. It doesn’t tell you which Signature regressed, which composition step broke, or whether the issue is upstream of any LLM call. Per-Signature rubrics plus a composability judge are the difference between a debuggable regression and a guess.

Where Future AGI fits

The FAGI eval stack maps to the DSPy lifecycle in three pieces.

ai-evaluation SDK (Apache 2.0) is the rubric layer. 60+ EvalTemplate classes, CustomLLMJudge for natural-language rubrics, 13 guardrail backends, four distributed runners. One Evaluator call scores compiled-program outputs and live traces with the same rubric.
The Future AGI Platform is the cost-and-feedback layer: self-improving evaluators tuned by production feedback, in-product agent authors for custom rubrics, lower per-eval cost than Galileo Luna-2, and SOC 2 Type II + HIPAA + GDPR + CCPA certified per the trust page.
The Error Feed is the failure-clustering layer inside the eval stack: HDBSCAN over failing traces, with a Sonnet 4.5 Judge that writes the immediate_fix.

The honest framing for DSPy users: if you want an OSS-only path, DSPy plus ai-evaluation plus traceAI gets you Signature-level rubrics, per-module observability, and a CI gate today. If you want self-improving evaluators, named failure clusters, and a second-pass optimizer that scores against the same rubric the production judge uses, the platform layer plus agent-opt is what you’d otherwise stitch together from four vendors.

The piece worth naming: the trace-stream-to-agent-opt connector that turns sampled production traces into an optimizer dataset is on the roadmap, not in production today. Until it lands, the loop is traces in traceAI, failures clustered in Error Feed, immediate_fix from the Sonnet 4.5 Judge, manual promotion of the fix into the next compile or the next agent-opt run.

Where to go next

What is DSPy for the programming model.
Best DSPy alternatives for when DSPy isn’t the right pick.
LLM evaluation playbook for the six-layer eval pattern this post specializes.
Automated prompt improvement for the agent-opt second-pass workflow.
Agent passes evals, fails production for the drift story this post inherits.

Frequently asked questions

Why isn't DSPy's compile metric enough to evaluate a DSPy pipeline?

DSPy compiles a program against a cheap metric so the teleprompter can score thousands of trial prompts in a single optimization pass. The metric has to be cheap, which means it's almost always a thin function of the final output: an exact match, a substring check, a single-judge call. Production scoring uses a richer rubric: groundedness against retrieved context, schema compliance on tool calls, refusal handling, latency, and cost. When the cheap compile metric and the rich production rubric disagree, the compiled prompt overfits the cheap one. The pattern that ships in production is to keep the cheap metric for compilation and run a separate, signature-level rubric suite on the hold-out and on live traces.

What does 'eval the Signature, not the program' mean?

A DSPy Signature is the typed input-output contract for a single module: input fields, output fields, instruction string. A DSPy program is the composition of multiple Signatures, glued by Modules like ChainOfThought or ReAct. Most DSPy eval tutorials score the program end-to-end. That hides which Signature is broken when the end-to-end score drops. Signature-level eval scores each Signature against a rubric tuned to its job: a retrieval Signature scored on context relevance, a reasoning Signature scored on groundedness, a final-answer Signature scored on completeness and refusal. Then you layer pipeline composability metrics (cascade correctness, module-level error attribution) on the whole thing. Two scores instead of one: per-Signature rubrics plus end-to-end composability.

How does Future AGI's ai-evaluation SDK plug into a DSPy program?

ai-evaluation is the external rubric. You compile with DSPy as usual, then run the same Evaluator suite against the compiled program's outputs and against per-module spans captured by traceAI's DSPyInstrumentor. The 60+ EvalTemplate classes (Groundedness, ContextAdherence, Completeness, TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, plus CustomLLMJudge for everything else) cover most rubrics teams write by hand. Same Evaluator call runs in CI on the hold-out and on a 1-5% sample of live traffic, so the production judge and the CI gate agree on what better means.

How does DSPy compare to FAGI agent-opt for prompt optimization?

Both search the prompt space against a metric, and neither replaces the other. DSPy is strong at structured Signatures and multi-module composition; the teleprompters (MIPRO, BootstrapFewShot, COPRO) ship with that programming model. agent-opt ships six optimizers (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard) that score against the same 60+ EvalTemplate suite the rest of the FAGI eval stack uses. Many teams compile with DSPy for the Signature ergonomics, then run agent-opt as a second pass against the live production rubric. DSPy gives you the program shape; agent-opt gives you optimization that agrees with production scoring.

What does traceAI's DSPyInstrumentor give you that DSPy doesn't?

Per-module observability without changing the DSPy code. Call DSPyInstrumentor().instrument(tracer_provider=trace_provider) once at startup and every pipeline run emits an OpenTelemetry span tree. The top-level CHAIN span covers the whole program; each module call is a child LLM span with attributes for input, output, latency, and token cost. That's the data layer the rest of the eval stack reads from. Without it, module-level signal (which Signature caused the end-to-end failure) is guesswork. With it, the attribution is automatic and the same OTel format works with the other 50+ AI surfaces traceAI ships.

What's the 5-step workflow for evaluating a DSPy pipeline?

(1) Instrument the program with DSPyInstrumentor so every module call becomes an OTel span with dspy.module.name and dspy.signature.name. (2) Build three datasets: a training slice the teleprompter sees, a hold-out the teleprompter never sees (refreshed weekly from production traces), and a 1-5% live-traffic sample scored with the same rubrics. (3) Compile with each teleprompter you care about (MIPRO, BootstrapFewShot, COPRO) and capture every compile with a compile_id. (4) Score each compiled program with the Evaluator suite at the Signature level (per-module rubrics) and the program level (composability rubrics). (5) Cluster failing traces with the Error Feed (HDBSCAN plus a Sonnet 4.5 Judge that writes an immediate_fix), then promote the fix into the next compile or the next agent-opt pass.

When does DSPy stop being the right tool?

When the Signatures become incidental and the orchestration becomes the work. DSPy is excellent when the bulk of the program is typed input-output contracts and the teleprompter has real headroom to optimize them. It gets thin when you need fine-grained tool routing, multi-agent coordination, long-running stateful workflows, or production-grade gateway features (routing, fallbacks, budgets). The honest framing: keep DSPy for the Signature-shaped parts of the program, swap in LangGraph or CrewAI for orchestration, and use the FAGI eval stack to score the whole thing under one rubric.
