Evaluating DSPy Pipelines in 2026: Eval the Signature, Not the Program
Evaluating DSPy pipelines in 2026: why the compile metric isn't your production rubric, and how to eval the Signature instead of the program.
DSPy treats a prompt the way a compiler treats source: something you optimize, not something you hand-edit. The idea is clean and the optimization loop works. The thing that breaks in production is downstream of compilation. The metric DSPy optimizes against is rarely the metric your product needs, and end-to-end scoring hides which Signature is the weak link when the program regresses. This guide is the workflow that ships: eval the Signature, not the program, layer composability metrics on top, and pair DSPy’s optimization loop with an external rubric that production and CI agree on.
TL;DR
A DSPy program needs eval on two levels: per-Signature rubrics for what each module is supposed to do, and composability metrics for the whole pipeline. The pattern that works:
| Layer | Tool | What it tells you |
|---|---|---|
| Per-Signature rubric | Evaluator + EvalTemplate (ai-evaluation) | Is the Signature doing its job? |
| Module-level span | DSPyInstrumentor (traceAI) | Which Signature emitted what, with cost and latency |
| Composability metric | CustomLLMJudge with module-attribution | Where does cascade correctness break? |
| Compile-vs-runtime diff | Same Evaluator on compile-set and live traces | Did the compile overfit the cheap metric? |
| Failure clustering | Error Feed (HDBSCAN + Sonnet 4.5 Judge) | Named clusters with immediate_fix |
| Second-pass optimization | agent-opt (six optimizers) | Re-optimize against the production rubric |
Skip the per-Signature layer and you debug an end-to-end score that tells you what fell over but not where.
Why DSPy compiles, but doesn’t ship
DSPy’s optimization-as-compilation is elegant. You write Signatures and Modules, pick a teleprompter (MIPRO, BootstrapFewShot, COPRO), hand the compiler a training set and a metric, and the string that hits the LLM in production gets generated. The compiler does the work a prompt engineer used to do by hand.
The catch is that the compiler optimizes against the metric you give it. That metric has to be cheap, because the teleprompter scores thousands of trial prompts in a pass. Cheap metrics are thin: an exact match on the final output, a substring check, a single judge call on the answer string. They’re useful for ranking trial prompts during a search, and they’re a bad proxy for what your product actually needs.
Production rubrics aren’t thin. A RAG pipeline needs groundedness against retrieved context, completeness on multi-part answers, refusal handling on out-of-scope queries, schema compliance on tool calls, and a cost budget per call. None of those compress to a single boolean the teleprompter can score 5,000 times in a compile. When the cheap compile metric and the rich production rubric disagree, the compiled prompt overfits the cheap one. Teams discover this when hold-out scores stay flat and live traffic scores fall.
That’s the metric mismatch problem in one sentence: DSPy optimizes against the metric you can afford to run during compilation, not the rubric your product is judged on in production. The fix isn’t a better cheap metric. The fix is to keep the cheap metric where it belongs (inside the compile loop) and run the rich rubric externally, at the Signature level, on the hold-out and on live traces.
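To make the mismatch concrete, here is a minimal sketch in plain Python. `cheap_metric` mirrors a substring-style compile check. `grounded_in_context` is a hypothetical and deliberately crude stand-in for a groundedness rubric (a real one would be a judge call), not any library's API.

```python
def cheap_metric(expected: str, predicted: str) -> float:
    """Compile-time style check: does the expected string appear in the answer?"""
    return float(expected.lower() in predicted.lower())


def grounded_in_context(predicted: str, context: str) -> float:
    """Hypothetical stand-in for a groundedness rubric: the fraction of answer
    sentences whose words all appear in the retrieved context."""
    context_words = set(context.lower().replace(".", " ").split())
    sentences = [s.strip() for s in predicted.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        all(word in context_words for word in sentence.lower().split())
        for sentence in sentences
    )
    return supported / len(sentences)


context = "The warranty lasts two years."
answer = "The warranty lasts two years. It also covers accidental damage."

cheap = cheap_metric("two years", answer)   # the substring is present
rich = grounded_in_context(answer, context)  # the second sentence is unsupported
```

The cheap check gives this answer full marks, while the crude groundedness proxy penalizes the sentence the context never supports. A teleprompter that only sees the first score has no reason to avoid the second sentence.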
The thesis: eval the Signature, not the program
A DSPy Signature is the typed contract for a single module. Input fields, output fields, instruction string, optionally output schema. A program is the composition of multiple Signatures glued by Modules (Predict, ChainOfThought, ReAct, ProgramOfThought). Most DSPy eval tutorials score the program end-to-end. That makes the score easy to compute and useless to debug.
The shift is to score every Signature against a rubric tuned to its job, then layer composability metrics on the whole pipeline. Two scores, not one.
Per-Signature rubrics. A retrieval Signature gets ContextRelevance and ChunkAttribution. A reasoning Signature gets Groundedness and ContextAdherence. A final-answer Signature gets Completeness and AnswerRefusal. A tool-use Signature gets EvaluateFunctionCalling. The rubric matches the contract the Signature declares, not the rubric the program ships with.
Pipeline composability metrics. Per-Signature scores can all pass while the program fails end-to-end. That’s the cascade correctness problem: each module is locally correct, the composition is not. The fix is a CustomLLMJudge rubric that scores module attribution: when the end-to-end answer is wrong, which module’s output was the proximate cause? Run it across a sample of failing traces and the attribution gets ranked, not guessed.
The data layer that makes this practical is OTel spans. traceAI’s DSPyInstrumentor emits a top-level CHAIN span for the program and child LLM spans for every module call, tagged with dspy.module.name and dspy.signature.name. The Signature-level rubric runs against the span’s input and output fields. The composability rubric runs against the span tree.
Step 1: instrument so the Signature is visible
The eval stack reads from OTel spans. If the spans don’t carry Signature identity, per-Signature scoring isn’t possible. The setup is three packages: DSPy itself, the traceAI DSPy instrumentor, and the ai-evaluation SDK.
```shell
pip install dspy-ai traceAI-dspy ai-evaluation
```

```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_dspy import DSPyInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="dspy-rag-pipeline",
)
DSPyInstrumentor().instrument(tracer_provider=trace_provider)
```
Every pipeline run now emits a span tree. The top-level span carries fi.span.kind=CHAIN. Each module call is a child span with fi.span.kind=LLM plus dspy.module.name and dspy.signature.name. The same pattern works for the other 50+ AI surfaces traceAI ships (LangChain, LlamaIndex, CrewAI, Haystack, the OpenAI Agents SDK) across Python, TypeScript, and Java. A hybrid stack stays on one OTel trace tree, which matters when DSPy is one module inside a larger graph. The deeper instrumentation walkthrough lives in the AI observability tools breakdown.
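As a rough illustration of why the Signature tag matters, the sketch below groups exported spans by Signature so each group can be handed to its own rubric. The span records are hypothetical dictionaries: only the attribute keys (`fi.span.kind`, `dspy.module.name`, `dspy.signature.name`) come from the description above, and the `input`/`output` layout is an assumption, not traceAI's export format.

```python
from collections import defaultdict

# Hypothetical exported span records (layout assumed for illustration).
spans = [
    {"attributes": {"fi.span.kind": "CHAIN"},
     "input": "What is the warranty period?", "output": "Two years."},
    {"attributes": {"fi.span.kind": "LLM",
                    "dspy.module.name": "Predict",
                    "dspy.signature.name": "RetrieveSignature"},
     "input": "What is the warranty period?",
     "output": "The warranty lasts two years."},
    {"attributes": {"fi.span.kind": "LLM",
                    "dspy.module.name": "ChainOfThought",
                    "dspy.signature.name": "AnswerSignature"},
     "input": "The warranty lasts two years.", "output": "Two years."},
]


def group_by_signature(spans):
    """Collect (input, output) pairs per Signature, skipping non-LLM spans."""
    grouped = defaultdict(list)
    for span in spans:
        attrs = span["attributes"]
        if attrs.get("fi.span.kind") != "LLM":
            continue  # the CHAIN span belongs to the composability rubric
        signature = attrs.get("dspy.signature.name")
        if signature is not None:
            grouped[signature].append((span["input"], span["output"]))
    return dict(grouped)


cases_by_signature = group_by_signature(spans)
```

Each key in `cases_by_signature` then maps to the rubric set for that Signature, while the top-level CHAIN span is left for pipeline-level scoring.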
Step 2: split the data three ways
A DSPy program needs three datasets, not one.
- Training set. What the teleprompter compiles against. Sampled from production where possible, hand-curated where not. Usually 50-300 examples.
- Hold-out set. Never seen by the teleprompter. Used to score every compile and gate promotion. Usually 50-150 examples, refreshed weekly from production traces.
- Live-traffic sample. A 1-5% sample of production traces, scored with the same rubrics the hold-out uses. The drift signal.
The split is the same as in the LLM evaluation playbook, and the refresh cadence matters more than the raw count. A stale hold-out is how production drift hides for a quarter.
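One possible way to produce the three sets, sketched with the standard library only. The function names, the 30% hold-out fraction, and hashing the trace ID for sampling are all assumptions made for illustration; the point is that the hold-out is carved out before the teleprompter sees anything, and the live sample is deterministic per trace.

```python
import hashlib
import random


def split_train_holdout(examples, holdout_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed, then carve off the hold-out slice."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    holdout_size = round(len(shuffled) * holdout_fraction)
    return shuffled[holdout_size:], shuffled[:holdout_size]


def in_live_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < rate


train_set, holdout_set = split_train_holdout(range(200))
```

Hash-based sampling means a trace that was scored yesterday is still in the sample today, which keeps week-over-week comparisons on the same population.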
Step 3: compile and score per Signature
This is the comparison sweep. Compile the program with each teleprompter you’re considering, then score each compile at the Signature level.
```python
import dspy
from dspy.teleprompt import MIPROv2, BootstrapFewShot, COPRO

lm = dspy.LM(model="openai/gpt-4o-mini")
dspy.configure(lm=lm)


class RetrieveSignature(dspy.Signature):
    """Retrieve passages relevant to the question."""
    question = dspy.InputField()
    passages = dspy.OutputField()


class AnswerSignature(dspy.Signature):
    """Answer the question grounded in retrieved passages."""
    passages = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()


class RAGProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Predict(RetrieveSignature)
        self.answer = dspy.ChainOfThought(AnswerSignature)

    def forward(self, question):
        ret = self.retrieve(question=question)
        return self.answer(passages=ret.passages, question=question)


def compile_metric(example, pred, trace=None):
    # Cheap substring check: fast enough to run thousands of times per compile.
    return float(example.answer.lower() in pred.answer.lower())


# train_set is assumed to be a list of dspy.Example objects, loaded elsewhere,
# with `question` marked as the input field.
base = RAGProgram()
compiled_mipro = MIPROv2(metric=compile_metric, auto="light").compile(base, trainset=train_set)
compiled_bs = BootstrapFewShot(metric=compile_metric, max_bootstrapped_demos=4).compile(base, trainset=train_set)
# COPRO.compile also requires eval_kwargs, which it forwards to its evaluator.
compiled_copro = COPRO(metric=compile_metric).compile(
    base, trainset=train_set, eval_kwargs={"num_threads": 4, "display_progress": False},
)
```
compile_metric is the cheap thing the teleprompter scores during search. The rich rubric lives outside. Score every compile against the hold-out with templates that match each Signature’s job.
```python
from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, Groundedness, Completeness, AnswerRefusal,
)
from fi.testcases import TestCase

evaluator = Evaluator(
    fi_api_key="<fi-api-key>",
    fi_secret_key="<fi-secret-key>",
)

# One rubric set per Signature, matched to the contract each one declares.
retrieve_templates = [ContextRelevance()]
answer_templates = [Groundedness(), Completeness(), AnswerRefusal()]


def score_per_signature(program, holdout):
    retrieve_cases, answer_cases = [], []
    for ex in holdout:
        # Call each module directly so its output is scored in isolation.
        ret = program.retrieve(question=ex.question)
        ans = program.answer(passages=ret.passages, question=ex.question)
        retrieve_cases.append(TestCase(
            input=ex.question, output=ret.passages, expected_output=ex.passages,
        ))
        answer_cases.append(TestCase(
            input=ex.question, output=ans.answer,
            context=ret.passages, expected_output=ex.answer,
        ))
    return {
        "retrieve": evaluator.evaluate(eval_templates=retrieve_templates, inputs=retrieve_cases),
        "answer": evaluator.evaluate(eval_templates=answer_templates, inputs=answer_cases),
    }


# holdout_set: examples with question, passages, and answer fields,
# never shown to the teleprompter.
scores_mipro = score_per_signature(compiled_mipro, holdout_set)
scores_bs = score_per_signature(compiled_bs, holdout_set)
scores_copro = score_per_signature(compiled_copro, holdout_set)
```
Now the comparison is real. If compiled_mipro wins on Groundedness but tanks on ContextRelevance, the answer module is doing fine and the retrieve module compiled badly. End-to-end scoring hides that. Per-Signature scoring names it.
For larger sweeps, the four distributed runners (Celery, Ray, Temporal, Kubernetes) parallelize the eval across compiled candidates. The 60+ EvalTemplate classes cover most rubrics teams write by hand; CustomLLMJudge covers the rest with a natural-language rubric and a pinned judge model.
Step 4: pipeline composability
Per-Signature scores can all pass while the program fails end-to-end. The retrieve Signature returns the right passages. The answer Signature reasons faithfully over them. The final answer is still wrong because the composition lost something between the two. That’s cascade correctness, and it needs its own metric.
The pattern: a CustomLLMJudge rubric that reads the full span tree (input question, retrieved passages, intermediate reasoning, final answer) and attributes the failure to a module.
```python
from fi.evals.templates import CustomLLMJudge

attribution_judge = CustomLLMJudge(
    name="module_attribution",
    rubric="""Given the question, retrieved passages, chain-of-thought,
    and final answer, identify which module is the proximate cause of
    any failure: retrieve (wrong passages), reasoning (faithful reasoning
    but wrong inference), or answer (correct inference but malformed
    answer). If the answer is correct, return 'none'.""",
    model="turing-large",
)
```
Run it on a sample of failing traces and you get a distribution of attributions across modules. A 60/30/10 split across retrieve/reasoning/answer tells you to fix retrieval first. A 20/20/60 split tells you the answer Signature is the weak link even though it scores well in isolation. End-to-end scoring would have flagged the failure and stopped there. Per-Signature scoring plus attribution tells you what to recompile. The setup borrows from the agent failure modes playbook.
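Turning judge verdicts into that split is a small tallying step. The sketch below assumes the judge's verdicts have already been collected as plain strings (`retrieve`, `reasoning`, `answer`, or `none`, following the rubric above); `attribution_split` is a hypothetical helper, not part of any SDK.

```python
from collections import Counter


def attribution_split(labels):
    """Percentage of failures attributed to each module, largest first.

    Traces labeled 'none' are correct answers and are excluded.
    """
    failures = [label for label in labels if label != "none"]
    if not failures:
        return {}
    counts = Counter(failures)
    return {
        module: round(100 * count / len(failures))
        for module, count in counts.most_common()
    }


# Hypothetical verdicts from a sample of 15 traces.
labels = ["retrieve"] * 6 + ["reasoning"] * 3 + ["answer"] * 1 + ["none"] * 5
split = attribution_split(labels)
```

Here `split` comes out as a 60/30/10 distribution with retrieve first, which is the "fix retrieval first" case.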
Step 5: cluster failures and feed the loop
Per-Signature scores plus module attribution tell you what failed. They don’t tell you why. The Error Feed handles that step.
Failing traces feed into HDBSCAN soft-clustering, which groups them by semantic similarity. A Sonnet 4.5 Judge reads each cluster and writes an immediate_fix: a concrete rubric tweak or a prompt edit. The fix flows back into the Platform’s self-improving evaluators, which adjust their scoring as more feedback comes in.
Clusters that recur in DSPy programs:
- “MIPRO overfit the compile metric.” Hold-out scores high; live-traffic scores low on the same rubric. The fix is to rotate the training set and switch the compile metric to a richer judge.
- “BootstrapFewShot demos are too verbose.” Compiled demonstrations eat 60-80% of the context budget. The fix is max_bootstrapped_demos=2 plus a length-aware filter on demo candidates.
- “Cascade fails on the retrieve Signature.” Module attribution names retrieve as the proximate cause of 60%+ of end-to-end failures. The fix is upstream of the LLM call. See advanced chunking techniques for RAG for the retrieval-side moves.
- “Answer Signature drifts on out-of-scope queries.” AnswerRefusal scores drop on the live-traffic sample but not the hold-out. The hold-out is missing the adversarial distribution. Refresh from sampled production traffic.
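For the verbose-demo cluster, a length-aware filter can be as simple as the sketch below. It assumes demo candidates are plain dictionaries of field name to text and uses a whitespace word count as a crude token proxy; both are illustrative assumptions, and a real filter would use the model's tokenizer and DSPy's own demo objects.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words."""
    return len(text.split())


def filter_demos(demos, max_tokens_per_demo=60, max_demos=2):
    """Drop demos over the length budget, then keep the shortest few."""
    sized = [
        (sum(approx_tokens(str(value)) for value in demo.values()), demo)
        for demo in demos
    ]
    within_budget = [pair for pair in sized if pair[0] <= max_tokens_per_demo]
    within_budget.sort(key=lambda pair: pair[0])
    return [demo for _, demo in within_budget[:max_demos]]


# Hypothetical demo candidates.
candidates = [
    {"question": "What is the warranty period?", "answer": "Two years."},
    {"question": "Explain the refund policy.", "answer": "word " * 200},
    {"question": "Is shipping free?", "answer": "Yes, on orders over fifty dollars."},
    {"question": "Where is the store?", "answer": "Downtown."},
]
kept = filter_demos(candidates)
```

Keeping the shortest demos is one policy among several. Ranking by metric score within the length budget is an equally reasonable choice.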
Linear is the live routing integration today. Slack, GitHub, Jira, and PagerDuty are on the roadmap.
DSPy and agent-opt: honest framing
DSPy teleprompters and FAGI agent-opt both search the prompt space against a metric. They overlap on the optimizer step and complement each other on the eval-driven loop. Picking one to ditch isn’t the play.
agent-opt ships six optimizers:
- RandomSearchOptimizer for the baseline sweep that tells you if prompt wording is the bottleneck.
- BayesianSearchOptimizer on Optuna’s TPE sampler, with teacher-inferred few-shot and resumable studies that span multiple CI invocations.
- MetaPromptOptimizer for hypothesis-driven single-rewrite passes.
- ProTeGi for text-gradient critique with beam search.
- GEPAOptimizer for reflective genetic search with a Pareto frontier.
- PromptWizardOptimizer for multi-stage instruction refinement with thinking-style mixing.
EarlyStoppingConfig works across all six.
The honest comparison comes down to three points. DSPy is strongest at the Signature programming model and the teleprompter integration (MIPRO and BootstrapFewShot are deeply tied to that model). agent-opt is strongest when the optimizer needs to score against the same 60+ EvalTemplate rubric the production judge uses, when budget governance matters across a multi-day search, and when the production failure cohort needs to feed back into optimization. Many teams compile with DSPy for the Signature ergonomics, then run an agent-opt second pass against the live rubric so the optimizer and the production judge agree on what better means. The deeper walkthrough is in automated prompt improvement and automated optimization for agents.
Where DSPy stops being the right tool: when the Signatures become incidental and orchestration becomes the work. DSPy is excellent when the bulk of the program is typed Signatures and the teleprompter has real headroom. It gets thin when the program is mostly tool routing, multi-agent coordination, or long-running stateful workflows. The honest move is to keep DSPy for the Signature-shaped parts and swap in LangGraph or CrewAI for orchestration. The eval stack is the same regardless.
Production observability for DSPy programs
Three production patterns show up across DSPy deployments running on the FAGI stack.
One trace tree across a hybrid stack. A DSPy ChainOfThought for the reasoning step, a LangGraph state machine for orchestration, a CrewAI subagent for tool calls. traceAI instruments all three from one OTel provider. The full request lives on one span tree, scored by one rubric suite. The Signature-level rubric runs on the DSPy spans; the orchestration rubric runs on the LangGraph spans; the tool-call rubric runs on the CrewAI spans. One trace, one comparison surface.
Compile in CI, deploy on green. Every PR triggers a compile against the latest training slice. The compile output runs through the Evaluator against the hold-out at the Signature level. The CI gate fails if any per-Signature rubric drops more than two points from the previous compile, or if the composability rubric flags a regression in module attribution. The pattern is the same as the LLM evaluation playbook CI gate, with the compile artifact as the unit instead of the prompt string.
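The two-point gate can be sketched in a few lines. This assumes per-Signature rubric scores on a 0-100 scale stored as nested dictionaries. The `find_regressions` helper and the data layout are hypothetical, and a missing rubric in the new compile is treated as a failure.

```python
def find_regressions(previous, current, max_drop=2.0):
    """Return (signature, rubric, old, new) for every score that fell too far."""
    regressions = []
    for signature, rubrics in previous.items():
        for rubric, old_score in rubrics.items():
            new_score = current.get(signature, {}).get(rubric)
            if new_score is None or old_score - new_score > max_drop:
                regressions.append((signature, rubric, old_score, new_score))
    return regressions


# Hypothetical snapshots from the previous and current compile.
previous = {
    "RetrieveSignature": {"ContextRelevance": 84},
    "AnswerSignature": {"Groundedness": 91, "Completeness": 78},
}
current = {
    "RetrieveSignature": {"ContextRelevance": 79},
    "AnswerSignature": {"Groundedness": 92, "Completeness": 77},
}

regressions = find_regressions(previous, current)
gate_passed = not regressions
```

In a CI job, a non-empty `regressions` list would fail the build and print the offending Signature and rubric, which is the information an end-to-end score cannot provide.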
Production scoring on sampled live traces. Same rubrics, applied to a 1-5% sample. Scores attach to OTel spans. Drift on any per-Signature rubric triggers an investigation before it triggers a recompile.
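A drift check on the sampled traffic might look like the following sketch. It assumes rubric scores in the 0-1 range collected as lists per rubric, and flags a rubric when the hold-out mean exceeds the live mean by more than a threshold. The 0.05 threshold and the `drift_report` helper are illustrative assumptions.

```python
from statistics import mean


def drift_report(holdout_scores, live_scores, threshold=0.05):
    """Compare hold-out and live means per rubric and flag large gaps."""
    report = {}
    for rubric, holdout_values in holdout_scores.items():
        live_values = live_scores.get(rubric)
        if not live_values:
            continue  # no live data for this rubric yet
        gap = mean(holdout_values) - mean(live_values)
        report[rubric] = {"gap": gap, "drifted": gap > threshold}
    return report


# Hypothetical scores for the answer Signature.
holdout_scores = {"Groundedness": [0.9, 0.9], "AnswerRefusal": [0.8, 0.8]}
live_scores = {"Groundedness": [0.7, 0.7], "AnswerRefusal": [0.8, 0.8]}

report = drift_report(holdout_scores, live_scores)
```

A flagged rubric is a prompt to inspect the failing traces first. A gap that comes from a stale hold-out calls for a data refresh rather than a recompile.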
Anti-patterns that ship regressions silently
Three patterns show up in DSPy programs that broke in production.
Shipping the first compile. No comparison sweep, no baseline. The teleprompter ran once, the score looked fine, the compile went to prod. The next recompile produced something worse and nobody noticed for a week because the comparison baseline was never captured. Fix: every compile gets a compile_id and a stored Signature-level rubric snapshot.
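One way to capture that baseline, sketched with the standard library. Deriving the `compile_id` from a hash of the teleprompter name, training example IDs, and compiled prompt text is an assumption made for illustration, as is the in-memory `registry`; any stable identifier and durable store would serve.

```python
import hashlib
import json


def make_compile_id(teleprompter, trainset_ids, prompt_text):
    """Stable short ID: the same inputs always produce the same compile_id."""
    payload = json.dumps(
        {
            "teleprompter": teleprompter,
            "trainset_ids": sorted(trainset_ids),
            "prompt": prompt_text,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def record_snapshot(registry, compile_id, scores):
    """Store a copy of the per-Signature scores under the compile_id."""
    registry[compile_id] = json.loads(json.dumps(scores))  # deep copy via JSON
    return registry[compile_id]


registry = {}
compile_id = make_compile_id("MIPROv2", ["ex-2", "ex-1"], "Answer the question...")
record_snapshot(registry, compile_id, {"AnswerSignature": {"Groundedness": 91}})
```

With the snapshot stored, the next recompile has something to be compared against instead of being judged on its own score in isolation.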
Single-teleprompter strategy. Compile with MIPRO, ignore BootstrapFewShot and COPRO, ship whatever MIPRO produces. The honest version is to compile with each, score per Signature, and pick on a weighted blend of rubric scores, cost, and latency. Marginal compile cost is small; cost of shipping the wrong teleprompter for the workload is large.
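The weighted blend could be computed along these lines. The weights, the normalization by the maximum cost and latency across candidates, and all the numbers are illustrative assumptions rather than recommended values.

```python
from statistics import mean


def blended_scores(candidates, w_quality=0.7, w_cost=0.15, w_latency=0.15):
    """Score each compile: reward rubric quality, penalize relative cost and latency."""
    max_cost = max(c["cost_usd"] for c in candidates.values())
    max_latency = max(c["latency_s"] for c in candidates.values())
    return {
        name: (
            w_quality * mean(c["rubric_scores"])
            - w_cost * c["cost_usd"] / max_cost
            - w_latency * c["latency_s"] / max_latency
        )
        for name, c in candidates.items()
    }


# Hypothetical per-compile measurements (rubric scores in the 0-1 range).
candidates = {
    "mipro": {"rubric_scores": [0.90, 0.80], "cost_usd": 0.004, "latency_s": 2.0},
    "bootstrap": {"rubric_scores": [0.85, 0.80], "cost_usd": 0.002, "latency_s": 1.0},
    "copro": {"rubric_scores": [0.70, 0.75], "cost_usd": 0.001, "latency_s": 0.8},
}

scores = blended_scores(candidates)
best = max(scores, key=scores.get)
```

With these made-up numbers the compile with the highest raw rubric score does not win once cost and latency are counted, which is the case a single-teleprompter strategy never sees.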
End-to-end scoring as the only signal. A single score tells you the program got worse. It doesn’t tell you which Signature regressed, which composition step broke, or whether the issue is upstream of any LLM call. Per-Signature rubrics plus a composability judge are the difference between a debuggable regression and a guess.
Where Future AGI fits
The FAGI eval stack maps to the DSPy lifecycle in three pieces.
- ai-evaluation SDK (Apache 2.0) is the rubric layer. 60+ EvalTemplate classes, CustomLLMJudge for natural-language rubrics, 13 guardrail backends, four distributed runners. One Evaluator call scores compiled-program outputs and live traces with the same rubric.
- The Future AGI Platform is the cost-and-feedback layer: self-improving evaluators tuned by production feedback, in-product agent authors for custom rubrics, lower per-eval cost than Galileo Luna-2, and SOC 2 Type II, HIPAA, GDPR, and CCPA certifications per the trust page.
- The Error Feed is the failure-clustering layer inside the eval stack: HDBSCAN over failing traces, with a Sonnet 4.5 Judge that writes the immediate_fix.
The honest framing for DSPy users: if you want an OSS-only path, DSPy plus ai-evaluation plus traceAI gets you Signature-level rubrics, per-module observability, and a CI gate today. If you want self-improving evaluators, named failure clusters, and a second-pass optimizer that scores against the same rubric the production judge uses, the platform layer plus agent-opt is what you’d otherwise stitch together from four vendors.
The piece worth naming: the trace-stream-to-agent-opt connector that turns sampled production traces into an optimizer dataset is on the roadmap, not in production today. Until it lands, the loop is traces in traceAI, failures clustered in Error Feed, immediate_fix from the Sonnet 4.5 Judge, manual promotion of the fix into the next compile or the next agent-opt run.
Where to go next
- What is DSPy for the programming model.
- Best DSPy alternatives for when DSPy isn’t the right pick.
- LLM evaluation playbook for the six-layer eval pattern this post specializes.
- Automated prompt improvement for the agent-opt second-pass workflow.
- Agent passes evals, fails production for the drift story this post inherits.