Evaluating Instructor Structured Outputs (2026)
Evaluating Instructor structured outputs in 2026: per-field rubrics, cross-field consistency, numeric drift, and traceAI instrumentation.
A customer-extraction agent on Instructor ships at 98.7 percent schema-pass rate. Three weeks later the schema-pass rate is still 98.7 percent. But Customer.age keeps coming back at 35 on prompts that never mention an age, Invoice.total disagrees with sum(line_items) on roughly 4 percent of orders, and the severity: 'urgent' cases all carry escalation_required: False. Every Pydantic check passes. Every value is the wrong value. The eval that watches the schema is blind to all three.
Structured outputs guarantee the shape. They do not guarantee the content. Pydantic answers one question: is this JSON well-formed against the model. The questions that break in production are different ones: are the per-field values correct, do the fields agree with each other, do the numbers add up. This post is the working pattern for evaluating Instructor in 2026: schema validation as the free floor, per-field semantic rubrics as the actual work, cross-field consistency as the bug nobody runs in unit tests, plus the InstructorInstrumentor that makes every call legible and the Error Feed loop that clusters failures by field.
Why schema-valid is not eval-valid
Instructor’s pitch is that you stop parsing JSON. You write a Pydantic model, pass it as response_model, and the library handles function calling, parsing, and retry-on-validation-failure. By the time the call returns you have a typed object. The shape is guaranteed.
That is also where the evaluation usually stops. The reasoning sounds clean: Pydantic already checks the output, so what is left to score. Plenty. The schema check is one assertion per field, and it is the wrong one.
Consider this object, returned for the prompt “Extract the customer from: ‘Hi, I’d like to update my email to a@b.com.’”:
Customer(
    id=2147483647,
    age=204,
    signup_date=date(1812, 4, 19),
    email="a@b.com",
    tier="enterprise",
)
Every field passes Pydantic. id is an int, age is an int, signup_date is a date, email matches EmailStr, tier is one of the four Literal values. The schema-pass-rate metric reports 100 percent. The object is wrong on four fields and the only correct one is email.
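For concreteness, here is a minimal sketch of a schema that would accept the object above. The post does not show the Customer model, so the field types and the four tier names are assumptions, and email is a plain str here (the post uses EmailStr) to avoid the extra validator dependency.

```python
from datetime import date
from typing import Literal

from pydantic import BaseModel


class Customer(BaseModel):
    # Hypothetical schema: field names come from the example object,
    # the types and the tier names are assumptions for illustration.
    id: int
    age: int
    signup_date: date
    email: str  # the post uses EmailStr; plain str keeps the sketch dependency-free
    tier: Literal["free", "pro", "team", "enterprise"]


suspicious = Customer(
    id=2147483647,
    age=204,
    signup_date=date(1812, 4, 19),
    email="a@b.com",
    tier="enterprise",
)

# Construction succeeded, so a schema-pass metric counts this row as green
# even though only `email` is supported by the prompt.
schema_passed = isinstance(suspicious, Customer)
```

Nothing in the type annotations can know that the prompt never mentioned an age, a signup date, or a tier, which is why the check has to live outside the schema.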
This is the gap. Schema validation is a question about syntax. The model passes syntax effortlessly because the function-calling protocol writes the JSON for it. The questions that matter are semantic: is the value plausible for the input, does it agree with the other fields, does it sum correctly when arithmetic is implied.
The eval stack splits into three layers that map to the three classes of question. Schema validation is free; Instructor does it. Per-field semantic rubrics are the work, one judge per field with non-trivial semantics. Cross-field consistency is the bug everyone misses, a list of deterministic assertions spanning two or more fields. The remainder of this post is each layer with the working code.
Per-field rubric design: typed fields need typed judges
A Field(min_length=10, max_length=200) constraint says the string is between 10 and 200 characters. It says nothing about whether the string is on topic. A Literal["low", "normal", "high", "urgent"] says the value is one of four strings. It says nothing about which of the four is correct for the input. The semantic check is the part you write.
The rule of thumb: any field whose correctness depends on the input gets a CustomLLMJudge. Pure-structure fields (a UUID, a fixed-format date when the format is the entire spec) do not. Anything else does.
from pydantic import BaseModel, Field
from typing import Literal
from datetime import date

class TicketDraft(BaseModel):
    summary: str = Field(min_length=10, max_length=200)
    severity: Literal["low", "normal", "high", "urgent"]
    customer_id: str
    suggested_kb_articles: list[str] = Field(default_factory=list)
    escalation_required: bool
summary needs a judge: it is free text that must reference the actual issue, and a 47-character lorem-ipsum string would pass the length check. severity needs a judge: the Literal accepts any of four values, and the right value depends on the input. customer_id does not need a judge if its format is fully specified. suggested_kb_articles needs a judge for relevance. escalation_required is the one that needs a cross-field check (next section), not a single-field judge.
The judge pattern is one rubric per field, each rubric narrow enough that the score is unambiguous. A monolithic “is this object correct” judge produces uninterpretable averages. A per-field judge tells you which field is failing, which is the diagnostic that names the fix.
from fi.evals import Evaluator
from fi.evals.templates import CustomLLMJudge

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

severity_judge = CustomLLMJudge(
    name="SeverityFidelity",
    rubric=(
        "Given the user input and the chosen severity value, score whether the "
        "severity reflects the case. 'low' for routine questions, 'normal' for "
        "single-customer issues, 'high' for outages affecting multiple users, "
        "'urgent' for safety, security, or revenue-blocking incidents. "
        "Score 1-5. 5 = correct. 1 = off by two levels or more."
    ),
    input_mapping={
        "user_input": "input",
        "severity_value": "output.severity",
    },
)

summary_judge = CustomLLMJudge(
    name="SummaryGroundedness",
    rubric=(
        "Score whether the summary references the actual issue in the input. "
        "5 = summary accurately describes the issue. 1 = summary is unrelated, "
        "generic, or contains content not in the input."
    ),
    input_mapping={
        "user_input": "input",
        "summary_value": "output.summary",
    },
)
Per-field thresholds in CI follow the same logic. SeverityFidelity runs at >= 4.5 on a 1-5 scale because miscategorising urgency has real cost. SummaryGroundedness runs at >= 4.0 because the text has more variation. A flat aggregate hides the field that is breaking; per-axis thresholds fail the build on the axis that actually drifted. One bisect instead of three.
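A minimal sketch of that per-axis gate, using only the standard library. The score lists are made-up illustration data and failing_axes is a hypothetical helper, not part of the ai-evaluation SDK; the thresholds are the ones quoted above.

```python
from statistics import mean

# Hypothetical per-row judge scores on a 1-5 scale, keyed by axis name.
scores = {
    "SeverityFidelity": [5, 5, 3, 5, 3, 5],
    "SummaryGroundedness": [5, 4, 5, 5, 4, 5],
}

# Per-axis thresholds taken from the text above.
thresholds = {"SeverityFidelity": 4.5, "SummaryGroundedness": 4.0}


def failing_axes(scores, thresholds):
    """Return {axis: mean_score} for every axis below its threshold."""
    means = {axis: mean(values) for axis, values in scores.items()}
    return {axis: m for axis, m in means.items() if m < thresholds[axis]}


# One flat average over every score, which is what a single aggregate gate sees.
flat = mean(v for values in scores.values() for v in values)

# The per-axis view, which names the axis that actually drifted.
failed = failing_axes(scores, thresholds)
```

With this data the flat average is 4.5 and would clear a single aggregate gate, while failed contains only SeverityFidelity, so the build failure points at the field to fix.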
Cross-field consistency: the bug everyone misses
Schema validation fires per field. Cross-field bugs span fields. They are the failure mode that schema-only tests structurally cannot catch, and they are the most common silent failure in Instructor work.
Three shapes recur. The arithmetic shape: a derived field disagrees with its sources (Invoice.total != sum(line_item.quantity * line_item.unit_price), Order.shipping_total != sum(packages.shipping_cost)). The temporal shape: a date-range invariant breaks (booking.end_date < booking.start_date, subscription.canceled_at < subscription.started_at). The logical-implication shape: a flag and a category disagree (severity == 'urgent' but escalation_required is False, status == 'shipped' but tracking_number is None).
Every one of these passes per-field Pydantic. Every one of these breaks the downstream consumer that assumed the relationship held. Every one of these is one line of Python to check.
from pydantic import BaseModel, Field, model_validator
from typing import Literal
from datetime import date

class LineItem(BaseModel):
    sku: str
    quantity: int = Field(gt=0)
    unit_price: float = Field(ge=0)

class Invoice(BaseModel):
    invoice_id: str
    customer_id: str
    line_items: list[LineItem]
    total: float = Field(ge=0)
    issued_at: date
    due_at: date

    @model_validator(mode="after")
    def _consistency(self) -> "Invoice":
        derived = round(sum(li.quantity * li.unit_price for li in self.line_items), 2)
        if abs(self.total - derived) > 0.01:
            raise ValueError(
                f"total {self.total} disagrees with sum(line_items) {derived}"
            )
        if self.due_at < self.issued_at:
            raise ValueError(f"due_at {self.due_at} precedes issued_at {self.issued_at}")
        return self
@model_validator(mode="after") runs after every field has been validated, sees the whole object, and raises on any cross-field invariant. Instructor catches the ValueError and retries the LLM with the validator message in the prompt, so the model has the chance to fix its arithmetic before the call returns. The check is deterministic, runs at parse time, and pulls a non-trivial fraction of production failures back inside Instructor’s existing retry loop where they belong.
For consistency rules that need a model (a free-text field that should be consistent with a structured field, for instance), the check still runs deterministically at eval time as an assertion, not as a judge. Save the judge calls for the genuinely fuzzy questions.
def cross_field_checks(obj: TicketDraft) -> list[str]:
    errors = []
    if obj.severity in {"high", "urgent"} and not obj.escalation_required:
        errors.append(
            f"severity={obj.severity} implies escalation_required=True"
        )
    if obj.severity == "low" and obj.escalation_required:
        errors.append("severity=low contradicts escalation_required=True")
    if obj.suggested_kb_articles and obj.severity == "urgent":
        # urgent cases should not be self-serve-routed to KB
        errors.append("urgent cases should not return suggested_kb_articles")
    return errors
Run this as a deterministic pre-pass. Rows that fail short-circuit straight to the failure bucket without touching a judge. The CI report names the assertion that fired, which is the artifact that names the fix.
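The pre-pass routing can be sketched as follows. Draft is a dataclass stand-in for the TicketDraft fields the assertions read, check is a shortened copy of the assertions above, and route is a hypothetical helper; none of these names come from Instructor or the ai-evaluation SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    # Stand-in for the TicketDraft fields the assertions read.
    severity: str
    escalation_required: bool
    suggested_kb_articles: list = field(default_factory=list)


def check(obj):
    """Shortened version of the cross-field assertions, for the sketch only."""
    errors = []
    if obj.severity in {"high", "urgent"} and not obj.escalation_required:
        errors.append(f"severity={obj.severity} implies escalation_required=True")
    if obj.severity == "low" and obj.escalation_required:
        errors.append("severity=low contradicts escalation_required=True")
    return errors


def route(rows, check):
    """Split rows into (failure_bucket, judge_queue) using deterministic checks."""
    failures, judge_queue = [], []
    for row_id, obj in rows:
        errors = check(obj)
        if errors:
            # Keep the assertion text so the report names what fired.
            failures.append({"row_id": row_id, "assertions": errors})
        else:
            judge_queue.append((row_id, obj))
    return failures, judge_queue


rows = [
    ("r1", Draft("urgent", False)),
    ("r2", Draft("normal", False)),
    ("r3", Draft("low", True)),
]
failures, judge_queue = route(rows, check)
```

Only r2 reaches the judge queue here; r1 and r3 land in the failure bucket with the assertion text attached, so no judge tokens are spent on rows that are already known to be wrong.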
Numeric correctness: LLMs lie inside well-typed floats
The most underrated failure mode on structured outputs is numeric drift. The model is sampling tokens, not computing arithmetic, and a Field(ge=0, le=120) constraint on age says only that the number is between 0 and 120. The model can return 35 every time and pass every check. On prompts that never mention an age, that is what it tends to do, plausibly because 35 is close to the typical customer age in its training corpus.
Three checks blunt the problem. The first two are deterministic; only the third needs an LLM judge.
Tighten the range. Pydantic constraints are typically set by what is physically possible (age <= 120). The constraint that catches drift is set by what is plausible for this prompt. A user-onboarding flow rarely sees customers under 13 or over 90, so Field(ge=13, le=90) is a sharper bound. The narrower the range, the more drift it catches at parse time.
Cross-field arithmetic. When one numeric field is derivable from others, derive it and assert equality. Invoice.total == sum(line_items.quantity * line_items.unit_price). Order.discount_amount == subtotal * discount_percent / 100. Trip.duration_days == (end_date - start_date).days. Each of these is one line, runs at parse time, and catches a class of failure no LLM judge will reproduce reliably.
Numeric plausibility judge. For numbers that are not derivable, a narrow CustomLLMJudge scores whether the value is plausible for the input. Same pattern as the per-field semantic judges, except the score weights statistical realism rather than enum correctness.
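The two deterministic checks can be sketched in plain Python before they are moved into a model. Trip and numeric_checks are hypothetical names used only for this illustration; the 13 to 90 window is the onboarding example from the text, and a real window would be chosen per prompt.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Trip:
    # Hypothetical model used only for this sketch.
    traveler_age: int
    start_date: date
    end_date: date
    duration_days: int


def numeric_checks(trip, min_age=13, max_age=90):
    """Return the deterministic numeric assertions that fail for `trip`."""
    errors = []
    # Prompt-specific plausibility window, narrower than the physical range.
    if not min_age <= trip.traveler_age <= max_age:
        errors.append(
            f"traveler_age {trip.traveler_age} outside [{min_age}, {max_age}]"
        )
    # Derived field: recompute it from its sources and compare.
    derived = (trip.end_date - trip.start_date).days
    if trip.duration_days != derived:
        errors.append(f"duration_days {trip.duration_days} != derived {derived}")
    return errors


ok = Trip(34, date(2026, 3, 1), date(2026, 3, 8), 7)
bad = Trip(104, date(2026, 3, 1), date(2026, 3, 8), 10)

ok_errors = numeric_checks(ok)    # no assertion fires
bad_errors = numeric_checks(bad)  # both assertions fire
```

Note that a tightened range still accepts a constant in-range value such as 35; that residual case is what the plausibility judge is for.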
from pydantic import BaseModel, Field, model_validator

# LineItem is the model defined alongside Invoice in the previous section.

class Order(BaseModel):
    items: list[LineItem]
    subtotal: float = Field(ge=0)
    discount_percent: float = Field(ge=0, le=100)
    discount_amount: float = Field(ge=0)
    shipping_cost: float = Field(ge=0)
    total: float = Field(ge=0)

    @model_validator(mode="after")
    def _math(self) -> "Order":
        # Each check recomputes a derived value and compares within one cent.
        derived_subtotal = round(
            sum(li.quantity * li.unit_price for li in self.items), 2
        )
        if abs(self.subtotal - derived_subtotal) > 0.01:
            raise ValueError(f"subtotal != sum(items): {self.subtotal} vs {derived_subtotal}")
        derived_discount = round(self.subtotal * self.discount_percent / 100, 2)
        if abs(self.discount_amount - derived_discount) > 0.01:
            raise ValueError(
                f"discount_amount != subtotal * discount_percent / 100: "
                f"{self.discount_amount} vs {derived_discount}"
            )
        derived_total = round(self.subtotal - self.discount_amount + self.shipping_cost, 2)
        if abs(self.total - derived_total) > 0.01:
            raise ValueError(
                f"total != subtotal - discount + shipping: {self.total} vs {derived_total}"
            )
        return self
Three lines of arithmetic, three failure modes that schema-only validation will never see. The model_validator raises, Instructor retries with the error in the prompt, the model usually fixes its arithmetic on attempt two. The retries do show up in your bill, so track them per prompt template to keep the cost surface visible.
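Tracking retries per prompt template can be as small as the sketch below. The call log, the baseline numbers, and the 0.5 tolerance are invented for illustration, and how attempts get counted (for example from nested provider spans) is an assumption about your own telemetry rather than an Instructor feature.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical call log: (prompt_template_id, attempts used by one create call).
calls = [
    ("invoice_v3", 1), ("invoice_v3", 1), ("invoice_v3", 2),
    ("order_v7", 2), ("order_v7", 3), ("order_v7", 2),
]


def mean_attempts(calls):
    """Mean number of LLM attempts per logical call, grouped by template."""
    by_template = defaultdict(list)
    for template_id, attempts in calls:
        by_template[template_id].append(attempts)
    return {t: mean(a) for t, a in by_template.items()}


def drifted(current, baseline, tolerance=0.5):
    """Templates whose mean attempts rose more than `tolerance` over baseline."""
    return sorted(
        t for t, m in current.items() if m - baseline.get(t, m) > tolerance
    )


current = mean_attempts(calls)
baseline = {"invoice_v3": 1.2, "order_v7": 1.3}  # assumed earlier measurements
alerts = drifted(current, baseline)
```

Here only order_v7 is flagged: its mean moved from 1.3 to about 2.3 attempts per call, which is a cost change even if every call eventually validates.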
traceAI’s InstructorInstrumentor: what the spans actually carry
You cannot evaluate what you cannot see. The starting point for any production Instructor program is making every call legible as an OTel span tree. The InstructorInstrumentor (verified at traceAI/python/frameworks/instructor/traceai_instructor/__init__.py) does it with one call after the standard register.
pip install instructor traceai-instructor ai-evaluation
import os

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_instructor import InstructorInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="instructor-app",
    project_version_name="v1.4.0",
)

InstructorInstrumentor().instrument(tracer_provider=trace_provider)
The instrumentor wraps instructor.core.client.Instructor.create and AsyncInstructor.create, plus the legacy instructor.patch entry. Every patched call emits a span with fi.span.kind=TOOL. The request kwargs (model, response_model, messages, max_retries) serialize as the input.value JSON; the parsed Pydantic instance serializes as the output.value JSON in application/json MIME. The underlying LLM round trips ride on the OpenAI or Anthropic instrumentor below this one in the tree — token counts, per-attempt latency, and the raw completion content live there. If ai-evaluation is installed, the Protect guardrail wrapper attaches automatically so any Protect.protect call inside a tool inherits the span tree.
Instructor 1.x runs retries inside the same create call, so one logical structured-output operation appears as one parent TOOL span with the final parsed object on output.value and the underlying provider attempts nested below. That is the right granularity for evaluation aggregation — you score per logical call, not per HTTP round trip.
Pair the instrumentor with the EvalTag mechanism on the tracer and the per-field judges attach server-side. SeverityFidelity reads only spans whose output.value carries a severity field; total != sum(line_items) rows route to the consistency failure bucket. No application-side polling, no inline latency, and the production stream gets the same rubric stack as the offline regression set.
Production patterns: what the loop looks like end to end
The order of operations that works on Instructor projects.
Layer the gates cheapest-first. Pydantic schema validation and model_validator cross-field checks run at parse time inside Instructor at no token cost. Deterministic post-parse assertions (range tightening, regex checks) run next. Per-field CustomLLMJudge calls run last, only on rows the deterministic gates let through. Judges are the expensive token line item; push as much catch as possible to the free layers.
Build the golden set from production retries. Aim for about 300 rows, each a prompt that needed two or more attempts in the last 30 days of live traffic. That is where the model is closest to its limits and where the rubric needs the most resolution. Generic synthetic data underweights the prompts that actually fail.
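One way to select those rows, as a sketch. The row shape (timestamp, attempts) and the build_golden_set helper are assumptions for illustration, and ordering by attempt count is a design choice of this sketch, not something the text prescribes.

```python
from datetime import datetime, timedelta


def build_golden_set(rows, now, window_days=30, min_attempts=2, size=300):
    """Pick recent rows that needed retries, most-retried first."""
    cutoff = now - timedelta(days=window_days)
    candidates = [
        r for r in rows
        if r["timestamp"] >= cutoff and r["attempts"] >= min_attempts
    ]
    # Most attempts first; ties broken by older timestamp for a stable order.
    candidates.sort(key=lambda r: (-r["attempts"], r["timestamp"]))
    return candidates[:size]


now = datetime(2026, 1, 31)
rows = [
    {"id": "a", "timestamp": datetime(2026, 1, 20), "attempts": 3},
    {"id": "b", "timestamp": datetime(2026, 1, 25), "attempts": 1},   # no retry
    {"id": "c", "timestamp": datetime(2025, 12, 1), "attempts": 4},   # too old
    {"id": "d", "timestamp": datetime(2026, 1, 10), "attempts": 2},
]
golden_ids = [r["id"] for r in build_golden_set(rows, now)]
```

Rows b and c are excluded (one attempt, and outside the window), leaving a and d. In practice each selected row would also carry the prompt, the parsed object, and a reviewed golden object.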
Score offline and live with the same rubrics. Offline: evaluator.evaluate(...) over the golden set, gated in CI on per-axis thresholds. Live: EvalTag rules on the tracer apply the same rubrics to sampled production spans. A regression that escapes CI gets caught by the live stream within hours, not weeks.
Watch retries as a cost signal, not only a reliability signal. Instructor’s retry-on-validation-failure default is good for uptime and bad for the bill. A prompt template whose mean attempt count drifts from 1.1 to 2.4 has roughly doubled in cost without changing in correctness. The retries are visible on the nested provider spans below the parent TOOL span. Chart the distribution per response_model; alert on shifts.
from fi.evals import Evaluator
from fi.evals.templates import (
    LLMFunctionCalling,
    Completeness,
    CustomLLMJudge,
)

# API_KEY, SECRET_KEY, and production_replay_300 are placeholders for your
# credentials and the golden set; severity_judge and summary_judge are the
# per-field judges defined earlier.
evaluator = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY)

cases = [
    {
        "input": row["prompt"],
        "output": row["instructor_object"].model_dump_json(),
        "expected_output": row["golden_object"].model_dump_json(),
        "metadata": {"prompt_template": row["template_id"]},
    }
    for row in production_replay_300
]

results = evaluator.evaluate(
    eval_templates=[
        LLMFunctionCalling(),
        Completeness(),
        severity_judge,
        summary_judge,
    ],
    inputs=cases,
)
The CI gate is LLMFunctionCalling >= 0.90, SeverityFidelity >= 4.5 (1-5), SummaryGroundedness >= 4.0, Completeness >= 0.85, and zero cross-field consistency violations on the golden set. The build fails on the axis that broke. The fix lands on the field the axis names.
Error Feed clusters you will actually see
CI is necessary, not sufficient. A 300-case offline set is a snapshot; production is a river. Error Feed sits inside the eval stack and turns the river into named clusters with proposed fixes.
The pipeline is HDBSCAN soft-clustering over span embeddings in ClickHouse, then a Claude Sonnet 4.5 JudgeAgent with a 30-turn budget across eight span-tools (plus a Haiku Chauffeur for spans over 3000 characters; prompt-cache hit ratio around 90 percent). Per cluster, the Judge writes a 5-category 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each), and an immediate_fix naming the change to ship today.
On Instructor workloads the clusters tend to be field-shaped, which is the entire point of scoring per field:
- Per-field semantic drift. The severity enum keeps skewing to urgent on a prompt template, or tier skews to enterprise on inputs that mention no company. immediate_fix is a tightened rubric example, a one-shot, or a Pydantic Field description rewrite.
- Cross-field consistency violations. Invoice.total disagrees with sum(line_items) on orders over 1,000 dollars; start_date >= end_date on multi-month bookings. immediate_fix is a new @model_validator assertion that turns the silent failure into a retry.
- Numeric drift inside the range. Customer.age returning 35 when the input never mentioned age; Order.shipping_cost returning a flat 10 on every domestic order. immediate_fix is a narrower Field constraint or a plausibility judge.
- Completeness drift. Optional fields that should have been populated coming back empty on a sub-template the model now treats as out of scope. immediate_fix is usually a few-shot example or a prompt-template clarification.
Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. Clustering and the Judge’s immediate_fix are not gated on Linear; the value is the named cluster and the fix. Representative cluster rows promote into the regression set, so every escaped failure becomes a test the next CI run catches.
How Future AGI ships the Instructor eval stack
Three surfaces, one loop.
ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60+ EvalTemplate classes (TaskCompletion, EvaluateFunctionCalling aliased as LLMFunctionCalling, AnswerRefusal, Completeness, Groundedness, ChunkAttribution, ContextAdherence, plus 11 CustomerAgent* templates), CustomLLMJudge for the per-field rubrics, and 20+ local heuristic metrics that run sub-second with zero API cost.
traceAI (Apache 2.0) ships the InstructorInstrumentor with one-call setup, the OpenAI and Anthropic instrumentors that nest below it, the Protect guardrail wrapper that attaches when ai-evaluation is installed, and 50+ other instrumentors across Python, TypeScript, Java, and C#. EvalTag attaches per-field rubrics to spans for server-side scoring.
Future AGI Platform ships self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside with HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact.
agent-opt closes the loop. The six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume the per-field judge scores as separate study targets, so the prompt and the few-shot examples tune against the field that actually broke. The trace-stream-to-agent-opt connector is the active roadmap item; the eval-driven path ships today.
Honest tradeoff: if your response_model has three fields and no cross-field relationships, schema-only testing is enough. The three-layer stack earns its weight when the model has enough fields, enough enums, and enough numeric arithmetic that schema-pass-rate stops being a useful number.
Anti-patterns to avoid
Four mistakes that recur on real Instructor evaluation programs.
Schema-validity-only as the eval metric. The schema pass rate is a floor metric. Treating it as the ceiling produces a system that ships wrong values confidently because every test is green. Add at least one per-field judge per response_model.
Monolithic “is this object correct” judges. A single 1-5 score across the whole object is uninterpretable. Did severity break or did summary? Per-field judges produce per-field signal; per-field signal names the fix.
No cross-field assertions. The bug that schema-only evaluation cannot see is the bug that spans fields. Every response_model with arithmetic, date ranges, or implication relationships needs @model_validator cross-field checks. They are deterministic, free, and they pull failures back into Instructor’s retry loop.
Confusing retries with success. Instructor’s max-retries default keeps the application running through validation failures. It is good for reliability and bad for the bill. A prompt template whose retry distribution shifted is a cost regression even when the final pass rate is unchanged.
What to do this week
Five steps, one response_model.
- Wire InstructorInstrumentor().instrument(tracer_provider=trace_provider) into your project. Verify the TOOL span shows up with the parsed object on output.value and the underlying provider span nested below.
- Add @model_validator(mode="after") to one response_model and write the cross-field assertions for that model. Ship them. Watch the retry rate.
- Build a 300-row golden set from the last 30 days of production retries. Tag each row with the response_model and the prompt template.
- Define one CustomLLMJudge per field with non-trivial semantics. Run LLMFunctionCalling, Completeness, and the per-field judges over the golden set. Wire per-axis CI thresholds.
- Turn on Error Feed. Watch the first week’s clusters. Promote representative rows into the regression set. Run a BayesianSearchOptimizer study on the prompt template whose field-level judge scored worst.
The teams shipping reliable Instructor applications in 2026 stopped reporting schema-pass-rate and started reporting per-field judge scores. The library gives you the shape for free. The eval stack gives you the content, the consistency, and the cost.
Related reading
- Evaluating OpenAI Agents SDK: The Handoff Is the Test (2026)
- The Definitive Guide to AI Agent Evaluation (2026)
- LLM Function Calling Evaluation (2025)
- LLM Judge Prompt Engineering Guide (2026)
- Deterministic vs LLM Judge Evals (2026)
- Agent Passes Evals Fails Production (2026)
- Automated Optimization for Agent (2026)
Frequently asked questions
Instructor already validates outputs with Pydantic. Why do I need a separate evaluation layer?
What is the right eval stack for an Instructor application?
How does traceAI instrument Instructor?
Which Future AGI evaluation templates apply to Instructor outputs?
What is the cross-field consistency bug and why do schema-only evals miss it?
Why do LLMs lie inside well-typed floats and ints?
How does Error Feed cluster Instructor failures?