Evaluating Instructor Structured Outputs (2026)

Evaluating Instructor structured outputs in 2026: per-field rubrics, cross-field consistency, numeric drift, and traceAI instrumentation.

April 26, 2026

Updated May 20, 2026

instructor structured-outputs pydantic llm-evaluation traceai function-calling python 2026

A customer-extraction agent on Instructor ships at 98.7 percent schema-pass rate. Three weeks later the schema-pass rate is still 98.7 percent. But Customer.age keeps coming back at 35 on prompts that never mention an age, Invoice.total disagrees with sum(line_items) on roughly 4 percent of orders, and the severity: 'urgent' cases all carry escalation_required: False. Every Pydantic check passes. Every value is the wrong value. The eval that watches the schema is blind to all three.

Structured outputs guarantee the shape. They do not guarantee the content. Pydantic answers one question: is this JSON well-formed against the model. The questions that break in production are different ones: are the per-field values correct, do the fields agree with each other, do the numbers add up. This post is the working pattern for evaluating Instructor in 2026: schema validation as the free floor, per-field semantic rubrics as the actual work, cross-field consistency as the bug nobody runs in unit tests, plus the InstructorInstrumentor that makes every call legible and the Error Feed loop that clusters failures by field.

Why schema-valid is not eval-valid

Instructor’s pitch is that you stop parsing JSON. You write a Pydantic model, pass it as response_model, and the library handles function calling, parsing, and retry-on-validation-failure. By the time the call returns you have a typed object. The shape is guaranteed.

That is also where the evaluation usually stops. The reasoning sounds clean: Pydantic already checks the output, so what is left to score. Plenty. The schema check is one assertion per field, and it is the wrong one.

Consider this object, returned for the prompt “Extract the customer from: ‘Hi, I’d like to update my email to a@b.com.’”:

Customer(
    id=2147483647,
    age=204,
    signup_date=date(1812, 4, 19),
    email="a@b.com",
    tier="enterprise",
)

Every field passes Pydantic. id is an int, age is an int, signup_date is a date, email matches EmailStr, tier is one of the four Literal values. The schema-pass-rate metric reports 100 percent. The object is wrong on four fields and the only correct one is email.

This is the gap. Schema validation is a question about syntax. The model passes syntax effortlessly because the function-calling protocol writes the JSON for it. The questions that matter are semantic: is the value plausible for the input, does it agree with the other fields, does it sum correctly when arithmetic is implied.
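The gap can be made concrete with a small sketch. It assumes a plain dataclass standing in for the Pydantic model so it runs without dependencies; the tier names and the `ungrounded_fields` helper are hypothetical, and the substring test is a deliberately crude stand-in for a real per-field judge.

```python
from dataclasses import dataclass, fields
from datetime import date

@dataclass
class Customer:
    id: int
    age: int
    signup_date: date
    email: str
    tier: str

# Hypothetical tier names; the article only says there are four Literal values.
ALLOWED_TIERS = {"free", "pro", "team", "enterprise"}

def schema_ok(c: Customer) -> bool:
    """Shape-only check: types and membership, nothing about the input."""
    return (
        isinstance(c.id, int)
        and isinstance(c.age, int)
        and isinstance(c.signup_date, date)
        and "@" in c.email
        and c.tier in ALLOWED_TIERS
    )

def ungrounded_fields(c: Customer, prompt: str) -> list[str]:
    """Crude semantic check: flag fields whose value never appears in the prompt."""
    text = prompt.lower()
    return [f.name for f in fields(c) if str(getattr(c, f.name)).lower() not in text]

prompt = "Hi, I'd like to update my email to a@b.com."
customer = Customer(2147483647, 204, date(1812, 4, 19), "a@b.com", "enterprise")

print(schema_ok(customer))                  # True
print(ungrounded_fields(customer, prompt))  # ['id', 'age', 'signup_date', 'tier']
```

The shape check is green while four of the five fields have no support in the input, which is the same split the schema-pass-rate metric hides.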

The eval stack splits into three layers that map to the three classes of question. Schema validation is free; Instructor does it. Per-field semantic rubrics are the work, one judge per field with non-trivial semantics. Cross-field consistency is the bug everyone misses, a list of deterministic assertions spanning two or more fields. The remainder of this post is each layer with the working code.

Per-field rubric design: typed fields need typed judges

A Field(min_length=10, max_length=200) constraint says the string is between 10 and 200 characters. It says nothing about whether the string is on topic. A Literal["low", "normal", "high", "urgent"] says the value is one of four strings. It says nothing about which of the four is correct for the input. The semantic check is the part you write.

The rule of thumb: any field whose correctness depends on the input gets a CustomLLMJudge. Pure-structure fields (a UUID, a fixed-format date when the format is the entire spec) do not. Anything else does.

from pydantic import BaseModel, Field
from typing import Literal
from datetime import date

class TicketDraft(BaseModel):
    summary: str = Field(min_length=10, max_length=200)
    severity: Literal["low", "normal", "high", "urgent"]
    customer_id: str
    suggested_kb_articles: list[str] = Field(default_factory=list)
    escalation_required: bool

summary needs a judge: it is free text that must reference the actual issue, and a 47-character lorem-ipsum string would pass the length check. severity needs a judge: the Literal accepts any of four values, and the right value depends on the input. customer_id does not need a judge if its format is fully specified. suggested_kb_articles needs a judge for relevance. escalation_required is the one that needs a cross-field check (next section), not a single-field judge.

The judge pattern is one rubric per field, each rubric narrow enough that the score is unambiguous. A monolithic “is this object correct” judge produces uninterpretable averages. A per-field judge tells you which field is failing, which is the diagnostic that names the fix.

from fi.evals import Evaluator
from fi.evals.templates import CustomLLMJudge

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

severity_judge = CustomLLMJudge(
    name="SeverityFidelity",
    rubric=(
        "Given the user input and the chosen severity value, score whether the "
        "severity reflects the case. 'low' for routine questions, 'normal' for "
        "single-customer issues, 'high' for outages affecting multiple users, "
        "'urgent' for safety, security, or revenue-blocking incidents. "
        "Score 1-5. 5 = correct. 1 = off by two levels or more."
    ),
    input_mapping={
        "user_input": "input",
        "severity_value": "output.severity",
    },
)

summary_judge = CustomLLMJudge(
    name="SummaryGroundedness",
    rubric=(
        "Score whether the summary references the actual issue in the input. "
        "5 = summary accurately describes the issue. 1 = summary is unrelated, "
        "generic, or contains content not in the input."
    ),
    input_mapping={
        "user_input": "input",
        "summary_value": "output.summary",
    },
)

Per-field thresholds in CI follow the same logic. SeverityFidelity runs at >= 4.5 on a 1-5 scale because miscategorising urgency has real cost. SummaryGroundedness runs at >= 4.0 because the text has more variation. A flat aggregate hides the field that is breaking; per-axis thresholds fail the build on the axis that actually drifted. One bisect instead of three.
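A small numeric sketch of why the flat aggregate misleads. The scores are made-up illustration data and `failing_axes` is a hypothetical helper, not part of any SDK; only the two thresholds come from the paragraph above.

```python
from statistics import mean

# Per-axis floors taken from the text above.
THRESHOLDS = {"SeverityFidelity": 4.5, "SummaryGroundedness": 4.0}

def failing_axes(scores: dict[str, list[float]],
                 thresholds: dict[str, float]) -> dict[str, float]:
    """Return each axis whose mean score is below its own floor."""
    failing = {}
    for axis, floor in thresholds.items():
        axis_mean = mean(scores[axis])
        if axis_mean < floor:
            failing[axis] = round(axis_mean, 2)
    return failing

# Illustrative judge scores for four rows (invented for this sketch).
scores = {
    "SeverityFidelity": [5, 4, 4, 4],      # mean 4.25, below 4.5
    "SummaryGroundedness": [5, 5, 5, 4],   # mean 4.75, above 4.0
}

flat = mean(s for axis in scores.values() for s in axis)
print(flat)                              # 4.5
print(failing_axes(scores, THRESHOLDS))  # {'SeverityFidelity': 4.25}
```

A single flat gate at, say, 4.2 would pass this run; the per-axis gate fails it and names the field.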

Cross-field consistency: the bug everyone misses

Schema validation fires per field. Cross-field bugs span fields. They are the failure mode that schema-only tests structurally cannot catch, and they are the most common silent failure in Instructor work.

Three shapes recur. The arithmetic shape: a derived field disagrees with its sources (Invoice.total != sum(line_items.price), Order.shipping_total != sum(packages.shipping_cost)). The temporal shape: a date-range invariant breaks (booking.end_date < booking.start_date, subscription.canceled_at < subscription.started_at). The logical-implication shape: a flag and a category disagree (severity == 'urgent' but escalation_required is False, status == 'shipped' but tracking_number is None).

Every one of these passes per-field Pydantic. Every one of these breaks the downstream consumer that assumed the relationship held. Every one of these is one line of Python to check.

from pydantic import BaseModel, Field, model_validator
from typing import Literal
from datetime import date

class LineItem(BaseModel):
    sku: str
    quantity: int = Field(gt=0)
    unit_price: float = Field(ge=0)

class Invoice(BaseModel):
    invoice_id: str
    customer_id: str
    line_items: list[LineItem]
    total: float = Field(ge=0)
    issued_at: date
    due_at: date

    @model_validator(mode="after")
    def _consistency(self) -> "Invoice":
        derived = round(sum(li.quantity * li.unit_price for li in self.line_items), 2)
        if abs(self.total - derived) > 0.01:
            raise ValueError(
                f"total {self.total} disagrees with sum(line_items) {derived}"
            )
        if self.due_at < self.issued_at:
            raise ValueError(f"due_at {self.due_at} precedes issued_at {self.issued_at}")
        return self

@model_validator(mode="after") runs after every field has been validated, sees the whole object, and raises on any cross-field invariant. Instructor catches the ValueError and retries the LLM with the validator message in the prompt, so the model has the chance to fix its arithmetic before the call returns. The check is deterministic, runs at parse time, and pulls a non-trivial fraction of production failures back inside Instructor’s existing retry loop where they belong.
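To make the retry behaviour easier to picture, here is a toy simulation of a validate-and-retry loop. It is not Instructor's implementation: `create_with_retries`, `validate_invoice`, and `fake_model` are hypothetical stand-ins, and the "model" is a stub that only fixes its total once it has seen the validator message.

```python
from typing import Callable

def validate_invoice(payload: dict) -> dict:
    """Same arithmetic invariant as the Invoice validator, on a plain dict."""
    derived = round(
        sum(li["quantity"] * li["unit_price"] for li in payload["line_items"]), 2
    )
    if abs(payload["total"] - derived) > 0.01:
        raise ValueError(
            f"total {payload['total']} disagrees with sum(line_items) {derived}"
        )
    return payload

def create_with_retries(generate: Callable[[list[str]], dict],
                        max_retries: int = 2) -> tuple[dict, int]:
    """Call the generator, feeding validator errors back in until it passes."""
    feedback: list[str] = []
    for attempt in range(1, max_retries + 2):  # first try plus max_retries
        candidate = generate(feedback)
        try:
            return validate_invoice(candidate), attempt
        except ValueError as exc:
            feedback.append(str(exc))  # the next attempt sees this message
    raise RuntimeError(f"still invalid after {max_retries} retries: {feedback[-1]}")

def fake_model(feedback: list[str]) -> dict:
    """Stub LLM: wrong total on the first try, corrected once it has feedback."""
    items = [{"quantity": 2, "unit_price": 49.99}]
    return {"line_items": items, "total": 99.98 if feedback else 199.99}

invoice, attempts = create_with_retries(fake_model)
print(attempts, invoice["total"])  # 2 99.98
```

The attempt count returned here is the number worth logging: the caller sees one successful call, while the bill sees two.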

For consistency rules that need a model (a free-text field that should be consistent with a structured field, for instance), the check still runs deterministically at eval time as an assertion, not as a judge. Save the judge calls for the genuinely fuzzy questions; the deterministic vs LLM-judge evals split makes the case for layering both.

def cross_field_checks(obj: TicketDraft) -> list[str]:
    errors = []
    if obj.severity in {"high", "urgent"} and not obj.escalation_required:
        errors.append(
            f"severity={obj.severity} implies escalation_required=True"
        )
    if obj.severity == "low" and obj.escalation_required:
        errors.append("severity=low contradicts escalation_required=True")
    if obj.suggested_kb_articles and obj.severity == "urgent":
        # urgent cases should not be self-serve-routed to KB
        errors.append("urgent cases should not return suggested_kb_articles")
    return errors

Run this as a deterministic pre-pass. Rows that fail short-circuit straight to the failure bucket without touching a judge. The CI report names the assertion that fired, which is the artifact that names the fix.
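One way to sketch that pre-pass is a table of named assertions plus a router. The dataclass below stands in for the Pydantic `TicketDraft` so the example has no dependencies; the assertion names and the `route` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TicketDraft:
    summary: str
    severity: str
    customer_id: str
    escalation_required: bool
    suggested_kb_articles: list[str] = field(default_factory=list)

# Each entry is (name, predicate that must hold). Names are illustrative.
ASSERTIONS: list[tuple[str, Callable[[TicketDraft], bool]]] = [
    ("high_or_urgent_implies_escalation",
     lambda t: t.severity not in {"high", "urgent"} or t.escalation_required),
    ("low_forbids_escalation",
     lambda t: not (t.severity == "low" and t.escalation_required)),
    ("urgent_forbids_kb_routing",
     lambda t: not (t.severity == "urgent" and t.suggested_kb_articles)),
]

def route(rows: list[TicketDraft]):
    """Split rows into (failed with fired assertion names, rows for the judges)."""
    failed, judge_queue = [], []
    for row in rows:
        fired = [name for name, holds in ASSERTIONS if not holds(row)]
        if fired:
            failed.append((row, fired))
        else:
            judge_queue.append(row)
    return failed, judge_queue

rows = [
    TicketDraft("Login page returns 500 for all users", "urgent", "c-1", False),
    TicketDraft("How do I export my invoices as CSV?", "low", "c-2", False, ["kb-export"]),
]
failed, judge_queue = route(rows)
print([names for _, names in failed])        # [['high_or_urgent_implies_escalation']]
print([r.customer_id for r in judge_queue])  # ['c-2']
```

Keeping the assertions as named entries means the failure bucket can be grouped by name without parsing error strings.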

Numeric correctness: LLMs lie inside well-typed floats

The most under-rated failure mode on structured outputs is numeric drift. The model is sampling tokens, not computing arithmetic, and a Field(ge=0, le=120) constraint on age says only that the number is between 0 and 120. The model can return 35 every time and pass every check. On prompts that never mention an age, that is what it tends to do — the average customer in its training corpus is around there.

Three checks blunt the problem. The first two are deterministic; only the third calls an LLM judge.

Tighten the range. Pydantic constraints are typically set by what is physically possible (age <= 120). The constraint that catches drift is set by what is plausible for this prompt. A user-onboarding flow rarely sees customers under 13 or over 90, so Field(ge=13, le=90) is a sharper bound. The narrower the range, the more drift it catches at parse time.

Cross-field arithmetic. When one numeric field is derivable from others, derive it and assert equality. Invoice.total == sum(line_items.quantity * line_items.unit_price). Order.discount_amount == subtotal * discount_percent / 100. Trip.duration_days == (end_date - start_date).days. Each of these is one line, runs at parse time, and catches a class of failure no LLM judge will reproduce reliably.

Numeric plausibility judge. For numbers that are not derivable, a narrow CustomLLMJudge scores whether the value is plausible for the input. Same pattern as the per-field semantic judges, except the score weights statistical realism rather than enum correctness.

from pydantic import BaseModel, Field, model_validator

# LineItem is the model defined in the Invoice example above.
class Order(BaseModel):
    items: list[LineItem]
    subtotal: float = Field(ge=0)
    discount_percent: float = Field(ge=0, le=100)
    discount_amount: float = Field(ge=0)
    shipping_cost: float = Field(ge=0)
    total: float = Field(ge=0)

    @model_validator(mode="after")
    def _math(self) -> "Order":
        # Each message carries the derived value so a retry can correct against it.
        derived_subtotal = round(
            sum(li.quantity * li.unit_price for li in self.items), 2
        )
        if abs(self.subtotal - derived_subtotal) > 0.01:
            raise ValueError(f"subtotal != sum(items): {self.subtotal} vs {derived_subtotal}")
        derived_discount = round(self.subtotal * self.discount_percent / 100, 2)
        if abs(self.discount_amount - derived_discount) > 0.01:
            raise ValueError(
                "discount_amount != subtotal * discount_percent / 100: "
                f"{self.discount_amount} vs {derived_discount}"
            )
        derived_total = round(self.subtotal - self.discount_amount + self.shipping_cost, 2)
        if abs(self.total - derived_total) > 0.01:
            raise ValueError(
                "total != subtotal - discount_amount + shipping_cost: "
                f"{self.total} vs {derived_total}"
            )
        return self

Three lines of arithmetic, three failure modes that schema-only validation will never see. The model_validator raises, Instructor retries with the error in the prompt, the model usually fixes its arithmetic on attempt two. The retries do show up in your bill, so track them per prompt template to keep the cost surface visible.
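The range-tightening check and the `Trip.duration_days` example from the list above can also run as post-parse assertions. This is a minimal sketch on plain values; `PLAUSIBLE_RANGES`, `range_violations`, and `duration_violation` are hypothetical names, and the 13 to 90 range is the onboarding example from the text.

```python
from datetime import date
from typing import Optional

# Flow-specific plausible ranges, tighter than the physical Pydantic bounds.
PLAUSIBLE_RANGES = {"age": (13, 90)}

def range_violations(values: dict[str, float],
                     ranges: dict[str, tuple[float, float]]) -> list[str]:
    """Flag numeric fields that are schema-valid but implausible for this flow."""
    errors = []
    for name, (low, high) in ranges.items():
        value = values.get(name)
        if value is not None and not (low <= value <= high):
            errors.append(f"{name}={value} outside plausible range [{low}, {high}]")
    return errors

def duration_violation(start: date, end: date, duration_days: int) -> Optional[str]:
    """Derive the duration from the dates and compare with the stated field."""
    derived = (end - start).days
    if duration_days != derived:
        return f"duration_days={duration_days} but (end_date - start_date).days={derived}"
    return None

print(range_violations({"age": 104}, PLAUSIBLE_RANGES))
# ['age=104 outside plausible range [13, 90]']
print(range_violations({"age": 35}, PLAUSIBLE_RANGES))
# []
print(duration_violation(date(2026, 8, 10), date(2026, 8, 15), 7))
# duration_days=7 but (end_date - start_date).days=5
```

Note that the defaulted age of 35 still passes the tightened range, which is why the plausibility judge remains in the stack for values that cannot be derived.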

traceAI’s InstructorInstrumentor: what the spans actually carry

You cannot evaluate what you cannot see. The starting point for any production Instructor program is making every call legible as an OTel span tree. The InstructorInstrumentor (verified at traceAI/python/frameworks/instructor/traceai_instructor/__init__.py) does it with one call after the standard register.

pip install instructor traceai-instructor ai-evaluation

import os
os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_instructor import InstructorInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="instructor-app",
    project_version_name="v1.4.0",
)
InstructorInstrumentor().instrument(tracer_provider=trace_provider)

The instrumentor wraps instructor.core.client.Instructor.create and AsyncInstructor.create, plus the legacy instructor.patch entry. Every patched call emits a span with fi.span.kind=TOOL. The request kwargs (model, response_model, messages, max_retries) serialize as the input.value JSON; the parsed Pydantic instance serializes as the output.value JSON in application/json MIME. The underlying LLM round trips ride on the OpenAI or Anthropic instrumentor below this one in the tree — token counts, per-attempt latency, and the raw completion content live there. If ai-evaluation is installed, the Protect guardrail wrapper attaches automatically so any Protect.protect call inside a tool inherits the span tree.

Instructor 1.x runs retries inside the same create call, so one logical structured-output operation appears as one parent TOOL span with the final parsed object on output.value and the underlying provider attempts nested below. That is the right granularity for evaluation aggregation — you score per logical call, not per HTTP round trip.

Pair the instrumentor with the EvalTag mechanism on the tracer and the per-field judges attach server-side. SeverityFidelity reads only spans whose output.value carries a severity field; total != sum(line_items) rows route to the consistency failure bucket. No application-side polling, no inline latency, and the production stream gets the same rubric stack as the offline regression set.
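As an illustration of the routing idea (not of EvalTag itself, whose configuration is not shown here), the sketch below filters exported spans by whether the parsed output carries a given field. The attribute keys `fi.span.kind` and `output.value` come from the description above; the surrounding dict shape (`span_id`, `attributes`) and the `spans_with_field` helper are assumptions made for the example.

```python
import json

def spans_with_field(spans: list[dict], field_name: str) -> list[dict]:
    """Select TOOL spans whose output.value JSON object contains field_name."""
    selected = []
    for span in spans:
        attrs = span.get("attributes", {})
        if attrs.get("fi.span.kind") != "TOOL":
            continue  # provider spans nested below are scored elsewhere
        try:
            output = json.loads(attrs.get("output.value") or "")
        except json.JSONDecodeError:
            continue  # no parsed object on this span
        if isinstance(output, dict) and field_name in output:
            selected.append({"span_id": span["span_id"], field_name: output[field_name]})
    return selected

# Hypothetical exported spans, shaped for this sketch only.
spans = [
    {"span_id": "a1", "attributes": {
        "fi.span.kind": "TOOL",
        "output.value": json.dumps({"severity": "urgent", "escalation_required": False}),
    }},
    {"span_id": "b2", "attributes": {
        "fi.span.kind": "TOOL",
        "output.value": json.dumps({"invoice_id": "inv-9", "total": 199.99}),
    }},
    {"span_id": "c3", "attributes": {"fi.span.kind": "LLM"}},
]

print(spans_with_field(spans, "severity"))
# [{'span_id': 'a1', 'severity': 'urgent'}]
```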

Production patterns: what the loop looks like end to end

The order of operations that works on Instructor projects.

Layer the gates cheapest-first. Pydantic schema validation and model_validator cross-field checks run at parse time inside Instructor at no token cost. Deterministic post-parse assertions (range tightening, regex checks) run next. Per-field CustomLLMJudge calls run last, only on rows the deterministic gates let through. Judges are the expensive token line item; push as much catch as possible to the free layers.

Build the golden set from production retries. 300 rows where each is a prompt that produced two or more attempts in the last 30 days of live traffic. That is where the model is closest to its limits and where the rubric needs the most resolution. Generic synthetic data underweights the prompts that actually fail.
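A minimal sketch of that selection, assuming each production row is a dict with hypothetical `attempts` and `timestamp` keys. Seeded sampling keeps the set reproducible between runs.

```python
import random
from datetime import datetime, timedelta, timezone

def build_golden_set(rows: list[dict], now: datetime, size: int = 300,
                     window_days: int = 30, seed: int = 0) -> list[dict]:
    """Keep rows that needed two or more attempts inside the window, then sample."""
    cutoff = now - timedelta(days=window_days)
    candidates = [
        r for r in rows if r["attempts"] >= 2 and r["timestamp"] >= cutoff
    ]
    if len(candidates) <= size:
        return candidates
    return random.Random(seed).sample(candidates, size)

# Synthetic traffic for illustration only.
now = datetime(2026, 5, 20, tzinfo=timezone.utc)
rows = [
    {"prompt": f"p{i}", "attempts": 1 + i % 3, "timestamp": now - timedelta(days=i % 60)}
    for i in range(3000)
]

golden = build_golden_set(rows, now)
print(len(golden))  # 300
```

Uniform sampling is the simplest choice; stratifying by prompt template is a reasonable refinement when a few templates dominate the retry traffic.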

Score offline and live with the same rubrics. Offline: evaluator.evaluate(...) over the golden set, gated in CI on per-axis thresholds. Live: EvalTag rules on the tracer apply the same rubrics to sampled production spans. A regression that escapes CI gets caught by the live stream within hours, not weeks.

Watch retries as a cost signal, not only a reliability signal. Instructor’s retry-on-validation-failure default is good for uptime and bad for the bill. A prompt template whose mean attempt count drifts from 1.1 to 2.4 has roughly doubled in cost without changing in correctness. The attempts are visible as the nested provider spans below the parent TOOL span. Chart the distribution per response_model; alert on shifts.
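The cost check can be sketched as a comparison of mean attempts per template between two windows. The call records, the `template_id` and `attempts` keys, and the 1.25 alert ratio are all assumptions for illustration; in practice the attempt count would be derived from the nested provider spans.

```python
from collections import defaultdict
from statistics import mean

def mean_attempts(calls: list[dict]) -> dict[str, float]:
    """Mean number of provider attempts per logical call, grouped by template."""
    by_template = defaultdict(list)
    for call in calls:
        by_template[call["template_id"]].append(call["attempts"])
    return {template: mean(counts) for template, counts in by_template.items()}

def cost_regressions(baseline: dict[str, float], current: dict[str, float],
                     max_ratio: float = 1.25) -> dict[str, float]:
    """Templates whose mean attempts grew by more than max_ratio."""
    return {
        t: round(current[t] / baseline[t], 2)
        for t in current
        if t in baseline and current[t] / baseline[t] > max_ratio
    }

def calls_for(template_id: str, attempts: list[int]) -> list[dict]:
    return [{"template_id": template_id, "attempts": a} for a in attempts]

# Invented traffic: ticket_v3 drifts from a mean of 1.1 to 2.4 attempts.
last_week = calls_for("ticket_v3", [1] * 9 + [2]) + calls_for("invoice_v2", [1, 1, 1, 2])
this_week = calls_for("ticket_v3", [2, 3, 2, 3, 2, 2, 3, 2, 3, 2]) + calls_for("invoice_v2", [1, 1, 2, 1])

print(cost_regressions(mean_attempts(last_week), mean_attempts(this_week)))
# {'ticket_v3': 2.18}
```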

from fi.evals import Evaluator
from fi.evals.templates import (
    LLMFunctionCalling,
    Completeness,
    CustomLLMJudge,
)

evaluator = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY)

cases = [
    {
        "input": row["prompt"],
        "output": row["instructor_object"].model_dump_json(),
        "expected_output": row["golden_object"].model_dump_json(),
        "metadata": {"prompt_template": row["template_id"]},
    }
    for row in production_replay_300
]

results = evaluator.evaluate(
    eval_templates=[
        LLMFunctionCalling(),
        Completeness(),
        severity_judge,
        summary_judge,
    ],
    inputs=cases,
)

The CI gate is LLMFunctionCalling >= 0.90, SeverityFidelity >= 4.5 (1-5), SummaryGroundedness >= 4.0, Completeness >= 0.85, and zero cross-field consistency violations on the golden set. The build fails on the axis that broke. The fix lands on the field the axis names.
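The gate itself is a few lines once the per-axis means are in hand. This sketch assumes the means have already been aggregated from the evaluation results (the shape of `results` is not shown in this post, so that step is left out); `ci_gate` and the example run are hypothetical, while the floors are the ones listed above.

```python
# Floors from the paragraph above; the 1-5 axes and 0-1 axes keep their own scales.
GATE = {
    "LLMFunctionCalling": 0.90,
    "SeverityFidelity": 4.5,
    "SummaryGroundedness": 4.0,
    "Completeness": 0.85,
}

def ci_gate(axis_scores: dict[str, float], consistency_violations: int) -> list[str]:
    """Return one failure line per broken axis; an empty list means the build passes."""
    failures = []
    for axis, floor in GATE.items():
        score = axis_scores.get(axis)
        if score is None:
            failures.append(f"{axis}: no score reported")
        elif score < floor:
            failures.append(f"{axis}: {score:.2f} < {floor:.2f}")
    if consistency_violations > 0:
        failures.append(f"cross-field consistency: {consistency_violations} violation(s)")
    return failures

# Invented per-axis means for one CI run.
run = {
    "LLMFunctionCalling": 0.93,
    "SeverityFidelity": 4.31,
    "SummaryGroundedness": 4.2,
    "Completeness": 0.88,
}
failures = ci_gate(run, consistency_violations=2)
for line in failures:
    print("FAIL", line)
exit_code = 1 if failures else 0  # hand this to sys.exit() in a real CI script
```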

Error Feed clusters you will actually see

CI is necessary, not sufficient. A 300-case offline set is a snapshot; production is a river. Error Feed sits inside the eval stack and turns the river into named clusters with proposed fixes.

The pipeline is HDBSCAN soft-clustering over span embeddings in ClickHouse, then a Claude Sonnet 4.5 JudgeAgent with a 30-turn budget across eight span-tools (plus a Haiku Chauffeur for spans over 3000 characters; prompt-cache hit ratio around 90 percent). Per cluster, the Judge writes a 5-category 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each), and an immediate_fix naming the change to ship today.

On Instructor workloads the clusters tend to be field-shaped, which is the entire point of scoring per field:

Per-field semantic drift. The severity enum keeps skewing to urgent on a prompt template, or tier skews to enterprise on inputs that mention no company. immediate_fix is a tightened rubric example, a one-shot, or a Pydantic Field description rewrite.
Cross-field consistency violations. Invoice.total disagrees with sum(line_items) on orders over 1,000 dollars; start_date >= end_date on multi-month bookings. immediate_fix is a new @model_validator assertion that turns the silent failure into a retry.
Numeric drift inside the range. Customer.age returning 35 when the input never mentioned age; Order.shipping_cost returning a flat 10 on every domestic order. immediate_fix is a narrower Field constraint or a plausibility judge.
Completeness drift. Optional fields that should have been populated coming back empty on a sub-template the model now treats as out of scope. immediate_fix is usually a few-shot example or a prompt-template clarification.

Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. Clustering and the Judge’s immediate_fix are not gated on Linear; the value is the named cluster and the fix. Representative cluster rows promote into the regression set, so every escaped failure becomes a test the next CI run catches.

How Future AGI ships the Instructor eval stack

Three surfaces, one loop.

ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60+ EvalTemplate classes (TaskCompletion, EvaluateFunctionCalling aliased as LLMFunctionCalling, AnswerRefusal, Completeness, Groundedness, ChunkAttribution, ContextAdherence, plus 11 CustomerAgent* templates), CustomLLMJudge for the per-field rubrics, and 20+ local heuristic metrics that run sub-second with zero API cost.

traceAI (Apache 2.0) ships the InstructorInstrumentor with one-call setup, the OpenAI and Anthropic instrumentors that nest below it, the Protect guardrail wrapper that attaches when ai-evaluation is installed, and 50+ other instrumentors across Python, TypeScript, Java, and C#. EvalTag attaches per-field rubrics to spans for server-side scoring.

Future AGI Platform ships self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside with HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact.

agent-opt closes the loop. The six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume the per-field judge scores as separate study targets, so the prompt and the few-shot examples tune against the field that actually broke. The trace-stream-to-agent-opt connector is the active roadmap item; the eval-driven path ships today.

Honest tradeoff: if your response_model has three fields and no cross-field relationships, schema-only testing is enough. The three-layer stack earns its weight when the model has enough fields, enough enums, and enough numeric arithmetic that schema-pass-rate stops being a useful number. The same per-field discipline carries over to native structured output modes when you are not on Instructor.

Anti-patterns to avoid

Four mistakes that recur on real Instructor evaluation programs.

Schema-validity-only as the eval metric. The schema pass rate is a floor metric. Treating it as the ceiling produces a system that ships wrong values confidently because every test is green. Add at least one per-field judge per response_model.

Monolithic “is this object correct” judges. A single 1-5 score across the whole object is uninterpretable. Did severity break or did summary? Per-field judges produce per-field signal; per-field signal names the fix.

No cross-field assertions. The bug that schema-only evaluation cannot see is the bug that spans fields. Every response_model with arithmetic, date ranges, or implication relationships needs @model_validator cross-field checks. They are deterministic, free, and they pull failures back into Instructor’s retry loop.

Confusing retries with success. Instructor’s max-retries default keeps the application running through validation failures. It is good for reliability and bad for the bill. A prompt template whose retry distribution shifted is a cost regression even when the final pass rate is unchanged.

What to do this week

Five steps, one response_model.

1. Wire InstructorInstrumentor().instrument(tracer_provider=trace_provider) into your project. Verify the TOOL span shows up with the parsed object on output.value and the underlying provider span nested below.
2. Add @model_validator(mode="after") to one response_model and write the cross-field assertions for that model. Ship them. Watch the retry rate.
3. Build a 300-row golden set from the last 30 days of production retries. Tag each row with the response_model and the prompt template.
4. Define one CustomLLMJudge per field with non-trivial semantics. Run LLMFunctionCalling, Completeness, and the per-field judges over the golden set. Wire per-axis CI thresholds.
5. Turn on Error Feed. Watch the first week’s clusters. Promote representative rows into the regression set. Run a BayesianSearchOptimizer study on the prompt template whose field-level judge scored worst.

The teams shipping reliable Instructor applications in 2026 stopped reporting schema-pass-rate and started reporting per-field judge scores. The library gives you the shape for free. The eval stack gives you the content, the consistency, and the cost.

Frequently asked questions

Instructor already validates outputs with Pydantic. Why do I need a separate evaluation layer?

Pydantic validates the shape of the object. It does not validate the content. A `Customer(id=2147483647, age=204, signup_date=date(1812, 4, 19))` passes every type check, every Literal constraint, every Field range, and is still wrong. Schema validation answers the easy question: did the model return JSON that matches the spec. The questions that fail in production are different: did the model pick the correct enum value, did the numeric field carry a value that makes sense for the input, do `start_date` and `end_date` line up, does `total` actually equal the sum of `line_items`. Pydantic gives you the floor for free. The eval that matters is the ceiling: per-field semantic rubrics, cross-field consistency assertions, and numeric range checks that catch values that are wrong in ways Instructor cannot see.

What is the right eval stack for an Instructor application?

Three layers. Schema validation is free — Instructor and Pydantic do it on every call. Per-field semantic rubrics are the work — a `CustomLLMJudge` rubric per field with non-trivial semantics (enums where the choice depends on context, numeric ranges that depend on input, free-text fields that must reference the actual user request). Cross-field consistency assertions are the bug everyone misses — pure-Python checks that two fields agree (`start_date < end_date`, `total == sum(line_items)`, `severity in {high, urgent}` implies `escalation_required is True`). The deterministic checks run as a pre-pass and short-circuit the expensive judge calls. The judge runs only on rows the deterministic gates let through.

How does traceAI instrument Instructor?

The `InstructorInstrumentor` (verified at `traceAI/python/frameworks/instructor/traceai_instructor/_wrappers.py`) wraps `instructor.core.client.Instructor.create` and `AsyncInstructor.create` via `wrap_function_wrapper`, plus the legacy `instructor.patch` entry point. Every patched `create` call emits an OTel span with `fi.span.kind=TOOL`, the request kwargs (model, response_model class, messages, max_retries) serialized as the `input.value` JSON, and the parsed Pydantic instance serialized as the `output.value` JSON. Because Instructor 1.x runs retries inside the same `create` call, the wrapper captures one logical structured-output span per user-issued call, with the final parsed object on `output.value`. The underlying LLM round trips ride on the OpenAI or Anthropic instrumentor whose spans nest inside; that is where token counts and per-attempt latency live. The Protect guardrail wrapper attaches when `ai-evaluation` is installed.

Which Future AGI evaluation templates apply to Instructor outputs?

Three carry the load. `EvaluateFunctionCalling` (aliased as `LLMFunctionCalling`) scores whether the field assignments match user intent — the right rubric when the structured output is functionally a tool call. `CustomLLMJudge` carries the per-field rubrics: one judge per field with non-trivial semantics, each scoring on the same 1-5 scale with reasoning. `Completeness` scores whether optional fields that should have been populated are populated. Layer `Groundedness` and `ChunkAttribution` when a field is a model-generated summary of retrieved context. Cross-field consistency is deterministic and stays in pure Python — there is no LLM judge for `total == sum(line_items)`; the assertion is the rubric.

What is the cross-field consistency bug and why do schema-only evals miss it?

A schema check fires per field. A consistency bug spans two or more fields. An invoice with `total: 199.99` and `line_items: [{price: 49.99}, {price: 49.99}]` passes every Pydantic check on every field individually (the floats are floats, the list is a list, the values are inside Field constraints), and the invoice is still arithmetically wrong by 100.01 dollars, because the line items sum to 99.98. A booking with `start_date: 2026-08-15` and `end_date: 2026-08-10` passes per-field type checks and represents a trip that ends before it starts. A medical triage row with `severity: 'urgent'` and `escalation_required: False` passes both Literal checks and contradicts itself. The fix is a pre-judge deterministic gate: a list of cross-field assertions per response_model, each one a one-line Python expression that fails fast. Cheap to write, cheap to run, catches the failures nobody sees in unit tests.
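One way to hold such a list is a registry keyed by model class. The sketch below covers the three examples from this answer. Dataclasses stand in for the Pydantic response_models, and `ASSERTIONS`, `failed_assertions`, and the half-cent tolerance are illustrative choices rather than part of any library.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class LineItem:
    price: float


@dataclass
class Invoice:
    total: float
    line_items: list[LineItem]


@dataclass
class Booking:
    start_date: date
    end_date: date


@dataclass
class Triage:
    severity: str
    escalation_required: bool


# One list of (name, one-line check) pairs per response_model.
ASSERTIONS = {
    Invoice: [
        # Floats are compared with a tolerance (assumed: half a cent).
        ("total_matches_line_items",
         lambda r: math.isclose(r.total, sum(i.price for i in r.line_items), abs_tol=0.005)),
    ],
    Booking: [
        ("start_before_end", lambda r: r.start_date < r.end_date),
    ],
    Triage: [
        # "A implies B" written as "not A or B".
        ("urgent_implies_escalation",
         lambda r: r.severity not in {"high", "urgent"} or r.escalation_required),
    ],
}


def failed_assertions(row) -> list[str]:
    """Return the names of every cross-field assertion the row violates."""
    return [name for name, check in ASSERTIONS.get(type(row), []) if not check(row)]


invoice = Invoice(total=199.99, line_items=[LineItem(49.99), LineItem(49.99)])
booking = Booking(start_date=date(2026, 8, 15), end_date=date(2026, 8, 10))
triage = Triage(severity="urgent", escalation_required=False)

for row in (invoice, booking, triage):
    print(type(row).__name__, failed_assertions(row))
```

All three rows are schema-valid and all three fail their assertion. With real Pydantic models the same expressions could also live in a model-level validator, which would turn the failure into a validation error instead of an eval result.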

Why do LLMs lie inside well-typed floats and ints?

The model is sampling tokens, not computing arithmetic. A field typed `float` constrains the surface but not the value. On a `Customer.age` field with `Field(ge=0, le=120)`, the model can return any value in that range, and on prompts where the input never mentions an age it tends to return a confidently wrong number like 35, plausibly because that is close to the typical customer age in its training data. The same dynamic shows up on `Invoice.total`, `Order.shipping_cost`, `Prescription.dosage_mg`, and any other numeric field. Three checks blunt the problem: a per-field range assertion tighter than the Pydantic constraint, a per-field judge that scores plausibility against the input, and a cross-field arithmetic check where one numeric field is derivable from others. Numeric drift inside a well-typed range is the failure that schema validation will never see.
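The first of those checks, plus a cheap deterministic proxy for the second, might look like the sketch below. The 18 to 100 business range and the helper names `numbers_in` and `check_age` are assumptions made for the example. The "number appears in the input" test is a stand-in for the judge, not a replacement for it.

```python
import re
from typing import Optional


def numbers_in(text: str) -> set[float]:
    """Collect every numeric literal that appears in the user input."""
    return {float(m) for m in re.findall(r"\d+(?:\.\d+)?", text)}


def check_age(age: Optional[int], user_input: str) -> list[str]:
    """Flag ages that are schema-valid but implausible or unsupported by the input."""
    problems: list[str] = []
    if age is None:
        return problems  # nothing extracted, nothing to verify
    # Tighter than Field(ge=0, le=120): hypothetical rule for adult customers.
    if not 18 <= age <= 100:
        problems.append("age_outside_business_range")
    # Proxy for plausibility: the value should be traceable to the input text.
    if float(age) not in numbers_in(user_input):
        problems.append("age_not_in_input")
    return problems


print(check_age(35, "Customer Jane wants a refund for order 1182."))
print(check_age(35, "I'm 35 and my card was charged twice."))
```

The first call flags the invented age; the second passes because 35 appears in the input. An age written out in words would be flagged incorrectly, which is one reason the per-field judge still runs on rows that pass the deterministic gate.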

How does Error Feed cluster Instructor failures?

Error Feed sits inside the eval stack. HDBSCAN soft-clustering runs over the failing-row span embeddings in ClickHouse, then a Claude Sonnet 4.5 JudgeAgent with a 30-turn budget and eight span-tools writes one `immediate_fix` per cluster. For Instructor workloads the clusters tend to be field-shaped: a per-field semantic-drift cluster (the `severity` enum keeps skewing to `urgent` on a particular prompt template), a cross-field-consistency cluster (the `total` field disagrees with `sum(line_items)` on invoices over 1000 dollars), a numeric-out-of-range cluster (`age` returning 35 when the input never mentioned age), and a completeness cluster (optional fields that should have been populated coming back empty). The Judge writes the 5-category, 30-subtype taxonomy entry, the 4-D trace score (`factual_grounding`, `privacy_and_safety`, `instruction_adherence`, `optimal_plan_execution`; 1-5 each), and the `immediate_fix`, which on Instructor work is usually a tightened Field constraint, a new Pydantic validator, a cross-field assertion to add, or a one-shot example to inject. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are roadmap.

