Evaluating LLM Structured Output Modes (2026)
Compare OpenAI strict, Anthropic JSON, Gemini schema, and Outlines grammar-constrained generation: schema-validity rate, quality tax, failure modes.
An extraction pipeline ships on OpenAI strict with a 99.4 percent schema-validity rate. Every JSON parses. Every Field constraint holds. Three weeks later the support team finds out that priority: 'urgent' was the chosen value on roughly 11 percent of inbound tickets when the right answer was 'normal' or 'high'. The model had collapsed to a safer default to keep the grammar happy. The schema check never had a chance.
Structured output modes guarantee the schema, not the quality. OpenAI strict, Anthropic JSON, Gemini responseSchema, and Outlines / JSONFormer for grammar-constrained generation all produce parseable output. They do not produce the same answer the model would have produced if you had let it write English. Each mode carries its own quality tax: sometimes zero, sometimes fifteen percent on hard prompts. The only eval that matters is schema_validity_rate × semantic_quality_on_passes, measured on your data, per mode. This post is the working pattern for that measurement, with the failure-mode catalog and the production loop that catches the tax before it ships.
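As a minimal sketch of that combined metric, assume each golden-set row records whether the output validated and, for rows that validated, a judge score normalised to 0-1. The `Row` type and its field names are hypothetical, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass
class Row:
    parsed: bool               # did the output pass schema validation?
    judge_score: float = 0.0   # semantic score in [0, 1]; ignored when parsed is False


def combined_score(rows: list[Row]) -> dict[str, float]:
    """Return schema validity, semantic quality on passing rows, and their product."""
    if not rows:
        raise ValueError("need at least one row")
    passes = [r for r in rows if r.parsed]
    validity = len(passes) / len(rows)
    # Quality is averaged over passing rows only, so a parse failure is not counted twice.
    quality = sum(r.judge_score for r in passes) / len(passes) if passes else 0.0
    return {"validity": validity, "quality": quality, "combined": validity * quality}


# Illustrative rows for one (mode, template) cell; the numbers are made up.
example_rows = [Row(True, 1.0), Row(True, 0.5), Row(True, 0.75), Row(False)]
example_score = combined_score(example_rows)
```

Compute one such score per mode and per template; a mode with a higher parse rate can still lose on the product.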
Why mode comparison matters
Most teams pick a structured-output mode the way they pick a font. OpenAI ships strict, so the OpenAI client uses strict. Claude has a tool-use surface, so the Claude code path uses tool-use. Gemini has responseSchema, so Gemini gets responseSchema. The headline parse rate hits 99 and the code ships.
That pattern fails because the four modes are different systems with different failure profiles. The schema-validity numbers converge near 100. The semantic-quality numbers do not, and the gap is the entire post.
Three reasons mode choice deserves its own eval pass. First, the LLM function calling deep dive is the closest companion pattern, but it only covers tool use; you can have structured output without a tool call, and the modes diverge there. Second, the failure shape described in the Agent Passes Evals Fails Production post often turns out to be mode-driven: a constrained decoder degrades silently on prompt templates the eval set under-samples. Third, retries are the silent cost line item. Outlines refusals, OpenAI strict’s occasional parse failures, and Instructor-style retry-on-validation-error loops all compound into a bill your dashboard does not surface unless you instrument for it.
The four modes: mechanics and guarantees
Read these as four implementations of the same interface, not four flavors of the same thing.
OpenAI strict (response_format={"type": "json_schema", "json_schema": {"name": ..., "strict": true, "schema": ...}}) compiles the supplied JSON Schema into a constrained-decoding grammar at the inference layer. Every token sampled is filtered through a mask that keeps the partial output schema-legal. Guarantee: if the call returns, the JSON parses. Failure shape: optional fields silently dropped under the required-by-default behavior, refusals on schemas the decoder cannot represent (recursive types, dynamic property names), and on hard prompts the model’s preferred token gets masked out and the answer collapses to a safer enum default.
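The request shape can be sketched as follows. This assumes the envelope documented for OpenAI Structured Outputs, where the schema sits under a json_schema key and strict mode expects every object to list all properties as required with additionalProperties set to false. The helper name and the ticket schema are hypothetical; optional fields are expressed here as nullable types.

```python
import copy


def to_strict_response_format(name: str, schema: dict) -> dict:
    """Wrap a JSON Schema in a strict-mode response_format envelope (sketch)."""

    def harden(node):
        # Walk the schema; mark every object as closed with all properties required.
        if isinstance(node, dict):
            if node.get("type") == "object" and "properties" in node:
                node["required"] = list(node["properties"])
                node["additionalProperties"] = False
            for value in node.values():
                harden(value)
        elif isinstance(node, list):
            for item in node:
                harden(item)

    strict_schema = copy.deepcopy(schema)  # never mutate the caller's schema
    harden(strict_schema)
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": strict_schema},
    }


# Hypothetical ticket schema; "assignee" is optional, so it is modelled as nullable.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "normal", "high", "urgent"]},
        "assignee": {"type": ["string", "null"]},
    },
}

response_format = to_strict_response_format("ticket", TICKET_SCHEMA)
```

The resulting dict is what you would pass as response_format on the chat completions call. The helper does not check for the schema features strict mode rejects, so treat a 400 from the API as the real validator.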
Anthropic JSON mode has two surfaces. The simple one is prompt-shaped: you ask the model to return JSON in the system prompt and parse the response. There is no token-level constraint; the guarantee is “best effort” and the failure shape is near-JSON output (trailing commentary, missing braces, comments). The richer surface is tool use, where input_schema becomes the contract and Anthropic validates the structured tool call. Tool use has higher schema-validity than prompt-shaped JSON mode and preserves more of the model’s free-form quality, because the constraint lives at the validation layer rather than the decoding layer. Known wart: on long inputs the model can stringify a nested sub-object to save tokens, leaving the parser with JSON inside a JSON string.
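The stringified-sub-object wart is cheap to repair before validation. A minimal sketch, assuming the tool input has already been parsed into a Python dict; the function name is hypothetical. It would also unwrap a free-text field that legitimately contains JSON, so in practice restrict it to fields whose schema type is object or array.

```python
import json
from typing import Any


def unwrap_stringified_json(value: Any) -> Any:
    """Recursively replace strings that hold a JSON object or array with the parsed value."""
    if isinstance(value, dict):
        return {k: unwrap_stringified_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_stringified_json(v) for v in value]
    if isinstance(value, str):
        stripped = value.strip()
        # Only try strings that look like a container; ordinary text stays a string.
        if stripped[:1] in ("{", "["):
            try:
                return unwrap_stringified_json(json.loads(stripped))
            except json.JSONDecodeError:
                return value
    return value


# Illustrative tool input with a nested object that came back as a string.
tool_input = {
    "customer": '{"name": "Ada", "tags": ["vip"]}',
    "note": "call back {later}",
    "broken": "{not json",
}
repaired = unwrap_stringified_json(tool_input)
```

Count how often the repair fires per template; a rising rate on long inputs is the signal the paragraph above describes.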
Gemini responseSchema sits between the two. The configuration accepts a subset of JSON Schema, the inference layer validates server-side, and violations come back as an error envelope rather than a generated string. Guarantee: the response is either schema-valid or explicitly rejected. Failure shape: union types with many members fail silently, deep nesting (more than three levels) increases the error rate, and the supported subset is narrower than OpenAI’s, so a schema that works on gpt-4o may not be expressible on gemini-1.5-pro.
Outlines and JSONFormer are grammar-constrained generators that run over open-weight models. They build a finite-state machine from the schema (or a richer grammar: regex, CFG, custom FSM) and at every decoding step mask the logits so only state-machine-legal tokens can be sampled. Guarantee is the strongest of the four: if the model generates a token, it is grammar-legal. Failure shape: the quality tax interacts with base-model capability. Strong open-weight models (Llama-3-70B, Mistral-Large) pay a small tax; weaker models collapse to whichever schema-legal continuation has the lowest perplexity, which is rarely the right answer. Outlines also supports grammars richer than JSON Schema, so you can constrain a chain-of-thought to a regular expression or a multi-step plan to a CFG. The deterministic vs LLM judge evals post covers where deterministic grammars pull weight that judges cannot.
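To make the masking step concrete, here is a toy character-level sketch. It is not how Outlines is implemented (real libraries compile the grammar into an FSM over the tokenizer vocabulary); the fake model, its scores, and every function name are assumptions for illustration. The toy model wants to write "critical", which the enum does not allow, so the mask forces it onto the best-scoring legal path.

```python
import math

ENUM_VALUES = ["low", "normal", "high", "urgent"]


def allowed_next_chars(prefix: str, values: list[str]) -> set[str]:
    """Characters that keep the prefix on a path to at least one legal value."""
    return {v[len(prefix)] for v in values if v.startswith(prefix) and len(v) > len(prefix)}


def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set every illegal character to -inf so it can never be chosen."""
    return {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}


def constrained_greedy_decode(logits_fn, values: list[str]) -> str:
    """Greedy decoding under the mask; stops once the prefix is a complete legal value.

    Assumes no legal value is a prefix of another legal value.
    """
    prefix = ""
    while prefix not in values:
        masked = mask_logits(logits_fn(prefix), allowed_next_chars(prefix, values))
        prefix += max(masked, key=masked.get)
    return prefix


def toy_logits(prefix: str) -> dict[str, float]:
    """Hypothetical model: strongly prefers spelling 'critical', with a mild prior on 'n'."""
    target = "critical"
    scores = {ch: 0.0 for ch in "abcdefghijklmnopqrstuvwxyz"}
    if len(prefix) < len(target):
        scores[target[len(prefix)]] = 5.0
    scores["n"] += 1.0
    return scores


decoded = constrained_greedy_decode(toy_logits, ENUM_VALUES)  # "normal"
```

The output is always a legal enum value, and it is decided by whichever legal first character scored highest after masking rather than by what the model was trying to say. That is the mechanism behind enum collapse.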
| Mode | Mechanism | Schema guarantee | Quality tax (rough; easy vs hard prompts) |
|---|---|---|---|
| OpenAI strict | Token-level constrained decoding | Strong (returns iff schema-legal) | 0-3 percent easy, 10-15 percent hard |
| Anthropic JSON / tool use | Prompt + post-validation | Medium (prompt-shaped JSON, best effort), Strong (tool use) | 0-2 percent easy, 4-8 percent hard |
| Gemini responseSchema | Server-side schema validation | Strong (validated or rejected) | 0-3 percent easy, 8-12 percent hard |
| Outlines / JSONFormer | FSM logit masking | Strongest (grammar-legal tokens only) | Scales with base model; 5-20 percent |
These are ranges from customer workloads in the first half of 2026, not from a controlled benchmark. Treat them as the right order of magnitude on a 200-row golden set; run your own reproduction before treating them as your number.
Schema-validity rate per mode (the floor)
The first axis is the easy one: did the output parse. Score it deterministically, run it free, treat the result as the floor of the eval, not as a substitute for the rest.
import json

from pydantic import BaseModel, ValidationError
from fi.evals import Evaluator
from fi.evals.templates import EvaluateFunctionCalling

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")


def schema_validity(raw_text: str, response_model: type[BaseModel]) -> bool:
    """Deterministic floor: True when raw_text parses and validates against the model."""
    try:
        response_model.model_validate_json(raw_text)
        return True
    except (ValidationError, json.JSONDecodeError):
        return False


# Run across the golden set per (provider, mode) and aggregate.
# `cases` holds the golden-set rows for one (provider, mode) cell, built elsewhere.
results = evaluator.evaluate(
    eval_templates=[EvaluateFunctionCalling()],
    inputs=cases,
)
EvaluateFunctionCalling (verified at fi/evals/templates.py, eval_id="98") is the deterministic schema-compliance template. It scores whether the structured payload conforms to the declared function or response model; pair it with the Pydantic model_validate_json for the offline floor.
Two patterns to watch as you aggregate. First, the cells where the rate sits between 95 and 99 percent are the interesting ones. At 100 percent you have a mode that works; at 90 percent you have a mode that does not; at 97 you have a mode whose failures concentrate on a recognisable subset of inputs. Cluster those failures. Second, modes where the rate hits 100 percent with retries hidden behind the SDK (Instructor on top of OpenAI strict, for instance) are paying the cost on the retry line item. Chart the retry distribution per template alongside the headline rate.
The right unit is not “schema validity per call.” It is “schema validity per call after N retries, at cost C.” The companion evaluating Instructor structured outputs post covers the retry cost shape directly.
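A small sketch of that unit, assuming your retry wrapper logs, for each call, the 1-based attempt on which the output first validated (or None if it never did). The log format, function names, and the cost figure are hypothetical.

```python
from typing import Optional


def validity_at(first_valid_attempt: list[Optional[int]], retries: int) -> float:
    """Share of calls that validated within `retries` retries (attempt 1 means zero retries)."""
    ok = sum(1 for a in first_valid_attempt if a is not None and a <= retries + 1)
    return ok / len(first_valid_attempt)


def mean_cost(
    first_valid_attempt: list[Optional[int]], max_attempts: int, cost_per_attempt: float
) -> float:
    """Mean spend per call; calls that never validate burn the whole attempt budget."""
    attempts = [a if a is not None else max_attempts for a in first_valid_attempt]
    return cost_per_attempt * sum(attempts) / len(attempts)


# Illustrative log for one (mode, template) cell: ten calls, budget of three attempts.
attempt_log = [1, 1, 1, 2, 1, 3, None, 1, 1, 2]
curve = {n: validity_at(attempt_log, n) for n in range(3)}  # {0: 0.6, 1: 0.8, 2: 0.9}
cost = mean_cost(attempt_log, max_attempts=3, cost_per_attempt=0.01)
```

Report the whole curve rather than its last point: the 0.9 at two retries and the 0.6 at zero retries describe the same cell, and only the second tells you what the mode does on its own.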
Semantic quality tax (the actual answer)
The harder axis is the one most teams skip. After the schema validates, are the values right?
The cleanest measurement is a head-to-head: same model, same prompt, free-form vs constrained. Run the prompt with no structured-output mode and grade the answer with a judge. Run the same prompt under strict / JSON / schema / Outlines and grade against the same rubric. The delta is the quality tax for that mode on that prompt.
from fi.evals.templates import CustomLLMJudge

semantic_judge = CustomLLMJudge(
    name="ExtractionSemanticQuality",
    rubric=(
        "Score whether the extracted fields reflect the user input. "
        "5 = every field accurate and well-grounded in the input. "
        "3 = one or two fields off (wrong enum, wrong category, but plausible). "
        "1 = multiple fields wrong or hallucinated. "
        "Ignore JSON syntax; assume the schema parses. "
        "Score the semantics."
    ),
    input_mapping={
        "user_input": "input",
        "extracted_object": "output",
    },
)

# `evaluator` is the Evaluator instance created in the schema-validity snippet.
# freeform_cases and strict_cases hold the same inputs run without and with the mode.
freeform_results = evaluator.evaluate(eval_templates=[semantic_judge], inputs=freeform_cases)
strict_results = evaluator.evaluate(eval_templates=[semantic_judge], inputs=strict_cases)
Average across the golden set, per mode. The mode-vs-freeform delta is the quality tax. Track it per prompt template, not just per mode. The tax on routine extraction is usually within noise; the tax on adversarial or long-tail prompts is where the mode choice earns its keep.
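The per-template delta can be sketched as below, assuming the judge results have been flattened into rows with a template name and a 1-5 score. The row shape and function name are hypothetical; how you pull scores out of the evaluator results depends on your SDK version.

```python
from collections import defaultdict
from statistics import mean


def quality_tax(freeform: list[dict], constrained: list[dict]) -> dict[str, float]:
    """Per-template tax: mean free-form judge score minus mean constrained judge score."""

    def by_template(rows: list[dict]) -> dict[str, float]:
        buckets = defaultdict(list)
        for row in rows:
            buckets[row["template"]].append(row["score"])
        return {template: mean(scores) for template, scores in buckets.items()}

    free, cons = by_template(freeform), by_template(constrained)
    # Only templates scored under both conditions have a meaningful delta.
    return {t: free[t] - cons[t] for t in free.keys() & cons.keys()}


# Illustrative scores; the numbers are made up.
freeform_rows = [
    {"template": "routine", "score": 4.6}, {"template": "routine", "score": 4.4},
    {"template": "adversarial", "score": 4.0}, {"template": "adversarial", "score": 4.2},
]
strict_rows = [
    {"template": "routine", "score": 4.5}, {"template": "routine", "score": 4.5},
    {"template": "adversarial", "score": 3.4}, {"template": "adversarial", "score": 3.6},
]
tax = quality_tax(freeform_rows, strict_rows)
```

With these rows the routine template shows no tax and the adversarial one shows about 0.6 points, which a single per-mode average would have blurred to roughly 0.3.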
Three places the tax shows up loudest. Enum collapse: a Literal["low", "normal", "high", "urgent"] field where the model under constraint defaults to "normal" more often than free-form would. Numeric drift: a Field(ge=0, le=120) on age where the constrained model returns the corpus mean when the input never mentioned an age. Field dropping: optional fields silently omitted under strict modes because the constrained decoder treats them as zero-cost to skip. The LLM judge prompt engineering guide covers rubric patterns that catch enum collapse specifically.
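Enum collapse shows up as a shift in the label distribution before any judge runs. A minimal sketch, assuming the free-form answers have already been mapped onto the same enum labels (by a parser or a judge); the function name and the counts are hypothetical.

```python
from collections import Counter


def enum_shift(freeform_values: list[str], constrained_values: list[str]) -> dict[str, float]:
    """Change in each label's share when moving from free-form to constrained output."""
    free, cons = Counter(freeform_values), Counter(constrained_values)
    labels = set(free) | set(cons)
    return {
        label: cons[label] / len(constrained_values) - free[label] / len(freeform_values)
        for label in labels
    }


# Illustrative priority labels for the same 20 inputs under each condition.
freeform_priority = ["low"] * 5 + ["normal"] * 10 + ["high"] * 3 + ["urgent"] * 2
strict_priority = ["low"] * 3 + ["normal"] * 15 + ["high"] * 1 + ["urgent"] * 1

shift = enum_shift(freeform_priority, strict_priority)
most_inflated = max(shift, key=shift.get)  # "normal"
```

A label whose share grows by a large margin under constraint, while every other label shrinks, is the cheap first signal; the per-field judge then confirms whether the moved rows were actually wrong.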
Failure-mode catalog (refused, partial, truncated, degraded)
Four failure shapes recur, ordered from least to most damaging.
Refused. The model returns a refusal or an error envelope. OpenAI strict raises a 400 on schemas the decoder cannot compile; Gemini returns an error on responseSchema violations; Outlines fails to find a schema-legal completion within the max-tokens budget. Refusals are visible (your error logs catch them) and the fix is usually a schema simplification.
Partial. Required fields populated, optional fields the model would have returned in free-form mode dropped. The most common silent failure on OpenAI strict and Gemini, because the constraint optimises for the cheapest schema-legal completion and an absent optional field is cheap. Eval signal: Completeness (verified at fi/evals/templates.py). Score the population rate of optional fields per mode and the gap appears.
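Alongside the Completeness template, a local population-rate check makes the gap visible per field. This is a sketch under the assumption that outputs are already parsed dicts; the function name and the sample data are hypothetical.

```python
def optional_population_rate(outputs: list[dict], optional_fields: list[str]) -> dict[str, float]:
    """Share of outputs in which each optional field is present and non-empty."""

    def populated(obj: dict, field: str) -> bool:
        # Missing keys, nulls, and empty containers all count as not populated.
        return obj.get(field) not in (None, "", [], {})

    return {
        field: sum(populated(obj, field) for obj in outputs) / len(outputs)
        for field in optional_fields
    }


# Illustrative outputs from one mode on one template.
strict_outputs = [
    {"priority": "high", "assignee": "ana"},
    {"priority": "low"},
    {"priority": "low", "assignee": None},
    {"priority": "normal", "due_date": "2026-03-01"},
]
rates = optional_population_rate(strict_outputs, ["assignee", "due_date"])
```

Run it on the same inputs under each mode and compare the rates field by field; the mode with the lowest rate on a field the inputs clearly support is the one dropping it.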
Truncated. A deep nested object or long array hits max_tokens mid-emit. The JSON is invalid; the schema validator rejects it; the failure looks like a schema-validity miss. Mitigation: raise max_tokens and chunk long outputs across multiple calls. Eval signal: parse-error analysis on failing rows. A JSONDecodeError at the tail of a long array is a truncation, not a schema design bug.
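One way to separate truncation from other parse failures is to check whether the text ends while still inside a string or an open container. The heuristic below is a sketch, not a full JSON prefix validator, and the function name is hypothetical. Where the provider reports why generation stopped (for example a finish reason indicating the length limit), prefer that signal and use this only as a fallback.

```python
import json


def looks_truncated(raw_text: str) -> bool:
    """True when the text fails to parse and ends inside an open string or container."""
    try:
        json.loads(raw_text)
        return False  # parses cleanly, so not truncated
    except json.JSONDecodeError:
        pass

    closers = {"}": "{", "]": "["}
    stack: list[str] = []
    in_string = escaped = False
    for ch in raw_text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False  # mismatched closer: malformed, not truncated
    return in_string or bool(stack)
```

For example, `looks_truncated('{"items": [1, 2, 3')` is True, while `looks_truncated('{"a": 1} trailing commentary')` is False because every container closed before the extra text. Route the first kind to a max_tokens fix and the second to a prompt or parser fix.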
Schema-valid but semantically degraded. Every field parses. Every Literal is in the allowed set. Every numeric field sits inside its Pydantic range. And the answer is wrong: the enum collapsed to a safe default, the numeric field returned the prior mean, the free-text summary referenced none of the input. The first three look like errors. The fourth looks like success and is the one that ships. The agent failure modes breakdown covers the agent-level taxonomy this slots into.
The right diagnostic is the per-field judge plus the cross-field assertion. Schema validity will not see the fourth failure mode. A semantic judge per field will.
Production patterns: what the loop looks like
The order of operations that holds up on real structured-output workloads.
Layer the gates cheapest-first. Schema validity is the deterministic floor: run it free, fail fast. Cross-field assertions (range tightening, sum-of-line-items checks, date-range invariants) run next, also deterministic, also free. Per-field semantic judges run last, only on rows the deterministic gates let through. Judges are the expensive line item; push as much catch as possible to the free layers.
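The layering can be sketched with a hypothetical invoice payload. Gate 1 stands in for the Pydantic validation shown earlier, gate 2 is a sum-of-line-items assertion, and the judge is passed in as a plain callable so it only runs on rows that survive both free gates. All names and the payload shape are assumptions.

```python
import json
from typing import Callable, Optional


def parse_invoice(raw: str) -> Optional[dict]:
    """Gate 1 (free): parse and check the fields this sketch relies on."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    items = obj.get("line_items")
    if not isinstance(obj.get("total"), (int, float)) or not isinstance(items, list):
        return None
    if not all(isinstance(i, dict) and isinstance(i.get("amount"), (int, float)) for i in items):
        return None
    return obj


def totals_match(invoice: dict) -> bool:
    """Gate 2 (free): cross-field assertion, line items must sum to the total."""
    return abs(sum(i["amount"] for i in invoice["line_items"]) - invoice["total"]) < 0.01


def run_gates(raw: str, judge: Callable[[dict], float], min_score: float = 4.0) -> str:
    """Return the first gate that fails, or 'pass'. The judge is called last."""
    invoice = parse_invoice(raw)
    if invoice is None:
        return "schema"
    if not totals_match(invoice):
        return "cross_field"
    return "pass" if judge(invoice) >= min_score else "semantic"
```

Because each row is labelled with the first gate it failed, the share of rows reaching the judge doubles as a cost estimate for the expensive layer.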
Score the modes head-to-head on the same golden set. 200 to 500 prompts that mirror production depth and width. Run the same input under all four modes (OpenAI strict, Anthropic JSON / tool use, Gemini schema, Outlines if you run open-weight). Score the same rubric stack on each. The resulting matrix (mode on the rows, eval axis on the columns) is the artifact that decides the production routing.
Route per-template, not per-application. The cheapest mode that passes the rubric is the right mode for that template, not the right mode for the application. A simple extraction template might run cleanly on Gemini schema; an adversarial-input template might need OpenAI strict; a free-form-with-light-structure template might do better on Anthropic tool use. The AI gateway evaluation post covers per-template routing through the gateway abstraction.
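A sketch of the routing rule, assuming you already have the per-template score matrix and a per-call cost estimate for each mode. The mode keys, scores, and costs below are invented for illustration and say nothing about real provider pricing.

```python
from typing import Optional


def pick_mode(
    matrix: dict[str, dict[str, float]],
    thresholds: dict[str, float],
    cost_per_call: dict[str, float],
) -> Optional[str]:
    """Cheapest mode whose scores clear every threshold for this template, else None."""
    passing = [
        mode
        for mode, scores in matrix.items()
        if all(scores.get(axis, float("-inf")) >= floor for axis, floor in thresholds.items())
    ]
    return min(passing, key=lambda mode: cost_per_call[mode]) if passing else None


# Hypothetical matrix for one template.
template_matrix = {
    "gemini_schema": {"schema_validity": 0.99, "semantic": 3.7},
    "openai_strict": {"schema_validity": 0.995, "semantic": 4.2},
    "anthropic_tool_use": {"schema_validity": 0.98, "semantic": 4.4},
}
floors = {"schema_validity": 0.95, "semantic": 4.0}
costs = {"gemini_schema": 0.002, "openai_strict": 0.004, "anthropic_tool_use": 0.006}

chosen = pick_mode(template_matrix, floors, costs)  # "openai_strict"
```

Here the cheapest mode misses the semantic floor, so the rule falls through to the next-cheapest one that clears both. A None result means no mode passes for that template and the fix belongs in the schema or prompt, not the router.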
Watch retries as a cost signal. Strict modes that re-prompt on parse failure, Instructor-style retry loops, and Outlines refusals all compound into a cost line item that does not show up in the headline schema-validity rate. Chart attempt count per (mode, template) and alert on shifts. A template whose mean attempts per call climb from 1.0 to 1.8 has nearly doubled in cost without changing in correctness.
Cluster failures into fixes. Failing rows are not random. They concentrate on prompt templates, schema shapes, or input lengths. Cluster them, label each cluster, and ship the fix. The fix is usually small: a tightened Field description, a one-shot example, a switch to a different mode for that template, or a flattened sub-object on a schema that was nesting too deep.
from fi.evals import Evaluator
from fi.evals.templates import (
    EvaluateFunctionCalling,
    Completeness,
    Groundedness,
    CustomLLMJudge,
)

# API_KEY and SECRET_KEY come from your environment or secret store.
evaluator = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY)

# semantic_judge is the CustomLLMJudge defined in the quality-tax section.
results = evaluator.evaluate(
    eval_templates=[
        EvaluateFunctionCalling(),  # schema validity
        Completeness(),             # optional-field population rate
        semantic_judge,             # per-field semantic quality
        Groundedness(),             # free-text fields grounded in input
    ],
    inputs=cases,
)
The CI gate is per-axis: EvaluateFunctionCalling >= 0.95, Completeness >= 0.85, ExtractionSemanticQuality >= 4.0 (on 1-5), Groundedness >= 0.80. A regression on any one axis fails the build on the axis that broke; one bisect, one fix.
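A minimal sketch of that per-axis gate, assuming the evaluator results have already been aggregated into a flat dict of axis name to mean score. That aggregation step and the function names are assumptions; the thresholds are the ones listed above.

```python
THRESHOLDS = {
    "EvaluateFunctionCalling": 0.95,
    "Completeness": 0.85,
    "ExtractionSemanticQuality": 4.0,  # on the 1-5 judge scale
    "Groundedness": 0.80,
}


def failing_axes(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Axes whose aggregate score is missing or below its floor."""
    return [
        axis for axis, floor in thresholds.items()
        if scores.get(axis, float("-inf")) < floor
    ]


def gate(scores: dict[str, float]) -> int:
    """Print one line per failing axis and return a process exit code."""
    failed = failing_axes(scores)
    for axis in failed:
        print(f"FAIL {axis}: {scores.get(axis)} < {THRESHOLDS[axis]}")
    return 1 if failed else 0


# In a CI script: sys.exit(gate(aggregated_scores))
```

A missing axis counts as a failure on purpose, so a silently skipped evaluator cannot pass the build.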
How Future AGI ships the structured-output eval stack
Three surfaces, one loop.
ai-evaluation SDK (Apache 2.0) ships EvaluateFunctionCalling for schema-validity, Completeness for optional-field coverage, Groundedness and ContextAdherence for free-text fields, plus CustomLLMJudge for per-field rubrics. 50+ pre-built evaluators on Turing models (LARGE / SMALL / FLASH), 20+ local heuristic metrics that run sub-second with zero API cost, and async submission via evaluator.submit(...).wait().
traceAI (Apache 2.0) ships the OpenTelemetry-native span tree where every structured-output call records fi.span.kind=LLM, the mode used (json_schema, json_mode, responseSchema, tool_use, outlines_fsm), the schema-compliance result, the retry count, and the parsed object. 50+ instrumentors across Python, TypeScript, Java, and C#, including OpenAIInstrumentor, AnthropicInstrumentor, GeminiInstrumentor, and the InstructorInstrumentor covered in the Instructor evaluation deep dive. EvalTag rules attach per-field rubrics to spans so the same scorecard runs offline in CI and live on sampled production spans.
Future AGI Platform ships self-improving evaluators tuned by reviewer feedback, in-product rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the eval stack: HDBSCAN soft-clustering over span embeddings in ClickHouse, then a Claude Sonnet 4.5 Judge with a 30-turn budget and eight span-tools writes one immediate_fix per cluster. On structured-output work the clusters tend to be mode-shaped: a strict-mode enum-collapse cluster on one template, a Gemini union-failure cluster on schemas with five-plus members, an Outlines truncation cluster on long-output prompts. Each cluster carries a 5-category 30-subtype taxonomy entry, a 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution), and an immediate_fix, usually a tightened schema, a one-shot, or a mode switch. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are roadmap.
Honest tradeoff: if you ship one mode against one provider and your schemas are shallow, the deterministic schema-validity check alone is enough. The three-layer stack earns its weight when you run multiple modes in production with deep schemas and enough volume that a 3 percent semantic-quality delta translates into real customer cost.
Anti-patterns to avoid
Five mistakes that recur on production structured-output programs.
Reporting only schema-validity rate. A 99.5 percent rate with enum collapse is worse than a 96 percent rate with honest answers. Pair the schema rate with a per-field semantic score every time.
Mode-blind benchmarking. “GPT-4o vs Claude 3.5 Sonnet on JSON extraction” is the wrong unit. The right unit is “GPT-4o on strict mode vs GPT-4o on plain JSON mode vs Claude on tool use vs Claude on prompt-shaped JSON.” Mode is part of the system; score it.
Five-field schemas. Your test schema has five fields. Your production schema has fifty, with three levels of nesting, two unions, and four optional fields. Long schemas degrade differently; mirror production depth, width, and union complexity.
No retry-cost audit. Constrained-decoding retries and Instructor retry-on-validation-error loops are invisible in the headline rate. Chart retry distribution per template; alert on shifts; treat retries as a cost regression even when the final pass rate is unchanged.
Cross-provider parity assumed. The schema that works on OpenAI strict will not work identically on Anthropic tool use or Gemini responseSchema. The parity check is one eval pass per mode. Skip it once and the next provider migration becomes a quarter of unscheduled work.
What to do this week
Five steps, one schema.
- Pick one response_model and one prompt template that runs in production today. Pull 200 sampled traces with the input and the structured output.
- Re-run those 200 inputs across all four modes (OpenAI strict, Anthropic tool use, Gemini responseSchema, and Outlines if you have an open-weight path). Score schema validity, Completeness, and one per-field CustomLLMJudge on each mode.
- Build the mode-vs-mode matrix. Spot the cells where the schema-validity rate is high and the semantic score is low. Those are the silent-failure cells.
- Wire traceAI so the production span carries the mode, the parse result, and the retry count. Add EvalTag rules so the per-field rubric runs on sampled production spans live.
- Turn on Error Feed. Watch the first week’s clusters. Promote representative rows into the regression set. Run a BayesianSearchOptimizer study on the template paying the largest quality tax, with the per-field judge score as the optimisation target.
The teams shipping reliable structured-output applications in 2026 stopped reporting “JSON parse rate” as the metric and started reporting schema_validity_rate × semantic_quality_on_passes per mode, per template. The mode gives you the shape for free. The eval stack tells you what the shape cost you.
Related reading
- Evaluating Instructor Structured Outputs (2026)
- LLM Function Calling Evaluation (2025)
- Deterministic vs LLM Judge Evals (2026)
- Evaluating Tool-Calling Agents (2026)
- Agent Passes Evals Fails Production (2026)
- The Definitive Guide to AI Agent Evaluation (2026)
- LLM Judge Prompt Engineering Guide (2026)