
Evaluating LLM Structured Output Modes (2026)

Compare OpenAI strict, Anthropic JSON, Gemini schema, and Outlines grammar-constrained generation: schema-validity rate, quality tax, failure modes.


An extraction pipeline ships on OpenAI strict with a 99.4 percent schema-validity rate. Every JSON parses. Every Field constraint holds. Three weeks later the support team finds out that priority: 'urgent' was the chosen value on roughly 11 percent of inbound tickets when the right answer was 'normal' or 'high'. The model had collapsed to a safer default to keep the grammar happy. The schema check never had a chance.

Structured output modes guarantee the schema, not the quality. OpenAI strict, Anthropic JSON, Gemini responseSchema, and Outlines / JSONFormer for grammar-constrained generation all produce parseable output. They do not produce the same answer the model would have produced if you had let it write English. Each mode carries its own quality tax: sometimes zero, sometimes fifteen percent on hard prompts. The only eval that matters is schema_validity_rate × semantic_quality_on_passes, measured on your data, per mode. This post is the working pattern for that measurement, with the failure-mode catalog and the production loop that catches the tax before it ships.

Why mode comparison matters

Most teams pick a structured-output mode the way they pick a font. OpenAI ships strict, so the OpenAI client uses strict. Claude has a tool-use surface, so the Claude code path uses tool use. Gemini has responseSchema, so Gemini gets responseSchema. The headline parse rate hits 99 percent and the code ships.

That pattern fails because the four modes are different systems with different failure profiles. The schema-validity numbers converge near 100. The semantic-quality numbers do not, and the gap is the entire post.

Three reasons mode choice deserves its own eval pass. First, the LLM function calling deep dive is the closest companion pattern, but it only covers tool use; you can have structured output without a tool call, and the modes diverge there. Second, the “agent passes evals, fails production” failure shape often turns out to be mode-driven: a constrained decoder degrades silently on prompt templates the eval set under-samples. Third, retries are the silent cost line item. Outlines refusals, OpenAI strict’s occasional parse failures, and Instructor-style retry-on-validation-error all compound into a bill your dashboard does not surface unless you instrument for it.

The four modes: mechanics and guarantees

Read these as four implementations of the same interface, not four flavors of the same thing.

OpenAI strict (response_format={"type": "json_schema", "json_schema": {"name": ..., "strict": true, "schema": ...}}) compiles the supplied JSON Schema into a constrained-decoding grammar at the inference layer. Every token sampled is filtered through a mask that keeps the partial output schema-legal. Guarantee: if the call returns a complete, non-refused message, the JSON parses and matches the schema. Failure shape: strict mode requires every property to be listed in required, so optional fields have to be modelled as nullable and tend to come back null; schemas the decoder cannot represent (dynamic property names, keywords outside the supported subset) are rejected up front; and on hard prompts the model’s preferred token gets masked out and the answer collapses to a safer enum default.
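A minimal sketch of the request shape and a pre-flight check, assuming the Chat Completions form in which strict sits inside the json_schema object and every object lists all properties in required with additionalProperties set to false. The helper strict_schema_problems and the ticket schema are illustrative names, not part of any SDK, and the check covers only those two preconditions.

```python
import json

def strict_schema_problems(schema: dict, path: str = "$") -> list:
    """List ways an object schema breaks two strict-mode preconditions."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        for name, sub in props.items():
            problems.extend(strict_schema_problems(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_schema_problems(schema["items"], f"{path}[]"))
    return problems

# Hypothetical ticket schema used only for illustration.
ticket_schema = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "normal", "high", "urgent"]},
        # Optional in spirit: kept in `required` but allowed to be null.
        "due_date": {"type": ["string", "null"]},
    },
    "required": ["priority", "due_date"],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {"name": "ticket", "strict": True, "schema": ticket_schema},
}

print(strict_schema_problems(ticket_schema))
print(json.dumps(response_format, indent=2))
```

Running the check in CI before a schema change ships turns a runtime rejection into a failed unit test.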

Anthropic JSON mode has two surfaces. The simple one is prompt-shaped: you ask the model to return JSON in the system prompt and parse the response. There is no token-level constraint; the guarantee is “best effort” and the failure shape is near-JSON output (trailing commentary, missing braces, comments). The richer surface is tool use, where input_schema becomes the contract and Anthropic validates the structured tool call. Tool use has higher schema-validity than prompt-shaped JSON mode and preserves more of the model’s free-form quality, because the constraint lives at the validation layer rather than the decoding layer. Known wart: on long inputs the model can stringify a nested sub-object to save tokens, leaving the parser with JSON inside a JSON string.
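The stringified sub-object wart can be repaired before validation. The sketch below is one possible normaliser, not an Anthropic feature: unwrap_stringified is a hypothetical helper that re-parses any string value that looks like a JSON object or array and leaves ordinary text alone.

```python
import json

def unwrap_stringified(value):
    """Recursively replace strings that hold a JSON object or array with the parsed value."""
    if isinstance(value, dict):
        return {k: unwrap_stringified(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_stringified(v) for v in value]
    if isinstance(value, str) and value.lstrip()[:1] in ("{", "["):
        try:
            return unwrap_stringified(json.loads(value))
        except json.JSONDecodeError:
            return value  # looked like JSON but was ordinary text
    return value

# Made-up tool input: `customer` arrived as JSON inside a JSON string.
tool_input = {
    "customer": '{"name": "Ada", "tags": ["vip"]}',
    "note": "[sic] as written",
}
print(unwrap_stringified(tool_input))
```

In practice it is safer to apply this only to fields the schema declares as objects or arrays, since a free-text field may legitimately contain JSON. Count how often the repair fires per template; a rising count is an eval signal in its own right.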

Gemini responseSchema sits between the two. The configuration accepts a subset of JSON Schema, the inference layer validates server-side, and violations come back as an error envelope rather than a generated string. Guarantee: the response is either schema-valid or explicitly rejected. Failure shape: union types with many members fail silently, deep nesting (more than three levels) increases the error rate, and the supported subset is narrower than OpenAI’s, so a schema that works on gpt-4o may not be expressible on gemini-1.5-pro.

Outlines and JSONFormer are grammar-constrained generators that run over open-weight models. They build a finite-state machine from the schema (or a richer grammar: regex, CFG, custom FSM) and at every decoding step mask the logits so only state-machine-legal tokens can be sampled. Guarantee is the strongest of the four: if the model generates a token, it is grammar-legal. Failure shape: the quality tax interacts with base-model capability. Strong open-weight models (Llama-3-70B, Mistral-Large) pay a small tax; weaker models collapse to whichever schema-legal continuation has the lowest perplexity, which is rarely the right answer. Outlines also supports grammars richer than JSON Schema, so you can constrain a chain-of-thought to a regular expression or a multi-step plan to a CFG. The deterministic vs LLM judge evals post covers where deterministic grammars pull weight judges cannot.
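To make the masking step concrete, here is a toy illustration with a six-token vocabulary and a single grammar state. It is not how Outlines is implemented (real decoders compile the schema into a state machine over the tokenizer vocabulary); the state name and the scores are made up to show how the preferred token can be masked away.

```python
import math

# Hypothetical one-state "FSM": after the key "priority", only enum values are legal.
LEGAL_NEXT = {"after_priority_key": {"low", "normal", "high", "urgent"}}

def constrained_pick(logits: dict, state: str) -> str:
    """Mask every token that is illegal in this state, then pick greedily."""
    legal = LEGAL_NEXT[state]
    masked = {
        tok: (score if tok in legal else -math.inf)
        for tok, score in logits.items()
    }
    return max(masked, key=masked.get)

# Made-up scores: the unconstrained model would rather hedge in prose.
logits = {"it": 2.1, "depends": 1.7, "normal": 0.9, "high": 0.8, "urgent": 0.2, "low": 0.1}

free_choice = max(logits, key=logits.get)
legal_choice = constrained_pick(logits, "after_priority_key")
print(free_choice, legal_choice)
```

The free-form pick is a hedge; the constrained pick is whichever legal token happened to score highest. The output is always grammar-legal, but nothing in the mask makes it the right enum value.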

| Mode | Mechanism | Schema guarantee | Quality tax (rough) |
| --- | --- | --- | --- |
| OpenAI strict | Token-level constrained decoding | Strong (returns iff schema-legal) | 0-3 percent easy, 10-15 percent hard |
| Anthropic JSON / tool use | Prompt + post-validation | Medium (best effort), Strong (tool use) | 0-2 percent easy, 4-8 percent hard |
| Gemini responseSchema | Server-side schema validation | Strong (validated or rejected) | 0-3 percent easy, 8-12 percent hard |
| Outlines / JSONFormer | FSM logit masking | Strongest (grammar-legal tokens only) | Scales with base model; 5-20 percent |

These are ranges from customer workloads in the first half of 2026, not from a controlled benchmark. Treat them as the right order of magnitude on a 200-row golden set; run your own reproduction before treating them as your number.

Schema-validity rate per mode (the floor)

The first axis is the easy one: did the output parse. Score it deterministically, run it free, treat the result as the floor of the eval, not as a substitute for the rest.

import json
from pydantic import BaseModel, ValidationError
from fi.evals import Evaluator
from fi.evals.templates import EvaluateFunctionCalling

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

def schema_validity(raw_text: str, response_model: type[BaseModel]) -> bool:
    try:
        response_model.model_validate_json(raw_text)
        return True
    except (ValidationError, json.JSONDecodeError):
        return False

# `cases` is the golden-set input list for one (provider, mode) pair, built elsewhere.
# Run across the golden set per (provider, mode) and aggregate.
results = evaluator.evaluate(
    eval_templates=[EvaluateFunctionCalling()],
    inputs=cases,
)

EvaluateFunctionCalling (verified at fi/evals/templates.py, eval_id="98") is the deterministic schema-compliance template. It scores whether the structured payload conforms to the declared function or response model; pair it with the Pydantic model_validate_json for the offline floor.

Two patterns to watch as you aggregate. First, the cells where the rate sits between 95 and 99 percent are the interesting ones. At 100 percent you have a mode that works; at 90 percent you have a mode that does not; at 97 you have a mode whose failures concentrate on a recognisable subset of inputs. Cluster those failures. Second, modes where the rate hits 100 percent with retries hidden behind the SDK (Instructor on top of OpenAI strict, for instance) are paying the cost on the retry line item. Chart the retry distribution per template alongside the headline rate.

The right unit is not “schema validity per call.” It is “schema validity per call after N retries, at cost C.” The companion evaluating Instructor structured outputs post covers the retry cost shape directly.
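One way to compute that unit from attempt logs is sketched below. CallLog and the flat per-attempt cost are assumptions made for the example; real logs would carry token counts and provider pricing.

```python
from dataclasses import dataclass

@dataclass
class CallLog:
    attempts_valid: list      # one bool per attempt, in order
    cost_per_attempt: float   # hypothetical flat cost in dollars

def validity_after(logs: list, max_attempts: int) -> tuple:
    """Return (share of calls valid within max_attempts, mean cost per call)."""
    passed, total_cost = 0, 0.0
    for log in logs:
        window = log.attempts_valid[:max_attempts]
        # Retrying stops at the first valid output, so cost counts up to and including it.
        used = window.index(True) + 1 if True in window else len(window)
        passed += True in window
        total_cost += used * log.cost_per_attempt
    return passed / len(logs), total_cost / len(logs)

# Made-up logs: one clean call, two that need retries, one that never validates.
logs = [
    CallLog([True], 0.01),
    CallLog([False, True], 0.01),
    CallLog([False, False, True], 0.01),
    CallLog([False, False, False], 0.01),
]
for n in (1, 2, 3):
    print(n, validity_after(logs, n))
```

Reading the rate and the cost side by side for each retry budget shows how much of the headline validity is being bought with extra attempts.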

Semantic quality tax (the actual answer)

The harder axis is the one most teams skip. After the schema validates, are the values right?

The cleanest measurement is a head-to-head: same model, same prompt, free-form vs constrained. Run the prompt with no structured-output mode and grade the answer with a judge. Run the same prompt under strict / JSON / schema / Outlines and grade against the same rubric. The delta is the quality tax for that mode on that prompt.

from fi.evals.templates import CustomLLMJudge

semantic_judge = CustomLLMJudge(
    name="ExtractionSemanticQuality",
    rubric=(
        "Score whether the extracted fields reflect the user input. "
        "5 = every field accurate and well-grounded in the input. "
        "3 = one or two fields off (wrong enum, wrong category, but plausible). "
        "1 = multiple fields wrong or hallucinated. "
        "Ignore JSON syntax; assume the schema parses. "
        "Score the semantics."
    ),
    input_mapping={
        "user_input": "input",
        "extracted_object": "output",
    },
)

freeform_results = evaluator.evaluate(eval_templates=[semantic_judge], inputs=freeform_cases)
strict_results   = evaluator.evaluate(eval_templates=[semantic_judge], inputs=strict_cases)

Average across the golden set, per mode. The mode-vs-freeform delta is the quality tax. Track it per prompt template, not just per mode. The tax on routine extraction is usually within noise; the tax on adversarial or long-tail prompts is where the mode choice earns its keep.
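The per-template delta can be computed from the judge scores with a few lines of aggregation. This is a sketch under the assumption that each scored row records its template, its arm ("freeform" or "constrained"), and a 1-5 judge score; the field names and sample scores are illustrative.

```python
from collections import defaultdict
from statistics import mean

def quality_tax(rows: list) -> dict:
    """Per template: (free-form mean - constrained mean) as a share of the free-form mean."""
    by_template = defaultdict(lambda: {"freeform": [], "constrained": []})
    for row in rows:
        by_template[row["template"]][row["arm"]].append(row["score"])
    return {
        template: (mean(arms["freeform"]) - mean(arms["constrained"])) / mean(arms["freeform"])
        for template, arms in by_template.items()
    }

# Made-up judge scores on a 1-5 scale.
rows = [
    {"template": "routine", "arm": "freeform", "score": 5},
    {"template": "routine", "arm": "constrained", "score": 5},
    {"template": "adversarial", "arm": "freeform", "score": 4},
    {"template": "adversarial", "arm": "freeform", "score": 4},
    {"template": "adversarial", "arm": "constrained", "score": 4},
    {"template": "adversarial", "arm": "constrained", "score": 3},
]
tax = quality_tax(rows)
print(tax)
```

With these sample rows the routine template pays no tax and the adversarial template pays 12.5 percent. On a real golden set, report the judge's run-to-run noise alongside the delta before reading anything into small values.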

Three places the tax shows up loudest. Enum collapse: a Literal["low", "normal", "high", "urgent"] field where the model under constraint defaults to "normal" more often than free-form would. Numeric drift: a Field(ge=0, le=120) on age where the constrained model returns the corpus mean when the input never mentioned an age. Field dropping: optional fields silently omitted under strict modes because the constrained decoder treats them as zero-cost to skip. The LLM judge prompt engineering guide covers rubric patterns that catch enum collapse specifically.
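Enum collapse in particular can be spotted without a judge by comparing value distributions between the two arms. The sketch below assumes you have the enum value chosen for the same inputs under free-form and constrained runs; the sample lists are made up.

```python
from collections import Counter

def enum_shift(freeform: list, constrained: list) -> dict:
    """Share of each enum value under constraint minus its share in free-form."""
    free, cons = Counter(freeform), Counter(constrained)
    values = sorted(set(free) | set(cons))
    return {v: cons[v] / len(constrained) - free[v] / len(freeform) for v in values}

# Made-up priority labels for the same eight inputs under each arm.
freeform = ["low", "normal", "high", "high", "urgent", "normal", "high", "low"]
constrained = ["normal", "normal", "high", "normal", "urgent", "normal", "normal", "low"]

shift = enum_shift(freeform, constrained)
collapsed_to = max(shift, key=shift.get)
print(collapsed_to, shift[collapsed_to])
```

A large positive shift on one value with matching negative shifts elsewhere is the collapse signature. It says where the mass moved, not which arm is right, so the flagged rows still go to the per-field judge.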

Failure-mode catalog (refused, partial, truncated, degraded)

Four failure shapes recur, in the order they hurt.

Refused. The model returns a refusal or an error envelope. OpenAI strict raises a 400 on schemas the decoder cannot compile; Gemini returns an error on responseSchema violations; Outlines fails to find a schema-legal completion within the max-tokens budget. Refusals are visible (your error logs catch them) and the fix is usually a schema simplification.

Partial. Required fields populated, optional fields the model would have returned in free-form mode dropped. The most common silent failure on OpenAI strict and Gemini, because the constraint optimises for the cheapest schema-legal completion and an absent optional field is cheap. Eval signal: Completeness (verified at fi/evals/templates.py). Score the population rate of optional fields per mode and the gap appears.

Truncated. A deep nested object or long array hits max_tokens mid-emit. The JSON is invalid; the schema validator rejects it; the failure looks like a schema-validity miss. Mitigation: raise max_tokens and chunk long outputs across multiple calls. Eval signal: parse-error analysis on failing rows. A JSONDecodeError at the tail of a long array is a truncation, not a schema design bug.
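A heuristic for separating truncation from other parse errors, using only the position and message that Python's json module reports. classify_parse_failure and the hit_token_limit flag are hypothetical names; where the provider reports that generation stopped at the token limit, pass that in, since it is a stronger signal than the heuristic.

```python
import json

def classify_parse_failure(raw: str, hit_token_limit: bool = False) -> str:
    """Label a raw response as 'ok', 'truncated', or 'malformed'."""
    try:
        json.loads(raw)
        return "ok"
    except json.JSONDecodeError as err:
        # Truncation signatures: the parser ran out of input, or a string never closed.
        ran_out = err.pos >= len(raw.rstrip())
        open_string = err.msg.startswith("Unterminated string")
        if hit_token_limit or ran_out or open_string:
            return "truncated"
        return "malformed"

samples = [
    '{"items": [1, 2, 3]}',
    '{"items": [1, 2, 3',
    '{"name": "Ad',
    '{"a": 1 "b": 2}',
]
labels = [classify_parse_failure(s) for s in samples]
print(labels)
```

The heuristic misses some cuts (for example a literal severed mid-word), which is why the explicit flag takes precedence when it is available.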

Schema-valid but semantically degraded. Every field parses. Every Literal is in the allowed set. Every numeric field sits inside its Pydantic range. And the answer is wrong: the enum collapsed to a safe default, the numeric field returned the prior mean, the free-text summary referenced none of the input. The first three look like errors. The fourth looks like success and is the one that ships. The agent failure modes breakdown covers the agent-level taxonomy this slots into.

The right diagnostic is the per-field judge plus the cross-field assertion. Schema validity will not see the fourth failure mode. A semantic judge per field will.

Production patterns: what the loop looks like

The order of operations that holds up on real structured-output workloads.

Layer the gates cheapest-first. Schema validity is the deterministic floor: run it free, fail fast. Cross-field assertions (range tightening, sum-of-line-items checks, date-range invariants) run next, also deterministic, also free. Per-field semantic judges run last, only on rows the deterministic gates let through. Judges are the expensive line item; push as much catch as possible to the free layers.
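The cheapest-first ordering can be sketched as a short pipeline in which the judge is only called for rows that survive both deterministic gates. The schema check, the line-item invariant, and the stub judge here are stand-ins chosen for illustration; a real pipeline would plug in its own response model and judge.

```python
import json

ALLOWED_PRIORITY = {"low", "normal", "high", "urgent"}

def schema_gate(raw: str):
    """Deterministic floor: parse and check the enum. Returns the object or None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    priority = obj.get("priority") if isinstance(obj, dict) else None
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITY:
        return None
    return obj

def cross_field_gate(obj: dict) -> bool:
    """Deterministic invariant: line items must sum to the stated total."""
    return abs(sum(obj.get("line_items", [])) - obj.get("total", 0)) < 0.01

def run_gates(raw: str, judge) -> dict:
    obj = schema_gate(raw)
    if obj is None:
        return {"stage": "schema", "passed": False}
    if not cross_field_gate(obj):
        return {"stage": "cross_field", "passed": False}
    score = judge(obj)  # the only paid step; reached by surviving rows only
    return {"stage": "judge", "passed": score >= 4.0, "score": score}

calls = []
def stub_judge(obj: dict) -> float:
    calls.append(obj)  # stand-in for a paid LLM judge call
    return 4.5

rows = [
    '{"priority": "high", "line_items": [10, 5], "total": 15}',
    '{"priority": "high", "line_items": [10, 5], "total": 99}',
    '{"priority": "critical", "line_items": [], "total": 0}',
    '{"priority": "high"',
]
results = [run_gates(raw, stub_judge) for raw in rows]
print([r["stage"] for r in results], "judge calls:", len(calls))
```

Four rows go in and the judge is called once; the other three are rejected by gates that cost nothing to run.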

Score the modes head-to-head on the same golden set. 200 to 500 prompts that mirror production depth and width. Run the same input under all four modes (OpenAI strict, Anthropic JSON / tool use, Gemini schema, Outlines if you run open-weight). Score the same rubric stack on each. The resulting matrix (mode on the rows, eval axis on the columns) is the artifact that decides the production routing.

Route per-template, not per-application. The cheapest mode that passes the rubric is the right mode for that template, not the right mode for the application. A simple extraction template might run cleanly on Gemini schema; an adversarial-input template might need OpenAI strict; a free-form-with-light-structure template might do better on Anthropic tool use. The AI gateway evaluation post covers per-template routing through the gateway abstraction.
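Per-template routing reduces to picking the cheapest mode whose scorecard clears every floor. The sketch below assumes a scorecard dict per (template, mode) with a cost_per_call field; the mode names, scores, and costs are invented for the example, and the thresholds are placeholders to replace with your own gate values.

```python
from typing import Optional

# Placeholder floors; substitute the gate values your CI enforces.
THRESHOLDS = {"schema_validity": 0.95, "completeness": 0.85, "semantic": 4.0}

def cheapest_passing_mode(scorecards: dict, thresholds: dict = THRESHOLDS) -> Optional[str]:
    """Return the lowest-cost mode that clears every threshold, or None."""
    passing = [
        (card["cost_per_call"], mode)
        for mode, card in scorecards.items()
        if all(card[axis] >= floor for axis, floor in thresholds.items())
    ]
    return min(passing)[1] if passing else None

# Made-up scorecards for two templates.
by_template = {
    "simple_extraction": {
        "gemini_schema": {"schema_validity": 0.99, "completeness": 0.90, "semantic": 4.4, "cost_per_call": 0.002},
        "openai_strict": {"schema_validity": 0.99, "completeness": 0.91, "semantic": 4.5, "cost_per_call": 0.004},
    },
    "adversarial_input": {
        "gemini_schema": {"schema_validity": 0.98, "completeness": 0.88, "semantic": 3.6, "cost_per_call": 0.002},
        "openai_strict": {"schema_validity": 0.99, "completeness": 0.90, "semantic": 4.2, "cost_per_call": 0.004},
    },
}
routing_table = {t: cheapest_passing_mode(cards) for t, cards in by_template.items()}
print(routing_table)
```

A None entry means no mode passes for that template, which is a prompt or schema problem rather than a routing one.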

Watch retries as a cost signal. Strict modes that re-prompt on parse failure, Instructor-style retry loops, and Outlines refusals all compound into a cost line item that does not show up in the headline schema-validity rate. Chart retry count per (mode, template) and alert on shifts. A template whose mean attempts per request climb from 1.0 to 1.8 costs 80 percent more per request without changing in correctness.

Cluster failures into fixes. Failing rows are not random. They concentrate on prompt templates, schema shapes, or input lengths. Cluster them, label each cluster, and ship the fix. The fix is usually small: a tightened Field description, a one-shot example, a switch to a different mode for that template, or a flattened sub-object on a schema that was nesting too deep.
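Before reaching for embedding-based clustering, a plain group-by over the dimensions named above often surfaces the largest clusters. This sketch assumes each failing row already carries a template name, a failure label, and the input length; the 4000-character bucket boundary is an arbitrary illustration.

```python
from collections import Counter

def top_failure_clusters(failures: list, k: int = 3) -> list:
    """Count failing rows by (template, failure_kind, input-length bucket)."""
    def bucket(n_chars: int) -> str:
        return "long" if n_chars > 4000 else "short"  # hypothetical cut-off

    counts = Counter(
        (row["template"], row["failure_kind"], bucket(row["input_chars"]))
        for row in failures
    )
    return counts.most_common(k)

# Made-up failing rows.
failures = [
    {"template": "invoice", "failure_kind": "truncated", "input_chars": 9000},
    {"template": "invoice", "failure_kind": "truncated", "input_chars": 7500},
    {"template": "ticket", "failure_kind": "enum_collapse", "input_chars": 300},
    {"template": "invoice", "failure_kind": "truncated", "input_chars": 12000},
    {"template": "ticket", "failure_kind": "enum_collapse", "input_chars": 450},
    {"template": "ticket", "failure_kind": "partial", "input_chars": 5000},
]
clusters = top_failure_clusters(failures)
print(clusters)
```

Each of the top keys maps to one candidate fix: here, truncation on long invoice inputs points at max_tokens or chunking, and enum collapse on short tickets points at the field description or the mode.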

from fi.evals import Evaluator
from fi.evals.templates import (
    EvaluateFunctionCalling,
    Completeness,
    Groundedness,
    CustomLLMJudge,
)

evaluator = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY)

results = evaluator.evaluate(
    eval_templates=[
        EvaluateFunctionCalling(),          # schema validity
        Completeness(),                      # optional-field population rate
        semantic_judge,                      # per-field semantic quality
        Groundedness(),                      # free-text fields grounded in input
    ],
    inputs=cases,
)

The CI gate is per-axis: EvaluateFunctionCalling >= 0.95, Completeness >= 0.85, ExtractionSemanticQuality >= 4.0 (on 1-5), Groundedness >= 0.80. A regression on any one axis fails the build on the axis that broke; one bisect, one fix.
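A per-axis gate of this kind is a few lines of plain Python once the mean scores are in hand. The sketch assumes the evaluator results have already been averaged into a dict keyed by axis name; failing_axes is an illustrative helper, and the sample scores are made up.

```python
GATES = {
    "EvaluateFunctionCalling": 0.95,
    "Completeness": 0.85,
    "ExtractionSemanticQuality": 4.0,  # 1-5 scale
    "Groundedness": 0.80,
}

def failing_axes(scores: dict, gates: dict = GATES) -> list:
    """Return the axes below their floor; a missing axis counts as a failure."""
    return [
        axis for axis, floor in gates.items()
        if scores.get(axis, float("-inf")) < floor
    ]

# Made-up mean scores from one CI run.
scores = {
    "EvaluateFunctionCalling": 0.99,
    "Completeness": 0.81,
    "ExtractionSemanticQuality": 4.2,
    "Groundedness": 0.90,
}
broken = failing_axes(scores)
print("failing axes:", broken)  # a CI script would exit non-zero when this list is non-empty
```

Naming the axis in the failure message is what keeps the bisect to one step: the build log says which contract broke, not just that something did.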

How Future AGI ships the structured-output eval stack

Three surfaces, one loop.

ai-evaluation SDK (Apache 2.0) ships EvaluateFunctionCalling for schema-validity, Completeness for optional-field coverage, Groundedness and ContextAdherence for free-text fields, plus CustomLLMJudge for per-field rubrics. 50+ pre-built evaluators on Turing models (LARGE / SMALL / FLASH), 20+ local heuristic metrics that run sub-second with zero API cost, and async submission via evaluator.submit(...).wait().

traceAI (Apache 2.0) ships the OpenTelemetry-native span tree where every structured-output call records fi.span.kind=LLM, the mode used (json_schema, json_mode, responseSchema, tool_use, outlines_fsm), the schema-compliance result, the retry count, and the parsed object. 50+ instrumentors across Python, TypeScript, Java, and C#, including OpenAIInstrumentor, AnthropicInstrumentor, GeminiInstrumentor, and the InstructorInstrumentor covered in the Instructor evaluation deep dive. EvalTag rules attach per-field rubrics to spans so the same scorecard runs offline in CI and live on sampled production spans.

Future AGI Platform ships self-improving evaluators tuned by reviewer feedback, in-product rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the eval stack: HDBSCAN soft-clustering over span embeddings in ClickHouse, then a Claude Sonnet 4.5 Judge with a 30-turn budget and eight span-tools writes one immediate_fix per cluster. On structured-output work the clusters tend to be mode-shaped: a strict-mode enum-collapse cluster on one template, a Gemini union-failure cluster on schemas with five-plus members, an Outlines truncation cluster on long-output prompts. Each cluster carries a 5-category 30-subtype taxonomy entry, a 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution), and an immediate_fix, usually a tightened schema, a one-shot, or a mode switch. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are roadmap.

Honest tradeoff: if you ship one mode against one provider and your schemas are shallow, the deterministic schema-validity check alone is enough. The three-layer stack earns its weight when you run multiple modes in production with deep schemas and enough volume that a 3 percent semantic-quality delta translates into real customer cost.

Anti-patterns to avoid

Five mistakes that recur on production structured-output programs.

Reporting only schema-validity rate. A 99.5 percent rate with enum collapse is worse than a 96 percent rate with honest answers. Pair the schema rate with a per-field semantic score every time.

Mode-blind benchmarking. “GPT-4o vs Claude 3.5 Sonnet on JSON extraction” is the wrong unit. The right unit is “GPT-4o on strict mode vs GPT-4o on plain JSON mode vs Claude on tool use vs Claude on prompt-shaped JSON.” Mode is part of the system; score it.

Five-field schemas. Your test schema has five fields. Your production schema has fifty, with three levels of nesting, two unions, and four optional fields. Long schemas degrade differently; mirror production depth, width, and union complexity.

No retry-cost audit. Constrained-decoding retries and Instructor retry-on-validation-error loops are invisible in the headline rate. Chart retry distribution per template; alert on shifts; treat retries as a cost regression even when the final pass rate is unchanged.

Cross-provider parity assumed. The schema that works on OpenAI strict will not work identically on Anthropic tool use or Gemini responseSchema. The parity check is one eval pass per mode. Skip it once and the next provider migration becomes a quarter of unscheduled work.

What to do this week

Five steps, one schema.

  1. Pick one response_model and one prompt template that runs in production today. Pull 200 sampled traces with the input and the structured output.
  2. Re-run those 200 inputs across all four modes (OpenAI strict, Anthropic tool use, Gemini responseSchema, and Outlines if you have an open-weight path). Score schema validity, Completeness, and one per-field CustomLLMJudge on each mode.
  3. Build the mode-vs-mode matrix. Spot the cells where the schema-validity rate is high and the semantic score is low. Those are the silent-failure cells.
  4. Wire traceAI so the production span carries the mode, the parse result, and the retry count. Add EvalTag rules so the per-field rubric runs on sampled production spans live.
  5. Turn on Error Feed. Watch the first week’s clusters. Promote representative rows into the regression set. Run a BayesianSearchOptimizer study on the template paying the largest quality tax, with the per-field judge score as the optimisation target.
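Step 3's matrix and its silent-failure cells can be sketched as follows. The cell values are invented, the 0.95 and 4.0 cut-offs are placeholders, and rescaling the 1-5 judge score to 0-1 before multiplying is an assumption made here so the product stays in a readable range.

```python
def composite(cell: dict) -> float:
    """schema_validity_rate x semantic_quality_on_passes, judge score rescaled from 1-5 to 0-1."""
    return cell["schema_validity_rate"] * (cell["semantic_on_passes"] - 1) / 4

def silent_failure_cells(matrix: dict, min_validity: float = 0.95, min_semantic: float = 4.0) -> list:
    """Cells that parse reliably but score low on semantics."""
    return sorted(
        key for key, cell in matrix.items()
        if cell["schema_validity_rate"] >= min_validity
        and cell["semantic_on_passes"] < min_semantic
    )

# Made-up matrix keyed by (mode, template).
matrix = {
    ("openai_strict", "ticket_triage"): {"schema_validity_rate": 0.994, "semantic_on_passes": 3.4},
    ("anthropic_tool_use", "ticket_triage"): {"schema_validity_rate": 0.97, "semantic_on_passes": 4.4},
    ("gemini_schema", "ticket_triage"): {"schema_validity_rate": 0.90, "semantic_on_passes": 4.1},
}
silent = silent_failure_cells(matrix)
ranked = sorted(matrix, key=lambda key: composite(matrix[key]), reverse=True)
print(silent)
print(ranked)
```

In this invented matrix the mode with the highest parse rate ranks last on the composite, which is exactly the pattern the headline number hides.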

The teams shipping reliable structured-output applications in 2026 stopped reporting “JSON parse rate” as the metric and started reporting schema_validity_rate × semantic_quality_on_passes per mode, per template. The mode gives you the shape for free. The eval stack tells you what the shape cost you.

Frequently asked questions

What are the four main LLM structured output modes in 2026?
Four shapes do the work in production. OpenAI strict (`response_format={'type': 'json_schema', 'json_schema': {'name': ..., 'strict': true, 'schema': ...}}`) compiles your JSON Schema into a constrained-decoding grammar; every token sampled has to keep the partial output schema-legal. Anthropic JSON mode is prompt-shaped with a tool-use fallback: there is no token-level constraint, the model is told to return JSON and you parse what comes back. Gemini schema mode (`responseSchema` + `responseMimeType='application/json'`) validates a subset of JSON Schema server-side and rejects responses that violate it. Outlines and JSONFormer apply finite-state-machine constraints over open-weight models like Llama and Mistral, masking the logits at every step so non-schema tokens cannot be sampled. The names converge on `'structured output'`. The mechanisms do not.
Why is schema validity a floor metric and not an eval?
Schema validity answers one question: is the output parseable against the contract. It does not answer whether the values are right. A `Customer(age=35)` returned for a prompt that never mentioned an age is schema-valid and semantically wrong. A `priority: 'urgent'` chosen because the model defaulted to the most-common enum value is schema-valid and semantically wrong. The eval that matters in production is `schema_validity_rate × semantic_quality_on_passes`, scored on your data, per mode. A mode that hits 99.9 percent schema validity by collapsing nuance into safe enum defaults is worse than a mode that hits 97 percent with sharper semantics. The headline number hides the trade.
What is the 'quality tax' of constrained decoding?
Constrained decoding masks logits to keep the output schema-legal, which means it sometimes blocks the token the model would have sampled if it were free. On easy prompts the tax is roughly zero — the schema-legal token and the model's first choice are the same token. On hard prompts (long reasoning, ambiguous enums, deep nesting, optional-field decisions) the tax shows up as flatter distributions and worse semantic answers. Our running estimate across customer workloads: OpenAI strict carries 0-3 percent semantic quality delta vs free-form on routine extraction, climbing to 10-15 percent on adversarial prompts with deep schemas. Outlines on small open-weight models can hit a 15-20 percent delta because the masking interacts with weaker base capability. The tax is not a constant; treat it as a per-schema, per-prompt-template number you measure.
When should I pick Outlines or JSONFormer over hosted strict mode?
Three cases. First, you are running open-weight models in-VPC and need schema guarantees that the base provider does not ship. Second, you need a grammar richer than JSON Schema (regular expressions, context-free grammars, custom finite-state automata) — Outlines supports all of them. Third, you cannot tolerate the API round trip to a hosted strict-mode endpoint and need the constraint to live next to your weights. The trade is that the quality tax is usually larger on weaker base models, and you own the engineering on the constrained-decoding loop. For hosted models with strict mode available, pay the API and skip the operational overhead.
How does Future AGI score structured-output modes?
Three layers in one pass. Schema validity runs free as a deterministic check (`EvaluateFunctionCalling` plus a JSON Schema parse). Per-field semantic rubrics run as `CustomLLMJudge` instances, one judge per field with non-trivial semantics, on the same 1-5 scale. Cross-field consistency runs as a pre-judge deterministic gate (`@model_validator` on Pydantic or a list of pure-Python assertions). Every result is logged through `traceAI` so the same per-mode, per-provider scorecard runs offline in CI and live on sampled production spans. Error Feed clusters the failures by field and writes an `immediate_fix` per cluster — usually a tightened schema constraint, a one-shot example, or a mode switch on the prompt template that is paying the largest quality tax.
What are the common failure modes per structured-output mode?
Four shapes recur. (1) Refused: the model returns a refusal string or a schema-shaped error envelope; strict modes raise, JSON mode silently embeds the refusal in a `summary` field. (2) Partial: required fields are populated, optional fields the model would have returned in free-form mode are silently dropped — most common on OpenAI strict and Gemini schema. (3) Truncated: a deep nested object stops mid-array because the response hit max_tokens; the JSON is invalid and the schema validator rejects it. (4) Schema-valid but semantically degraded: every field parses, the enum collapsed to a safe default, the numeric field returned the corpus mean. The first three look like errors. The fourth looks like success and is the one that ships.
What anti-patterns should I avoid when evaluating structured outputs?
Five. (1) Reporting only schema-validity rate; a 99 percent rate with collapsed enums is worse than a 95 percent rate with honest answers. (2) Mode-blind benchmarks; the same model on `json_mode` and `strict` are different systems with different failure profiles. (3) Five-field toy schemas; production runs fifty-field schemas with unions, nesting, and arrays — the failure modes only show up at depth. (4) No retry-cost audit; constrained-decoding retries on parse failure are invisible in the headline rate and double the bill on the failure tail. (5) Cross-provider parity assumed instead of measured; OpenAI strict, Anthropic JSON, and Gemini schema for the same Pydantic model behave differently enough that ports break silently.