What Is JSON Validation Metric?
An evaluation signal that returns pass or fail based on whether an LLM output parses as JSON and conforms to a target JSON Schema.
The JSON validation metric is an LLM-evaluation signal that returns whether a model’s output is parseable JSON and conforms to a target JSON Schema. It is deterministic — no judge model, no embedding similarity, just json.loads plus schema checks. Engineering teams attach it to every span where a tool-using agent, function-calling endpoint, or structured-output API is supposed to emit JSON, then track invalid-JSON-rate as a release gate. It is the canonical regression check that catches the most common silent break in LLM applications: a prompt or model change that produces malformed output for a small fraction of traffic.
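Under the hood the check is mechanically simple. A minimal sketch of the same two-step logic using only the standard json module and the jsonschema package (not the FutureAGI SDK, whose evaluator is shown later):
import json
from jsonschema import ValidationError, validate

def check(output: str, schema: dict) -> tuple[bool, str]:
    # Step 1: does the output parse as JSON at all?
    try:
        payload = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"parse error: {exc}"
    # Step 2: does the parsed payload conform to the target JSON Schema?
    try:
        validate(instance=payload, schema=schema)
    except ValidationError as exc:
        return False, f"schema error: {exc.message}"
    return True, ""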
Why JSON Validation Metric Matters in Production LLM and Agent Systems
A function-calling endpoint that emits malformed JSON 1.5% of the time looks fine in offline tests and breaks downstream services in production. The downstream pipeline either crashes, retries, or — worst case — silently drops the row, all of which reach the user as a vague “something went wrong.” The failure mode is invisible without a JSON validation eval because the LLM looks “successful” from its own perspective: it produced text, the request returned 200, the trace closed cleanly. Only the consumer of that text knows the contract was violated.
Backend engineers feel this first when a downstream parser raises. SREs see retry storms when an upstream LLM suddenly fails schema and the agent loops. Product managers see conversion drops on flows that depend on structured-output reliability — checkout JSON, search filters, calendar event creation. Compliance teams care because PII redaction depends on schema-correct outputs flowing into the redactor.
In 2026 agent stacks the impact compounds. A multi-step agent might emit five tool calls per trace, each with its own JSON-schema contract: planner output, tool args, retriever filter, critique-step JSON, final response envelope. One invalid field in step two corrupts steps three through five. A trajectory-level JSON validation policy catches this; a single end-to-end answer eval will not. This is why JSONValidation runs both per-step in offline regression and live on production spans, and why model-context-protocol (MCP) tool calls are validated server-side before the agent advances.
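A trajectory-level policy is easy to sketch: validate each step's payload against that step's schema and record the first violation, rather than only scoring the final answer. The step names and schemas below are hypothetical, and check() is the helper from the sketch above:
# Hypothetical per-step contracts for one agent trace.
step_schemas = {
    "planner": {"type": "object", "required": ["plan"]},
    "tool_args": {"type": "object", "required": ["query"]},
    "final_envelope": {"type": "object", "required": ["answer"]},
}

def first_invalid_step(trajectory: list[tuple[str, str]]) -> tuple[str, str] | None:
    # trajectory: ordered (step_name, raw_llm_output) pairs from one trace.
    for step_name, raw_output in trajectory:
        ok, reason = check(raw_output, step_schemas[step_name])
        if not ok:
            return step_name, reason  # the step where the contract broke
    return None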
How FutureAGI Handles JSON Validation Metric
FutureAGI’s approach is to make JSON validation cheap to wire in everywhere structured output exists. The fi.evals.JSONValidation evaluator takes a JSON Schema and returns a boolean per row, with the parsing error string in result.reason when the row fails. The lighter-weight fi.evals.ContainsJson evaluator returns true if a parseable JSON blob exists anywhere in the output text — useful for free-form replies that should embed structured data. For partial-credit cases, fi.evals.FieldCompleteness measures the share of expected schema fields that are populated, which is more useful than a single pass/fail when an agent is expected to fill 12 fields and only managed 11. Unlike OpenAI Structured Outputs, which constrains generation at one provider boundary, this eval checks the actual persisted payload across providers and agent frameworks.
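The two lighter evaluators can be wired in the same way. The snippet below assumes they follow the same constructor-plus-evaluate pattern as the JSONValidation example further down; the argument names are illustrative assumptions, not confirmed SDK signatures:
from fi.evals import ContainsJson, FieldCompleteness

# Assumed to mirror the JSONValidation pattern; argument names are illustrative only.
embedded = ContainsJson().evaluate(
    output='Sure, here is the record: {"name": "Ada", "age": 36}'
)
# schema: the name/age contract defined in the example below.
coverage = FieldCompleteness(schema=schema).evaluate(output='{"name": "Ada"}')

print(embedded.score)   # expected truthy: a parseable JSON blob is embedded in the text
print(coverage.score)   # expected fractional: 1 of 2 expected fields populated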
Concretely: a team building a structured-extraction endpoint loads a Dataset of representative inputs, attaches JSONValidation with their target schema via Dataset.add_evaluation, and gates every prompt or model change on invalid-JSON-rate staying under 0.5%. In production, the same evaluator runs against llm.output.text on every span emitted by traceAI-openai or traceAI-langchain, and writes the boolean back as a span_event. A dashboard slices invalid-JSON-rate by route, prompt version, and model — so when a prompt edit lands and the rate spikes on one route, the team rolls back before users see retries. FutureAGI’s view is that structured-output reliability is observability, not just an offline check.
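A minimal sketch of that offline gate, assuming per-row results shaped like the evaluator example below (a truthy result.score on pass); rows, run_model, and the assert standing in for a CI failure are placeholders, and the Dataset.add_evaluation wiring is omitted:
from fi.evals import JSONValidation

evaluator = JSONValidation(schema=schema)  # target schema as in the example below

# rows: representative inputs; run_model: the candidate prompt/model under test (placeholders).
results = [evaluator.evaluate(output=run_model(row)) for row in rows]

invalid_rate = sum(1 for r in results if not r.score) / len(results)

# Release gate from the text: block the change if invalid-JSON-rate exceeds 0.5%.
assert invalid_rate <= 0.005, f"invalid-JSON-rate {invalid_rate:.2%} exceeds the 0.5% gate"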
How to Measure JSON Validation Metric
Pick the evaluator that matches the contract — strict schema, contains-JSON, or field completeness — and attach it where structured output should appear:
- fi.evals.JSONValidation: returns pass/fail against a JSON Schema; the canonical structured-output regression check.
- fi.evals.ContainsJson: returns true if the output contains any parseable JSON; use for free-form chat that should embed JSON.
- fi.evals.FieldCompleteness: returns the share of required fields populated; useful for agent extraction tasks.
- Invalid-JSON-rate by cohort (dashboard signal): the share of spans failing JSON validation, sliced by route, model, and prompt version.
- Retry-storm signal: paired with downstream parser-error counts — JSON validation failure plus a retry spike is incident-grade.
Minimal Python:
from fi.evals import JSONValidation

# Target contract: both fields required, strictly typed.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

# Evaluate a single output row against the schema.
result = JSONValidation(schema=schema).evaluate(
    output='{"name": "Ada", "age": 36}'
)

# score is the pass/fail signal; reason carries the parse or schema error on failure.
print(result.score, result.reason)
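For a failing row, the same call should surface the problem in result.reason as described above; the exact wording of the reason string will depend on the evaluator:
# Missing the required "age" field: expect a falsy score and a schema error in reason.
bad = JSONValidation(schema=schema).evaluate(output='{"name": "Ada"}')
print(bad.score, bad.reason)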
Common mistakes
- Validating with ContainsJson when the contract requires a strict schema. A response can contain JSON and still violate the schema; use JSONValidation for contracts.
- Skipping schema versioning. When the schema evolves, old golden-dataset rows fail under the new schema; version both together.
- Treating invalid-JSON-rate as a global metric. Slice by route, model, and prompt version — drift usually concentrates on one of those axes.
- Letting the LLM “fix” malformed JSON in a retry without logging. Silent self-repair masks the real failure rate; log every retry as a JSON validation failure.
- No alert threshold. A JSON validation eval that runs but never pages anyone is a vanity metric; set an SLO like 99.5% pass-rate per route.
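A minimal sketch of that last point, grouping per-span validation outcomes by route and flagging any route below a 99.5% pass-rate SLO; the spans records here are hypothetical stand-ins for whatever the dashboard or alerting pipeline already collects:
from collections import defaultdict

SLO = 0.995  # per-route pass-rate target from the text

# spans: hypothetical (route, passed_json_validation) records collected in production.
by_route: dict[str, list[bool]] = defaultdict(list)
for route, passed in spans:
    by_route[route].append(passed)

for route, outcomes in by_route.items():
    pass_rate = sum(outcomes) / len(outcomes)
    if pass_rate < SLO:
        print(f"ALERT {route}: pass-rate {pass_rate:.3%} below SLO {SLO:.1%}")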
Frequently Asked Questions
What is the JSON validation metric?
The JSON validation metric is a deterministic LLM-eval that returns pass/fail based on whether an output parses as JSON and conforms to a target JSON Schema. It is the canonical structured-output regression check.
How is JSON validation different from contains-JSON?
Contains-JSON only checks that the output text has a parseable JSON object somewhere inside. JSON validation parses the full output and validates it against a JSON Schema with strict typing, required fields, and enum rules.
How do you measure JSON validation in production?
FutureAGI's `fi.evals.JSONValidation` evaluator returns a boolean per row against a JSON Schema. Attach it to every span where structured output is required; alert on invalid-JSON-rate by route and prompt version.