Evaluation

What Is a Pass/Fail Eval?

An evaluation pattern that converts an LLM output check into a binary pass or fail verdict for CI, tracing, or runtime gating.

A pass/fail eval is an LLM-evaluation check that converts model behavior into one binary verdict: pass or fail. It appears in eval pipelines, CI gates, production traces, and agent-runtime controls. The verdict may come from a boolean evaluator like JSONValidation, or from a scored evaluator like Groundedness compared with a threshold. In FutureAGI, pass/fail evals decide whether a prompt, model, RAG answer, tool call, or agent trajectory is acceptable enough to ship or return.
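
Both styles collapse into the same binary shape once a rule is attached. A minimal sketch, using illustrative result fields (passed, score) rather than the exact fi.evals return types:

def to_verdict(result, threshold=None):
    # Boolean evaluators such as a JSON/schema check already carry a verdict.
    if threshold is None:
        return bool(result.passed)
    # Scored evaluators such as groundedness need an explicit cutoff.
    return result.score >= threshold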

Why Pass/Fail Evals Matter in Production

The concrete failure mode is shipping a response that everyone knew how to score but that nobody wired into a decision. A RAG system may show a weak Groundedness score yet still answer the user with an unsupported refund policy. An agent may deliver a complete final message yet fail ToolSelectionAccuracy because it called the billing tool for a support-status question. A structured-output flow may return malformed JSON and break the downstream workflow several minutes later.

The pain spreads across teams. Developers lose release confidence because “quality improved” is not a deployable condition. SREs see retries, rising fallback traffic, and longer p99 latency but cannot tell whether the extra calls protected users or hid failures. Compliance and product teams feel the damage when bad answers reach customers because the eval dashboard was advisory only.

Pass/fail evals matter more for 2026-era agentic systems because every step may need its own local verdict. A planner step, a retrieval step, a tool call, and the final answer may each be judged by different checks. The system should not wait for a human to inspect a chart after the fact. Unlike a Ragas faithfulness score left as a dashboard metric, a pass/fail eval has an explicit owner, threshold, failure reason, and action: block the release, page the team, route to fallback, or send the trace to review.
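
That last sentence is the practical definition: a verdict is a metric plus everything needed to act on it. A minimal sketch of that bundle; the field names are illustrative, not a FutureAGI schema:

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PassFailGate:
    evaluator: str                    # e.g. "Groundedness"
    threshold: Optional[float]        # None for boolean evaluators
    owner: str                        # team that gets paged on failure
    on_fail: Callable[[str], None]    # block release, route to fallback, alert

    def verdict(self, score: float, reason: str) -> bool:
        passed = score >= self.threshold if self.threshold is not None else bool(score)
        if not passed:
            self.on_fail(f"{self.evaluator} ({self.owner}): {reason}")
        return passed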

How FutureAGI Handles Pass/Fail Evals

FutureAGI’s approach is to treat the verdict as a control signal, not a report label. The core eval surface is fi.evals: JSONValidation evaluates schema compliance, Groundedness checks whether the answer is supported by provided context, TaskCompletion evaluates whether the requested job finished, and AggregatedMetric combines several evaluators into one gate. In offline workflows, Dataset.add_evaluation() runs those evaluators against a golden dataset and stores per-row pass/fail status with the score and failure explanation.

A real example: a support RAG agent built on traceAI-langchain has a release gate with three checks. JSONValidation must pass because downstream systems parse the answer object. Groundedness must be at least 0.85 because every policy claim needs support from retrieved documents. TaskCompletion must pass because the agent may cite correctly but still fail to resolve the user’s request. AggregatedMetric rolls those checks into one release verdict while preserving the failing sub-evaluator for debugging.
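
In plain Python the gate has roughly the shape below. This is not AggregatedMetric's actual API; it is only a sketch of how the three checks combine into one verdict while keeping the failing sub-evaluator visible, assuming each result exposes a passed flag or a score:

def release_gate(json_result, grounded_result, task_result):
    checks = {
        "JSONValidation": json_result.passed,            # boolean: downstream parsers need valid JSON
        "Groundedness": grounded_result.score >= 0.85,   # scored: policy claims need retrieval support
        "TaskCompletion": task_result.passed,            # boolean: the user's request was resolved
    }
    failed = [name for name, ok in checks.items() if not ok]
    return {"passed": not failed, "failed_evaluators": failed}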

When the gate fails, the engineer does not just see “bad run.” They see the failed evaluator, the trace span, the prompt version, the retrieved chunks, and the model route. In production, the same verdict can trigger a post-guardrail, a model fallback, or an alert on eval-fail-rate-by-cohort. The useful part is the join: one failing verdict points to the exact span and the next engineering action.
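
A sketch of that routing, consuming the gate output from the previous sketch. The helpers fallback_model, send_to_review, and page_oncall are hypothetical stand-ins for your own plumbing:

# Hypothetical plumbing; replace with your own fallback, review queue, and paging.
def fallback_model(span): ...
def send_to_review(span): ...
def page_oncall(span): ...

def handle_failure(verdict, span):
    failed = set(verdict["failed_evaluators"])
    if "JSONValidation" in failed:
        return fallback_model(span)   # malformed output: retry on a stricter route
    if "Groundedness" in failed:
        return send_to_review(span)   # unsupported claims: annotation queue
    return page_oncall(span)          # anything else: alert the gate owner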

How to Measure or Detect Pass/Fail Eval Quality

Track the verdict and the health of the verdict:

  • Pass rate by evaluator: percentage of traces passing JSONValidation, Groundedness, TaskCompletion, or an AggregatedMetric gate.
  • False-positive rate: acceptable outputs blocked by the eval; calibrate this on a human-labeled sample before using a hard gate.
  • False-negative rate: failing outputs that pass and later receive thumbs-downs, escalations, refunds, or manual QA flags.
  • Eval-fail-rate-by-cohort: failures grouped by route, model, prompt version, customer segment, and traceAI integration such as traceAI-langchain.
  • Failure-action coverage: percentage of failed verdicts that trigger a concrete action: CI block, alert, annotation queue, post-guardrail, or fallback.

Minimal Python:

from fi.evals import Groundedness

# query, answer, and chunks come from the run under test;
# block_release is a placeholder for whatever your CI gate does on failure.
evaluator = Groundedness()
result = evaluator.evaluate(input=query, output=answer, context=chunks)

# Turn the score into a verdict against the release threshold.
passed = result.score >= 0.85
if not passed:
    block_release(result.explanation)
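
Eval-fail-rate-by-cohort from the list above follows the same pattern once each trace record carries its verdict. A small sketch over plain dictionaries; the record fields are illustrative:

from collections import defaultdict

def fail_rate_by_cohort(traces, cohort_key="prompt_version"):
    # traces: iterable of dicts like {"prompt_version": "v12", "passed": False, ...}
    totals, fails = defaultdict(int), defaultdict(int)
    for trace in traces:
        cohort = trace[cohort_key]
        totals[cohort] += 1
        if not trace["passed"]:
            fails[cohort] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}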

Common Mistakes

  • Treating a dashboard score as a pass/fail eval. A chart is not a verdict until it has a cutoff, owner, and action.
  • Using one global pass rate. A 94% overall pass rate can hide a 70% pass rate on enterprise traffic or one prompt version.
  • Mixing unrelated failures into one boolean. Keep JSONValidation, Groundedness, and TaskCompletion visible before composing them with AggregatedMetric.
  • Skipping human calibration. Thresholded pass/fail evals need labeled examples, or the false-positive rate becomes invisible until users complain.
  • Failing only the final answer. Agent systems also need step-level verdicts for retrieval, tool selection, and unsafe actions.

Frequently Asked Questions

What is a pass/fail eval?

A pass/fail eval is an LLM-evaluation check that returns a binary acceptable/failing verdict. It can come from a boolean evaluator like JSONValidation or from a score such as Groundedness compared with a threshold.

How is a pass/fail eval different from a metric threshold?

A metric threshold is the cutoff value. A pass/fail eval is the complete decision: evaluator, threshold or boolean rule, failure reason, and the action taken when the verdict is fail.

How do you measure a pass/fail eval?

Run FutureAGI evaluators such as JSONValidation, Groundedness, or TaskCompletion and track pass rate, fail rate by cohort, and false-positive rate. Live traces from traceAI-langchain can show which evaluator failed on which span.