What Is Agent-as-Judge?
An evaluation pattern where one AI agent scores another agent's plan, action sequence, or final answer against a rubric.
Agent-as-judge is an agent reliability pattern where one AI agent evaluates another agent’s plan, action sequence, tool choice, retry behavior, or final answer against a rubric. It extends LLM-as-a-judge from single prompt-response pairs to full multi-step trajectories inside an eval pipeline or production trace. A good judge agent checks whether the worker agent made forward progress, used safe tools, recovered from failures, respected policy, and completed the actual task, not just whether the final string looks plausible. FutureAGI records those judgments as eval scores connected to the traced agent run, so the same surface that shows the trajectory also shows where the judge flagged it.
In 2026 the pattern is no longer experimental. The original Agent-as-a-Judge paper formalized the trajectory-evaluation pattern in late 2024, and Anthropic, Google DeepMind, and OpenAI all now publish agent-as-judge results inside model cards for tool-using systems. The major frameworks (LangGraph, CrewAI, OpenAI Agents SDK, AutoGen, Google ADK, Strands) ship judge-agent recipes out of the box. The interesting questions have moved from “does it work” to “how do you calibrate it, how do you keep it from drifting, and how do you bind its verdicts back to traces.” Agent-as-judge is also one of the most-cited evaluation patterns on τ-bench, SWE-Bench Verified, and GAIA leaderboards because objective grading on those benchmarks usually requires inspecting the trajectory, not just the final answer, which is exactly the unit of analysis agent-as-judge operates on.
Why agent-as-judge matters in production LLM and agent systems
Agent failures rarely look like one bad sentence. They look like a plausible plan, a wrong tool call, a stale observation, two retries, and a final answer that sounds complete but skipped the real objective. A single-response judge can miss that chain because it sees only the last output. Agent-as-judge evaluates the whole agent trajectory, so it can flag “right answer, unsafe path” and “good plan, failed execution” separately. That distinction is what release gates and incident triage actually need.
Ignoring this pattern creates silent automation risk. A support agent may refund the wrong order after selecting the wrong internal tool. A coding agent may edit files, skip tests, and report success, which is exactly the failure mode SWE-Bench Verified was designed to catch and exactly the mode that vanity benchmarks miss. A research agent may cite a source it never opened. A finance agent may execute a transfer against a stale account. Developers feel this as flaky regression tests. SREs see long traces, retry bursts, and p99 latency jumps without an obvious cause. Product teams see confused users who cannot explain which step failed. Compliance teams worry when the agent took a high-impact action without approval evidence, because the audit trail shows “success” but the underlying tool call was wrong.
This matters especially for 2026-era agent systems because the execution surface now spans tool servers, MCP-connected resources, A2A protocol handoffs to remote agents, browser actions, and long-running workflows that run for days. Unlike Ragas faithfulness, which focuses narrowly on answer grounding for retrieval workflows, agent-as-judge asks whether the agent behaved correctly over time, across tools, across handoffs, and across retries. Unlike OpenAI’s Evals, which still defaults to single-response grading, agent-as-judge inside FutureAGI binds verdicts to span ids so the engineer can click from the failed score to the exact step that produced it. The unit of evaluation is no longer a response; it is the run.
The economic argument is also clearer in 2026 than it was earlier. Human review of multi-step agent runs costs roughly twenty to fifty times what human review of single-response chat costs, because the reviewer has to read every step, replay tool calls, and reconstruct state. A calibrated judge agent does the same work at a fraction of the cost, but only when its outputs are tied to observable metrics, audited for drift, and compared regularly against held-out human review. Without that calibration scaffolding, agent-as-judge is just a more expensive LLM hallucinating about another LLM’s hallucination.
The audience for agent-as-judge has also widened. In early 2024 it was mostly a research-lab pattern; by mid-2026 it is a standard release-gate component for any team running production agentic AI. The most mature stacks combine three judge layers: a fast, cheap judge for live traffic sampling (low-cost model, narrow rubric); a deeper judge for nightly regression evals over golden datasets; and a human-in-the-loop calibration cohort that audits both judges every week. The cost of running the cheap judge against 100% of production traffic has dropped enough (frontier-class judging is now possible at a few hundredths of a cent per trace) that “judge every trace, not just sampled ones” is a reasonable default for safety-critical workflows.
Judge surfaces, scored against what
The table below maps the most common judge surfaces in 2026 production stacks to what they should score and which FutureAGI evaluator runs as the anchor.
| Judge surface | What it scores | Triangulation signal | FutureAGI anchor |
|---|---|---|---|
| Plan review | Quality of the initial plan vs goal | TaskCompletion on the final run | CustomEvaluation rubric |
| Tool-call review | Whether the right tool was selected at each step | Per-step ToolSelectionAccuracy | ToolSelectionAccuracy |
| Trajectory review | End-to-end path quality, retries, recovery | TrajectoryScore, step count | TrajectoryScore |
| Safety review | Policy violations, unsafe tool use, refusal correctness | Toxicity, PII, PromptInjection | CustomEvaluation + safety evals |
| Output review | Final answer quality given trajectory and goal | Faithfulness, AnswerRelevancy | CustomEvaluation rubric |
| Escalation review | When the agent should have asked for human help | Human-escalation rate, follow-up rate | CustomEvaluation rubric |
| Multi-agent review | Quality of A2A handoffs and sub-agent results | Per-hop TaskCompletion, AgentCard drift | TrajectoryScore + TaskCompletion |
The shared property across these surfaces: a judge agent score is only as useful as the objective signal it is triangulated against. Run the judge alone and you get an unverified opinion; pair it with TaskCompletion, ToolSelectionAccuracy, and TrajectoryScore and you get an audit trail.
How FutureAGI handles agent-as-judge
FutureAGI’s approach is to make the judge output traceable, reproducible, and comparable to objective agent metrics. The FutureAGI surface is CustomEvaluation: an engineer defines a judge rubric as code or a decorator, configures the judge model (typically from a different family than the worker, for example Claude Opus 4.7 judging a GPT-5-driven agent, or vice versa), and runs the evaluator against an agent run captured through traceAI. For an OpenAI Agents SDK workflow, the traceAI-openai-agents integration records each reasoning step and tool call with agent.trajectory.step, gen_ai.tool.name, gen_ai.tool.call.arguments, gen_ai.tool.call.result, latency, model used, and the parent node id, so the judge has the full trajectory to score.
A real workflow looks like this. A travel-booking agent plans an itinerary, calls flight and hotel search tools, asks a payment sub-agent for authorization over A2A, and returns a confirmation. The judge agent gets the user goal, the trajectory, every tool call and its result, the policy rubric, and the final answer. The rubric says: “Score 0–1 for goal completion, unsafe-action avoidance, evidence use, refund-policy compliance, and correct escalation when budget is exceeded.” CustomEvaluation stores the judge score, the per-criterion sub-scores, and a free-text reason. ToolSelectionAccuracy independently checks whether the flight, hotel, and payment tools were selected correctly. TrajectoryScore summarizes whether the sequence of steps moved toward the goal with reasonable efficiency.
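The five rubric criteria in this example are easier to act on when they stay separate rather than being averaged away. The sketch below is plain Python, not the CustomEvaluation API: the criterion keys, the `summarize_sub_scores` helper, and the 0.85 per-criterion threshold are all hypothetical names and values chosen for illustration.

```python
# Hypothetical criterion keys derived from the rubric text in the example above.
CRITERIA = (
    "goal_completion",
    "unsafe_action_avoidance",
    "evidence_use",
    "refund_policy_compliance",
    "correct_escalation",
)


def summarize_sub_scores(sub_scores, threshold=0.85):
    """Return (mean score, criteria below threshold) for one judged run.

    The threshold is an assumed value, not a FutureAGI default.
    """
    missing = [c for c in CRITERIA if c not in sub_scores]
    if missing:
        raise ValueError(f"judge did not return scores for: {missing}")
    mean = sum(sub_scores[c] for c in CRITERIA) / len(CRITERIA)
    failing = [c for c in CRITERIA if sub_scores[c] < threshold]
    return mean, failing


# A run can look healthy on average while failing one criterion outright.
example = {
    "goal_completion": 1.0,
    "unsafe_action_avoidance": 0.5,
    "evidence_use": 1.0,
    "refund_policy_compliance": 1.0,
    "correct_escalation": 1.0,
}
mean, failing = summarize_sub_scores(example)
```

Here the mean clears the threshold, but `failing` still names the unsafe-action criterion, which is the case a single blended score would hide.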
The engineer does not stop at the judge score. They set a release gate: “judge_score >= 0.85 and ToolSelectionAccuracy >= 0.9 on the golden dataset, and judge-human disagreement rate < 12% on the last 100 calibration samples.” In production, a low judge score can open an alert, route the trace to annotation, trigger a model fallback inside Agent Command Center, or add the run to a regression eval cohort. The judge score also feeds back into agent observability dashboards keyed by agent.trajectory.step, so the team can answer “which steps did the judge flag, in which release, and on which cohort.”
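The release gate quoted above can be written down as a small predicate. This is a minimal sketch in plain Python under stated assumptions: `GateInputs` and `release_gate` are hypothetical names, the scores are assumed to be dataset-level means already computed elsewhere, and calibration results are assumed to arrive as a list of booleans ordered oldest first.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    judge_score: float              # mean judge score on the golden dataset
    tool_selection_accuracy: float  # mean ToolSelectionAccuracy on the golden dataset
    calibration_agreements: list    # True where judge and human agreed, oldest first


def release_gate(inputs, min_judge=0.85, min_tool=0.9,
                 max_disagreement=0.12, window=100):
    """Return (passed, reasons) for the three-part gate described in the text."""
    recent = inputs.calibration_agreements[-window:]
    if not recent:
        # Without calibration data the judge score cannot be trusted as a gate.
        return False, ["no calibration samples available"]

    disagreement = recent.count(False) / len(recent)
    reasons = []
    if inputs.judge_score < min_judge:
        reasons.append(f"judge_score {inputs.judge_score:.2f} < {min_judge}")
    if inputs.tool_selection_accuracy < min_tool:
        reasons.append(
            f"ToolSelectionAccuracy {inputs.tool_selection_accuracy:.2f} < {min_tool}"
        )
    if disagreement >= max_disagreement:
        reasons.append(
            f"judge-human disagreement {disagreement:.2%} >= {max_disagreement:.0%}"
        )
    return not reasons, reasons


good = GateInputs(0.90, 0.95, [True] * 95 + [False] * 5)
drifting = GateInputs(0.90, 0.95, [True] * 85 + [False] * 15)
good_result = release_gate(good)
drifting_result = release_gate(drifting)
```

Returning the list of reasons alongside the boolean keeps the gate debuggable: a blocked release says which of the three conditions failed instead of only reporting a failure.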
In our 2026 evals at FutureAGI, we have found that the most common judge failure mode is verbosity bias: judges reward longer chains of thought even when they are wrong. Unlike DeepEval’s default G-Eval scorer, which sums sub-criteria into one number, FutureAGI’s CustomEvaluation rubric returns each criterion independently and exposes the prompt the judge saw, so we can diff judge prompts across versions and detect when a rubric drifts toward rewarding length over correctness. The second most common failure mode is judge-worker collusion: when the judge and the worker are the same model family, the judge systematically over-scores the worker’s reasoning. The fix is family separation by default: judges from one frontier family (Anthropic, OpenAI, Google) judging workers from another. The third pattern is judge drift over time, which the calibration cohort catches by measuring judge-human disagreement weekly. We treat any judge that crosses a configured drift threshold as a judge model that needs re-prompting or re-calibration before its scores can gate releases again.
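Two of these safeguards, family separation and the drift threshold, are simple enough to enforce in configuration checks. The sketch below is an illustration only: the name-prefix mapping, the helper names, and the 0.12 threshold are assumptions, and a real deployment would read the model family from its own model registry rather than guessing from a name.

```python
# Hypothetical mapping from model-name prefix to vendor family.
FAMILY_PREFIXES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
}


def model_family(model_name):
    """Best-effort family lookup from a model name; 'unknown' if unrecognized."""
    lowered = model_name.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if lowered.startswith(prefix):
            return family
    return "unknown"


def is_cross_family(judge_model, worker_model):
    """True only when both families are known and they differ."""
    judge_family = model_family(judge_model)
    worker_family = model_family(worker_model)
    if "unknown" in (judge_family, worker_family):
        return False  # refuse to assume separation we cannot verify
    return judge_family != worker_family


def judge_needs_recalibration(weekly_disagreement, threshold=0.12):
    """True when the most recent weekly judge-human disagreement exceeds the threshold."""
    return bool(weekly_disagreement) and weekly_disagreement[-1] > threshold


separated = is_cross_family("claude-opus-4.7", "gpt-5")
same_family = is_cross_family("gpt-5", "gpt-5")
drifted = judge_needs_recalibration([0.08, 0.09, 0.15])
```

Treating an unrecognized model as "not separated" is a deliberately conservative choice: the check fails closed instead of letting an unlisted model through.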
For pre-production, agent-as-judge runs inside the evaluate workflow against a golden dataset. For live traffic, the same CustomEvaluation runs on sampled production traces visible in the tracing surface. Together they form a closed loop: every release is judged against a known cohort, every production hour is sampled, and every judge regression triggers either re-calibration or a worker rollback.
A second concrete example. A coding agent built on the OpenAI Agents SDK fixes issues across SWE-Bench Verified-style problems. The judge agent reviews the full run: did the planner identify the right files? Did the executor write tests before edits? Did the agent run the test suite after every change? Did it stop when tests passed? CustomEvaluation returns per-criterion sub-scores; ToolSelectionAccuracy checks per-step tool choice (read_file, write_file, run_tests); TrajectoryScore aggregates path quality. The team observes that judge scores drop sharply for problems requiring more than 8 file edits, while objective TaskCompletion stays high. That is a clear length-bias signal, the mirror image of verbosity bias: here the judge penalizes longer trajectories even though they are correct. The fix is rubric-level: explicitly tell the judge that step count is not a quality signal unless paired with diminishing returns on test pass rate. The next calibration cohort restores judge-human agreement above 90%.
Engineers also use agent-as-judge as a feedback loop into agent memory and prompt design. When the judge consistently flags reasoning errors at a particular planning node, the team adjusts the planner prompt; when it flags retrieval misses, the team revisits the agentic RAG configuration; when it flags unsafe tool use, the team tightens guardrails at the Agent Command Center layer. The judge becomes a generator of regression hypotheses, not just a release gate.
How to measure or detect agent-as-judge quality
Measure agent-as-judge as an eval layer plus calibration checks, never as a standalone signal:
- CustomEvaluation: runs the judge rubric and returns a score, pass/fail decision, and reason for the judged trajectory. The judge model should be from a different family than the worker.
- ToolSelectionAccuracy: objective triangulation; independently checks whether the worker agent chose the right tool at each step.
- TrajectoryScore: summarizes goal progress, step quality, and end-to-end trajectory health; pair with the judge to detect cases where the judge over-scores poor trajectories.
- TaskCompletion: the ultimate triangulation; a judge that says the run was great while TaskCompletion says 0.3 is a judge that needs re-calibration.
- Faithfulness and Groundedness: when the agent cites sources or operates on retrieved context, run these alongside the judge.
- agent.trajectory.step: the trace attribute that lets dashboards slice judge results by step number, tool name, and retry count.
- eval-fail-rate-by-cohort: dashboard signal for judge failures sliced by model version, prompt version, user segment, or tool set.
- judge-human disagreement rate: the most important calibration signal; sample N judge passes and failures per week and compare against human review. Re-prompt or re-train when disagreement crosses a threshold.
- judge-verbosity correlation: track whether longer trajectories correlate with higher judge scores; positive correlation is a red flag for verbosity bias.
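The last two signals in this list are ordinary statistics and can be computed without any evaluation framework. The following is a minimal sketch using only the standard library; the function names and the sample data are hypothetical, and the inputs are assumed to be aligned lists exported from a weekly calibration sample.

```python
import math


def disagreement_rate(judge_pass, human_pass):
    """Fraction of sampled runs where the judge verdict differs from the human verdict."""
    if not judge_pass or len(judge_pass) != len(human_pass):
        raise ValueError("need two non-empty lists of equal length")
    mismatches = sum(j != h for j, h in zip(judge_pass, human_pass))
    return mismatches / len(judge_pass)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


# Illustrative sample: four calibrated runs and their trajectory lengths.
judge_verdicts = [True, True, False, True]
human_verdicts = [True, False, False, True]
step_counts = [4, 12, 6, 9]
judge_scores = [0.70, 0.95, 0.60, 0.90]

rate = disagreement_rate(judge_verdicts, human_verdicts)
length_correlation = pearson(step_counts, judge_scores)
```

A strongly positive `length_correlation` on runs that humans scored similarly is the pattern to investigate; on its own, correlation does not prove bias, since harder tasks may legitimately need more steps.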
Minimal Python pairing:
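from fi.evals import CustomEvaluation, ToolSelectionAccuracy, TrajectoryScore
# agent_rubric, task, agent_steps, and final_answer come from your own pipeline.
judge = CustomEvaluation(name="agent_judge", rubric=agent_rubric)
tool = ToolSelectionAccuracy()
trajectory = TrajectoryScore()
# Score the same run three ways: rubric judge, tool choice, trajectory health.
j = judge.evaluate(input=task, trajectory=agent_steps, output=final_answer)
t = tool.evaluate(trajectory=agent_steps)
ts = trajectory.evaluate(trace=agent_steps, goal=task)
# Read the judge score next to the objective scores, with the judge's reason.
print(j.score, t.score, ts.score, j.reason)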
A healthy agent-as-judge deployment has stable judge-human agreement (typically above 85%), low correlation between trajectory length and judge score, and judge verdicts that move in the same direction as TaskCompletion across releases. When all three properties hold, the judge can gate releases. When any one of them breaks, the judge needs re-prompting before its scores are trusted again. The same scores drive agent observability dashboards and downstream retrieval-augmented generation improvements when the judge flags grounding failures inside the worker’s retrieval steps.
To calibrate the judge against human labels and pipe disagreements into an AnnotationQueue, chain the judge with TaskCompletion and route low-confidence verdicts to reviewers:
from fi.evals import CustomEvaluation, TaskCompletion, AnnotationQueue
judge = CustomEvaluation(name="agent_judge", rubric=agent_rubric)
ground_truth = TaskCompletion()
queue = AnnotationQueue(name="agent-judge-calibration")
for trace in sampled_production_traces():
    j = judge.evaluate(trajectory=trace.spans, output=trace.final)
    g = ground_truth.evaluate(trajectory=trace.spans)
    if abs(j.score - g.score) > 0.2 or j.confidence < 0.6:
        queue.enqueue(trace_id=trace.id, judge=j, ground_truth=g)
# Weekly: pull labeled items, recompute judge-human agreement, refresh the rubric.
Common mistakes
- Judging only the final answer. Many agent failures live in the path: unsafe tool use, skipped verification, hidden retries, or ignored observations. Judge the trajectory.
- Using one judge without calibration. Compare judge scores against human review on a held-out cohort weekly before using them as deployment gates.
- Mixing safety and task success in one vague score. Separate completion, tool correctness, policy compliance, and escalation behavior into independent rubric criteria so each can be acted on.
- Letting the judge see hidden labels. If the judge receives gold answers or policy hints unavailable in production, scores will overstate reliability. Strip oracle data before judging.
- Treating the judge as ground truth. Agent-as-judge is a scalable signal, not proof; audit disagreements and drift over time and never gate safety-critical paths on the judge alone.
- Same-family judging. A GPT-5 worker scored by a GPT-5 judge inflates scores. Go cross-family by default: Claude judges OpenAI, Gemini judges Anthropic, and so on.
- Ignoring verbosity and position bias. Judges reward longer answers and answers in the first position when shown pairs. Randomize order and length-normalize where possible.
- No drift monitoring. Judge behavior drifts as the underlying models update, even when the prompt is unchanged. Re-run the calibration cohort weekly and alert when disagreement crosses the configured threshold.
- Skipping the trace anchor. Judge scores that are not bound to span ids cannot be debugged. Always attach the verdict to the trace.
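The "hidden labels" mistake is one of the easier ones to guard against mechanically. Below is a minimal sketch that removes oracle fields from a trace payload before it reaches the judge. The key names in `ORACLE_KEYS` and the trace shape are hypothetical; substitute whatever fields your golden dataset actually attaches.

```python
# Hypothetical field names for data the judge would not see in production.
ORACLE_KEYS = {"gold_answer", "expected_tool", "policy_hint"}


def strip_oracle(payload):
    """Return a copy of a nested dict/list payload with oracle fields removed."""
    if isinstance(payload, dict):
        return {
            key: strip_oracle(value)
            for key, value in payload.items()
            if key not in ORACLE_KEYS
        }
    if isinstance(payload, list):
        return [strip_oracle(item) for item in payload]
    return payload  # scalars pass through unchanged


golden_trace = {
    "goal": "book a flight",
    "gold_answer": "AA100",
    "steps": [
        {"tool": "flight_search", "expected_tool": "flight_search"},
        {"tool": "payment", "policy_hint": "requires approval"},
    ],
}
judge_input = strip_oracle(golden_trace)
```

The function builds new containers instead of mutating its input, so the golden record keeps its labels for the objective evaluators while the judge only sees what it would see in production.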
Frequently Asked Questions
What is agent-as-judge?
Agent-as-judge is an evaluation pattern where one AI agent reviews another agent's plan, tool choices, intermediate steps, or final answer against a rubric. The judge sees the trajectory, not just the final reply.
How is agent-as-judge different from LLM-as-a-judge?
LLM-as-a-judge usually scores one prompt-response pair. Agent-as-judge scores a trajectory: goals, reasoning steps, tool calls, observations, retries, and the final answer. That makes it useful for any multi-step agent workflow.
How do you measure agent-as-judge quality?
In FutureAGI, use CustomEvaluation for the judge rubric, then triangulate against ToolSelectionAccuracy, TrajectoryScore, TaskCompletion, and trace fields such as agent.trajectory.step. Always calibrate the judge against human review on a held-out cohort.