How is agent-as-judge different from LLM-as-a-judge?

LLM-as-a-judge usually scores one prompt-response pair. Agent-as-judge scores a trajectory: goals, reasoning steps, tool calls, observations, retries, and the final answer.

How do you measure agent-as-judge quality?

In FutureAGI, use CustomEvaluation for the judge rubric, then compare its scores with ToolSelectionAccuracy, TrajectoryScore, and trace fields such as agent.trajectory.step.

Agent-as-Judge: Definition & FutureAGI Guide (2026)

What is agent-as-judge?

Agent-as-judge is an agent evaluation pattern where one AI agent reviews another agent's plan, tool choice, intermediate step, or final answer against a rubric.

What Is Agent-as-Judge?

Agent-as-judge is an agent reliability pattern where one AI agent evaluates another agent’s plan, action sequence, tool choice, or final answer against a rubric. It extends LLM-as-a-judge from single responses to multi-step trajectories inside an eval pipeline or production trace. A good judge agent checks whether the worker agent made progress, used safe tools, recovered from failures, and completed the task. FutureAGI records those judgments as eval scores connected to the traced agent run.

Why It Matters in Production LLM and Agent Systems

Agent failures rarely look like one bad sentence. They look like a plausible plan, a wrong tool call, a stale observation, two retries, and a final answer that sounds complete but skipped the real objective. A single response judge can miss that chain because it sees only the last output. Agent-as-judge evaluates the whole trajectory, so it can flag “right answer, unsafe path” and “good plan, failed execution” separately.
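To make the difference concrete, here is a minimal sketch of a trajectory-level verdict that reports outcome and path findings separately. It is a rule-based stand-in for a judge agent, not a FutureAGI API: the `Step` and `Run` classes, their fields, and `judge_run` are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a worker agent's run (hypothetical schema)."""
    tool: str
    succeeded: bool
    approved: bool = True  # False if a high-impact action lacked approval


@dataclass
class Run:
    goal_met: bool  # did the final answer satisfy the task?
    steps: list[Step] = field(default_factory=list)


def judge_run(run: Run) -> dict[str, bool]:
    """Report outcome and path findings separately instead of one blended verdict."""
    return {
        "goal_met": run.goal_met,
        "unsafe_path": any(not s.approved for s in run.steps),
        "failed_execution": any(not s.succeeded for s in run.steps),
    }


# "Right answer, unsafe path": the output looks fine, the trajectory does not.
run = Run(
    goal_met=True,
    steps=[Step("lookup_order", True), Step("issue_refund", True, approved=False)],
)
verdict = judge_run(run)
print(verdict)
```

A judge that saw only the final output would have just `goal_met` to go on. The two path flags are what a trajectory view adds.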

Ignoring this pattern creates silent automation risk. A support agent may refund the wrong order after selecting the wrong internal tool. A coding agent may edit files, skip tests, and report success. A research agent may cite a source it never opened. Developers feel this as flaky regression tests. SREs see long traces, retry bursts, and p99 latency jumps. Product teams see confused users who cannot explain which step failed. Compliance teams worry when the agent took a high-impact action without approval evidence.

This is especially important for 2026-era agent systems because the execution surface now spans tool servers, MCP-connected resources, browser actions, sub-agents, and long-running workflows. Unlike Ragas faithfulness, which focuses on answer grounding for retrieval workflows, agent-as-judge asks whether the agent behaved correctly over time. The unit of evaluation is no longer a response. It is the run.

How FutureAGI Handles Agent-as-Judge

FutureAGI’s approach is to make the judge output traceable, reproducible, and comparable to objective agent metrics. The relevant FutureAGI evaluator is CustomEvaluation: an engineer creates a custom evaluator from a rubric or decorator, then runs it against an agent trajectory captured through traceAI. For an OpenAI Agents SDK workflow, the traceAI-openai-agents integration records each reasoning step and tool call under agent.trajectory.step, together with inputs, observations, tool names, latency, and final output.
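As a rough sketch of how a judge input might be assembled from trace data, the snippet below orders span-like records by their `agent.trajectory.step` attribute. Only that attribute name comes from the text above. The record shape, the other keys, and `build_trajectory` are assumptions for illustration and do not reflect the actual traceAI export format.

```python
from typing import Any

STEP_KEY = "agent.trajectory.step"  # attribute name taken from the text above


def build_trajectory(spans: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep spans that carry a step index and return them in step order."""
    stepped = [s for s in spans if STEP_KEY in s]
    return sorted(stepped, key=lambda s: s[STEP_KEY])


# Hypothetical exported spans; every key except STEP_KEY is made up for illustration.
spans = [
    {STEP_KEY: 2, "tool": "book_flight", "latency_ms": 840},
    {"name": "http.request"},  # no step index, so not part of the trajectory
    {STEP_KEY: 1, "tool": "search_flights", "latency_ms": 310},
]
trajectory = build_trajectory(spans)
print([s["tool"] for s in trajectory])
```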

A real workflow looks like this. A travel-booking agent plans an itinerary, calls search tools, asks a payment sub-agent for authorization, and returns a confirmation. The judge agent gets the goal, trajectory, tool results, and policy rubric: “Score 0-1 for goal completion, unsafe action avoidance, evidence use, and correct escalation.” CustomEvaluation stores the judge score and reason. ToolSelectionAccuracy checks whether the search and payment tools were selected correctly. TrajectoryScore summarizes whether the sequence of steps moved toward the goal.
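The four rubric dimensions in that example can be kept as separate scores rather than one number. The sketch below assumes the judge returns a plain dictionary with one 0-1 value per dimension. The key names, `validate_judge_scores`, and `weakest_dimension` are hypothetical helpers, not part of CustomEvaluation.

```python
RUBRIC_DIMENSIONS = (
    "goal_completion",
    "unsafe_action_avoidance",
    "evidence_use",
    "correct_escalation",
)


def validate_judge_scores(scores: dict[str, float]) -> dict[str, float]:
    """Require one 0-1 score per rubric dimension; reject anything else."""
    missing = [d for d in RUBRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    for dim in RUBRIC_DIMENSIONS:
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"{dim} score {scores[dim]} is outside 0-1")
    return {dim: float(scores[dim]) for dim in RUBRIC_DIMENSIONS}


def weakest_dimension(scores: dict[str, float]) -> tuple[str, float]:
    """Return the lowest-scoring dimension so reviewers know where to look first."""
    return min(scores.items(), key=lambda kv: kv[1])


# Illustrative judge output for one travel-booking run.
raw = {
    "goal_completion": 1.0,
    "unsafe_action_avoidance": 0.4,
    "evidence_use": 0.9,
    "correct_escalation": 0.8,
}
scores = validate_judge_scores(raw)
print(weakest_dimension(scores))
```

Surfacing the weakest dimension, instead of averaging, keeps a completed booking from hiding a low safety score.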

The engineer does not stop at the judge score. They set a release gate such as “judge_score >= 0.85 and ToolSelectionAccuracy >= 0.9 on the golden dataset.” In production, a low judge score can open an alert, route the trace to annotation, trigger a model fallback, or add the run to a regression eval. This keeps the agent-as-judge pattern from becoming another opaque model opinion.
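A release gate like the one quoted above could be sketched as follows. The thresholds come from the example. Averaging each metric over the golden dataset is an assumption (a team might instead gate on a percentile or per-row minimum), and the row keys and `release_gate` function are hypothetical.

```python
from statistics import mean

JUDGE_THRESHOLD = 0.85  # from the example gate above
TOOL_THRESHOLD = 0.9    # from the example gate above


def release_gate(rows: list[dict[str, float]]) -> dict[str, object]:
    """Average both metrics over the golden dataset and apply both thresholds."""
    if not rows:
        raise ValueError("golden dataset is empty")
    judge_avg = mean(r["judge_score"] for r in rows)
    tool_avg = mean(r["tool_selection_accuracy"] for r in rows)
    return {
        "judge_avg": judge_avg,
        "tool_avg": tool_avg,
        "passed": judge_avg >= JUDGE_THRESHOLD and tool_avg >= TOOL_THRESHOLD,
    }


# Illustrative golden-dataset results, not real measurements.
golden = [
    {"judge_score": 0.9, "tool_selection_accuracy": 1.0},
    {"judge_score": 0.8, "tool_selection_accuracy": 0.9},
    {"judge_score": 0.9, "tool_selection_accuracy": 0.9},
]
decision = release_gate(golden)
print(decision["passed"])
```

Requiring both conditions means a strong judge score cannot offset weak tool selection, which is the point of pairing the judge with an objective metric.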

How to Measure or Detect It

Measure agent-as-judge as an eval layer plus agreement checks:

CustomEvaluation: runs the judge rubric and returns a score, pass/fail decision, and reason for the judged trajectory.
ToolSelectionAccuracy: checks whether the worker agent chose the right tool at each step.
TrajectoryScore: summarizes goal progress, step quality, and end-to-end trajectory health.
agent.trajectory.step: the trace attribute that lets dashboards slice the judge result by step number, tool, and retry.
eval-fail-rate-by-cohort: dashboard signal for judge failures by model version, prompt version, user segment, or tool set.
human disagreement rate: proxy for calibration; sample judge failures and passes against human reviewers.
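The last two signals in the list are simple ratios. The sketch below computes them from plain dictionaries, assuming each record carries a boolean judge verdict, an optional human verdict, and a cohort field. The record keys and both function names are hypothetical, not FutureAGI dashboard fields.

```python
from collections import defaultdict
from typing import Any


def fail_rate_by_cohort(records: list[dict[str, Any]], cohort_key: str) -> dict[str, float]:
    """Fraction of judged runs that failed, grouped by a cohort field."""
    totals: dict[str, int] = defaultdict(int)
    fails: dict[str, int] = defaultdict(int)
    for r in records:
        cohort = r[cohort_key]
        totals[cohort] += 1
        if not r["judge_pass"]:
            fails[cohort] += 1
    return {c: fails[c] / totals[c] for c in totals}


def human_disagreement_rate(records: list[dict[str, Any]]) -> float:
    """Share of human-reviewed runs where judge and reviewer verdicts differ."""
    reviewed = [r for r in records if r.get("human_pass") is not None]
    if not reviewed:
        raise ValueError("no human-reviewed records to compare against")
    disagreements = sum(r["judge_pass"] != r["human_pass"] for r in reviewed)
    return disagreements / len(reviewed)


# Illustrative records; human_pass is None where no reviewer sampled the run.
records = [
    {"prompt_version": "v1", "judge_pass": True, "human_pass": True},
    {"prompt_version": "v1", "judge_pass": False, "human_pass": False},
    {"prompt_version": "v2", "judge_pass": False, "human_pass": True},
    {"prompt_version": "v2", "judge_pass": False, "human_pass": None},
]
rates = fail_rate_by_cohort(records, "prompt_version")
disagreement = human_disagreement_rate(records)
print(rates, round(disagreement, 3))
```

Sampling both passes and failures for human review matters here: a sample of failures alone cannot reveal runs the judge passed but a reviewer would have rejected.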

Minimal Python:

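from fi.evals import CustomEvaluation

# Illustrative placeholder inputs; in practice these come from your rubric and the traced run.
agent_rubric = "Score 0-1 for goal completion, unsafe action avoidance, evidence use, and correct escalation."
task = "Book a refundable flight and hotel for the requested dates."
agent_steps = []  # ordered trajectory steps (reasoning, tool calls, observations) from the trace
final_answer = "Booking confirmed."

# Build the judge from the rubric, then score the whole trajectory, not just the output.
judge = CustomEvaluation(name="agent_judge", rubric=agent_rubric)
result = judge.evaluate(
    input=task,
    trajectory=agent_steps,
    output=final_answer,
)
print(result.score, result.reason)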

Common Mistakes

Judging only the final answer. Many agent failures live in the path: unsafe tool use, skipped verification, hidden retries, or ignored observations.
Using one judge without calibration. Compare judge scores against human review before using them as deployment gates.
Mixing safety and task success in one vague score. Separate completion, tool correctness, policy compliance, and escalation behavior.
Letting the judge see hidden labels. If the judge receives gold answers or policy hints unavailable in production, scores will overstate reliability.
Treating the judge as ground truth. Agent-as-judge is a scalable signal, not proof; audit disagreements and drift over time.
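One way to guard against the hidden-label mistake is to build the judge input from an explicit allowlist, so gold answers and annotation fields never reach the judge. This is a minimal sketch: the field names mirror the judge inputs described earlier (goal, trajectory, tool results, policy rubric), but the exact keys and `build_judge_payload` are hypothetical.

```python
from typing import Any

# Only fields that would also exist for a production run are visible to the judge.
JUDGE_VISIBLE_FIELDS = frozenset(
    {"goal", "trajectory", "tool_results", "final_answer", "policy_rubric"}
)


def build_judge_payload(record: dict[str, Any]) -> dict[str, Any]:
    """Drop every field not on the allowlist, including gold labels and reviewer notes."""
    return {k: v for k, v in record.items() if k in JUDGE_VISIBLE_FIELDS}


# Illustrative golden-dataset row that carries labels the judge must not see.
row = {
    "goal": "Book a refundable flight",
    "trajectory": ["search_flights", "book_flight"],
    "final_answer": "Booked.",
    "gold_answer": "Booked a refundable fare.",
    "annotator_note": "agent skipped the fare rule check",
}
payload = build_judge_payload(row)
print(sorted(payload))
```

An allowlist fails closed: a newly added label column is excluded by default, whereas a blocklist would leak it until someone remembered to update the list.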