Top 5 G-Eval Use Cases (2026): Where Rubric Judges Actually Win
Five use cases where G-Eval is the right primitive: subjective rubric scoring, faithfulness on free-form text, custom-domain rubrics, multi-criterion weighted scoring, and reasoning-step evaluation. Plus when to switch.
Your team ships a customer-support agent. You want a score for “is this actually helpful,” another for “did the assistant hold the warm-but-professional tone the brand guide demands,” and a third for “did the reasoning justify the answer.” None of those scores live in a classifier. None reduce to embedding similarity. They are rubrics, and you need a judge model to evaluate them.
G-Eval (Liu et al. 2023) is the right primitive — LLM-as-judge scoring with chain-of-thought reasoning and a structured form-filled output. The question is no longer “should I use G-Eval,” it’s “which of my rubrics actually need a judge, and where am I burning tokens on a job a parser could do.”
This is the use-case companion to the G-Eval definitive guide. The guide covers the method, the biases, and the hardening pattern. This post covers the five places G-Eval earns the cost of a judge call, plus the cases where a cheaper primitive wins.
TL;DR: the five categories
| Use case | Why G-Eval is the right primitive | What the cheaper alternative misses |
|---|---|---|
| Subjective rubric scoring | Property scoring, not reference matching | Embedding similarity scores lexical overlap, not “is this warm” |
| Faithfulness on free-form text | No gold answer to diff against | BLEU/ROUGE need a reference; classifiers don’t read your context |
| Custom-domain rubrics | Grades against your policy doc | No classifier ships pre-trained on your internal style guide |
| Multi-criterion weighted scoring | One judge call, N axes, coordinated reasoning | Five single-criterion judges cost five times more and disagree |
| Reasoning-step evaluation | Grades the chain of thought, not the final answer | A correct answer with broken reasoning passes every other check |
All five share a property: a regex, a parser, an embedding similarity, or a classifier can’t capture the criterion. Outside these five, the cheaper primitive wins.
Use case 1: subjective-rubric scoring
The shape. You have an output and a property you want to score the output against. The property is subjective — “warm tone,” “concise,” “actionable,” “matches our brand voice.” There’s no gold reference to diff against, no classifier pre-trained on your specific notion of “warm,” and no parser that can pattern-match “actionable.” The rubric is the only way to score it, and the rubric needs a judge that can reason about whether the output exhibits the property.
This is G-Eval’s home territory. The original paper’s headline result was on SummEval (summarization scored on coherence, consistency, fluency, and relevance), where G-Eval with GPT-4 as the judge reached an average Spearman correlation of 0.514 with human raters, the strongest correlation on that benchmark at the time. The recipe transfers cleanly to any subjective property: helpfulness, tone, persuasiveness, calibration, conciseness, refusal quality.
The FAGI implementation is a CustomLLMJudge with a one-paragraph rubric:
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

tone_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "support_tone",
        "model": "gpt-4o",
        "grading_criteria": (
            "Score 1.0 if the response is warm, uses first-person plural "
            "(we, our), acknowledges the customer's frustration without "
            "over-apologizing, and proposes a concrete next step. "
            "Score 0.5 if it meets two of those criteria. "
            "Score 0.0 if it is curt, third-person formal, or non-actionable."
        ),
    },
)

result = tone_judge.compute_one(CustomInput(
    question="My order shipped to the wrong address. Help.",
    answer="We're sorry that happened. Let's get this sorted...",
))
The rubric is the metric. The judge is the engine. The cascade is the optimization. Wire augment=True on evaluate() and a local heuristic prior runs first (length cap, apology phrase presence), feeding the judge as in-context evidence. The judge runs cold only on cases the deterministic floor lets through.
When to switch primitives: if your “subjective” rubric reduces to “does this contain X,” it’s a parser job. The local registry ships contains, contains_all, equals, length_between, and the 16-strong string-heuristic family — sub-millisecond, deterministic. Reach for the judge when the property is genuinely open-ended.
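A rough sketch of that "parser first, judge second" split, under loud assumptions: the two checks below are plain-Python stand-ins written for illustration, not the registry's contains_all or length_between implementations, and score_with_floor, fake_judge, and the 1200-character cap are hypothetical.

```python
from typing import Callable


def contains_all(text: str, phrases: list[str]) -> bool:
    """Stand-in check: every phrase appears, case-insensitively."""
    lowered = text.lower()
    return all(p.lower() in lowered for p in phrases)


def length_between(text: str, lo: int, hi: int) -> bool:
    """Stand-in check: character length falls inside [lo, hi]."""
    return lo <= len(text) <= hi


def score_with_floor(
    answer: str,
    judge: Callable[[str], float],
    required_phrases: list[str],
    max_chars: int = 1200,  # hypothetical cap
) -> tuple[float, str]:
    """Run the deterministic checks first; call the judge only if they pass."""
    if not length_between(answer, 1, max_chars):
        return 0.0, "failed length check"
    if not contains_all(answer, required_phrases):
        return 0.0, "missing required phrase"
    return judge(answer), "judged"


# Hypothetical judge stub so the sketch runs without a model call.
calls: list[str] = []


def fake_judge(answer: str) -> float:
    calls.append(answer)
    return 0.5


short_circuit = score_with_floor("", fake_judge, ["next step"])
judged = score_with_floor(
    "We're sorry. The next step is a reshipment.", fake_judge, ["next step"]
)
```

The empty answer never reaches the judge; only the second call spends a judge invocation.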
Use case 2: faithfulness on free-form text
The shape. The agent produces a free-form response grounded in retrieved context. The response should be faithful — no fabricated claims, no contradicted facts, no hallucinated quantities. There’s no gold answer to diff against because the response is generated, not retrieved verbatim. BLEU won’t catch a fabrication that happens to be high-overlap with adjacent paragraphs. Only a judge that reads both response and source and asks “does every claim trace back” catches it.
This is the second G-Eval canonical: summarization faithfulness, RAG groundedness, citation accuracy. A CustomLLMJudge rubric scores claim-by-claim alignment between output and context:
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

faithfulness_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "rag_faithfulness",
        "model": "gpt-4o",
        "grading_criteria": (
            "Score 1.0 if every claim in the response is directly supported "
            "by an exact quote or close paraphrase from the context. "
            "Score 0.5 if claims are loosely supported but require inference "
            "the context does not license. Score 0.0 if any claim contradicts "
            "the context, fabricates a quantity, or invents a fact. List "
            "the unsupported claims in your reasoning."
        ),
    },
)

result = faithfulness_judge.compute_one(CustomInput(
    context=retrieved_passages,
    response=agent_output,
))
For high-volume RAG faithfulness, the cascade matters more than the rubric. The FAGI Groundedness cloud template runs at classifier cost via the Turing family; the custom judge runs only on bottom-decile cases the classifier flags as ambiguous:
from fi.evals import evaluate

result = evaluate(
    "groundedness",
    output=response,
    context=retrieved_passages,
    augment=True,
    model="gpt-4o",
)
augment=True routes traffic through the local NLI metric first, then escalates with the per-claim breakdown as in-context evidence. Most teams see 80-90% cost reduction versus judge-on-everything with no measurable drop in detection rate.
When to switch primitives: if you have a gold answer (closed-domain QA against a known reference), use embedding_similarity, rouge_score, or bleu_score from the local registry. G-Eval earns the bill only when there’s no reference to diff against.
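To make the "gold answer" case concrete, here is a minimal sketch of a reference-based score: a ROUGE-1 style unigram F1 written by hand. It assumes naive lowercase whitespace tokenization and is not the registry's rouge_score; unigram_f1 and the example strings are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 style F1 over lowercased whitespace tokens (simplified)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


exact = unigram_f1("refunds take five days", "refunds take five days")
partial = unigram_f1("refunds take five days", "refunds take ten business days")
```

No model call is involved, which is why this family is the cheaper choice whenever a reference exists.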
Use case 3: custom-domain rubrics
The shape. You grade outputs against an internal document — a policy, a style guide, a compliance framework, a brand voice manifesto. The rubric lives in that document. No classifier ships pre-trained on your specific policy. No embedding model knows what your style guide says about Oxford commas. The only way to score against it is to hand the document to the judge and ask “does this output comply.”
Custom-domain rubrics are where G-Eval is structurally irreplaceable. Healthcare policies, legal citation rules, financial suitability frameworks, brand-voice guides — each is bespoke per org, and the criterion is “compliance with this document,” not “match some external standard.” Inject the document into the rubric, grade per clause:
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

policy_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "brand_voice_adherence",
        "model": "claude-sonnet-4-5",
        "grading_criteria": """
Score the response against the brand voice guide below.
Brand voice guide:
{style_guide}
For each guideline in the document, determine whether the response
complies, partially complies, or violates. Score 1.0 only if all
guidelines comply. Score 0.0 if any guideline is violated. List
violated guideline IDs in the reasoning.
""",
    },
)

result = policy_judge.compute_one(CustomInput(
    style_guide=brand_voice_doc,
    response=agent_output,
))
Few-shot calibration matters more for custom-domain rubrics than for generic ones — brand voice and policy compliance are judgment-heavy and a bare rubric drifts. Seed few_shot_examples with five to ten labelled cases per axis, then wire the FeedbackCollector plus FeedbackRetriever pair in fi.evals.feedback so corrections persist in ChromaFeedbackStore and auto-inject as few-shots on every new judge call:
from fi.evals.feedback import FeedbackCollector, configure_feedback
from fi.evals.feedback.store import ChromaFeedbackStore

store = ChromaFeedbackStore(
    collection_name="brand_voice_corrections",
    embedding_model="all-MiniLM-L6-v2",
)
configure_feedback(store)

FeedbackCollector(store).record(
    eval_name="brand_voice_adherence",
    inputs={"style_guide": brand_voice_doc, "response": output},
    original_score=0.85,
    correct_score=0.40,
    correct_reason="Used third-person formal in a casual support context",
    tags=["voice-formality-mismatch"],
)
The next judge call on a similar response auto-pulls top-N similar corrections via vector search and merges them into few_shot_examples. Prompt-tuning by RAG’d few-shots, no fine-tuning required — the self-improving evaluator surface in the SDK.
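The retrieval step can be pictured as a nearest-neighbour lookup over embedded corrections. The sketch below is illustrative only: it uses hand-written three-dimensional toy vectors rather than real all-MiniLM-L6-v2 embeddings, it says nothing about how ChromaFeedbackStore works internally, and top_n_corrections is a hypothetical helper.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_corrections(query_vec, corrections, n=2):
    """Return the n stored corrections whose vectors are closest to query_vec."""
    ranked = sorted(
        corrections,
        key=lambda c: cosine(query_vec, c["vector"]),
        reverse=True,
    )
    return ranked[:n]


# Toy vectors invented for illustration.
stored = [
    {"reason": "third-person formal in casual context", "vector": [0.9, 0.1, 0.0]},
    {"reason": "missing next step", "vector": [0.0, 1.0, 0.0]},
    {"reason": "over-apologizing", "vector": [0.7, 0.0, 0.7]},
]
few_shots = top_n_corrections([1.0, 0.0, 0.0], stored, n=2)
```

The two closest corrections are what would be merged into the judge prompt as few-shot examples.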
When to switch primitives: if the “domain rubric” reduces to a checklist of regex patterns (disclosure text presence, banned-phrase list), it’s a parser job. regex, contains_all, and RegexScanner handle it sub-millisecond.
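A checklist of that kind fits in a few lines of standard-library regex. This is a sketch under assumptions: the disclosure text and banned phrases are an invented policy, and checklist_violations is a hypothetical helper, not the SDK's regex metric or RegexScanner.

```python
import re

# Hypothetical policy: one required disclosure and two banned phrases.
REQUIRED = [re.compile(r"not financial advice", re.IGNORECASE)]
BANNED = [
    re.compile(r"\bguaranteed returns?\b", re.IGNORECASE),
    re.compile(r"\brisk[- ]free\b", re.IGNORECASE),
]


def checklist_violations(text: str) -> list[str]:
    """Return human-readable violations; an empty list means compliant."""
    problems = []
    for pattern in REQUIRED:
        if not pattern.search(text):
            problems.append(f"missing: {pattern.pattern}")
    for pattern in BANNED:
        if pattern.search(text):
            problems.append(f"banned: {pattern.pattern}")
    return problems


ok = checklist_violations("This is not financial advice. Markets vary.")
bad = checklist_violations("Enjoy guaranteed returns, totally risk-free!")
```

The second string trips all three rules: the disclosure is missing and both banned phrases match.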
Use case 4: multi-criterion weighted scoring
The shape. Output quality rolls up from multiple axes. A code review has clarity, correctness, actionability, and severity calibration. A summary has coherence, consistency, fluency, and relevance. A support response has tone, accuracy, completeness, and conciseness. Each axis matters independently, but the team wants one rolled-up number for the dashboard.
For axes that share context — same input, same reference document, same persona definition — one multi-criterion judge wins on cost (one call instead of four) and on coordination (the judge sees the axes together and avoids the discoordination that comes from siloed judgments). CustomLLMJudge supports this natively via a structured rubric and a JSON-shaped output:
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

multi_axis_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "code_review_quality",
        "model": "claude-sonnet-4-5",
        "grading_criteria": """
Score this LLM-generated code review comment on four axes.
1. Clarity: is the comment specific enough that the author knows
what to change? (file, line, before/after where applicable)
2. Correctness: is the technical claim accurate? (no hallucinated
lint rules, no misattributed framework conventions)
3. Actionability: does the comment propose a concrete fix or is
it a vague hand-wave?
4. Severity calibration: is the comment's tone matched to the
severity? (no "CRITICAL" on a typo, no "nit" on a security bug)
Output JSON: {"clarity": 0-1, "correctness": 0-1,
"actionability": 0-1, "severity": 0-1,
"reason": str}
""",
    },
)

result = multi_axis_judge.compute_one(CustomInput(
    diff=code_diff,
    review=llm_review_comment,
))
Weighted aggregation happens at the orchestration layer. A common pattern: correctness 40%, actionability 30%, clarity 20%, severity 10%. The aggregate becomes the dashboard number; per-axis scores stay available for regression diagnosis when the aggregate moves.
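A minimal sketch of that orchestration-layer rollup, using the weights from the pattern above. It assumes the judge's JSON is available as a plain string; how the SDK exposes it is not shown here, and aggregate and the sample output are hypothetical.

```python
import json

# Weights from the pattern described above; they sum to 1.0.
WEIGHTS = {"correctness": 0.4, "actionability": 0.3, "clarity": 0.2, "severity": 0.1}


def aggregate(judge_json: str, weights: dict[str, float] = WEIGHTS) -> float:
    """Parse the judge's JSON output and return the weighted score."""
    scores = json.loads(judge_json)
    total = 0.0
    for axis, weight in weights.items():
        value = float(scores[axis])  # KeyError if the judge dropped an axis
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{axis} score {value} is outside [0, 1]")
        total += weight * value
    return round(total, 4)


# Hypothetical judge output for illustration.
raw = (
    '{"clarity": 0.5, "correctness": 1.0, "actionability": 0.5, '
    '"severity": 1.0, "reason": "..."}'
)
dashboard_score = aggregate(raw)
```

Failing loudly on a missing or out-of-range axis keeps a malformed judge response from silently dragging the dashboard number.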
Decision rule: shared context favors one judge; independent dimensions favor separate judges. If the four axes are “faithfulness on retrieved context,” “persona consistency across turns,” “tool-call plausibility,” and “policy adherence,” they don’t share context — one judge would have to read four different things. Run separate judges, aggregate downstream.
When to switch primitives: if the multi-axis rubric is “schema compliance, JSON validity, length bounds, regex match,” it’s four parsers stacked, not one judge with four axes. schema_compliance, is_json, length_between, and regex handle that roughly 10,000x cheaper than a judge call.
Use case 5: reasoning-step evaluation
The shape. The agent gets the right number but the reasoning had a sign error that cancelled out. The agent answers a customer question correctly but the reasoning trace shows it pulled the answer from a competitor’s docs. The agent picks the right tool but the logged rationale is unrelated to the actual decision. The final answer is correct, the reasoning is broken, every other check passes.
Reasoning-step evaluation grades the chain of thought, not the final answer. The category exists only because frontier agents log their reasoning, and that trace carries signal the answer doesn’t. A CustomLLMJudge rubric scores it independently:
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

reasoning_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "reasoning_quality",
        "model": "claude-sonnet-4-5",
        "grading_criteria": """
Score the agent's reasoning trace, not the final answer.
1. Soundness: does each step follow from the previous?
2. Grounding: does the reasoning cite tools, retrieved context, or
prior steps for non-trivial claims?
3. Efficiency: are there obvious detours, redundant steps, or
dead-end branches?
4. Source attribution: when the reasoning references facts, are
those facts traceable to a tool call or retrieved document?
Output JSON: {"soundness": 0-1, "grounding": 0-1,
"efficiency": 0-1, "attribution": 0-1, "reason": str}
""",
    },
)

result = reasoning_judge.compute_one(CustomInput(
    user_query=user_message,
    reasoning_trace=agent_chain_of_thought,
    final_answer=agent_response,
))
The FAGI agent-trajectory metrics in fi.evals.metrics.agents wrap this pattern as named metrics: reasoning_quality, goal_progress, trajectory_score, step_efficiency, tool_selection_accuracy, action_safety, task_completion. Each takes an AgentTrajectoryInput (steps, tool calls, expected goal), scores it deterministically with a heuristic, and supports augment=True so the LLM judge layers in on edge cases:
from fi.evals import evaluate

trajectory_score = evaluate(
    "reasoning_quality",
    trajectory=agent_steps,
    augment=True,
    model="gpt-4o",
)
Reasoning-step evaluation pairs naturally with the “agent passes evals, fails in production” pattern: when the final-answer metric reads green but production users escalate, the reasoning-step score is the diagnostic axis that surfaces the gap.
When to switch primitives: if you only care whether the final answer matches a gold reference, use exact match or embedding similarity. Reasoning-step G-Eval earns the bill only when the reasoning carries information the final answer doesn’t.
When G-Eval is the wrong tool
Four guardrails before reaching for a judge model on any rubric.
Deterministic checks first. Schema validation, regex, length bounds, citation existence, JSON parsing — parser jobs. The local registry ships 16 string heuristics, three JSON heuristics, and 11 structured-output metrics. Sub-millisecond, deterministic, never drift. Put them in front; they catch the failures G-Eval was never the right tool for and save the judge bill for cases that need reasoning.
Fine-tuned classifiers second. Toxicity, prompt injection, PII, jailbreaks, bias, code injection — every safety dimension with a mature classifier outperforms G-Eval on cost, latency, and consistency. The FAGI 18-scanner guardrail stack screens at 50-70 ms p95; the four Gemma 3n LoRA adapters in Protect (toxicity, bias_detection, prompt_injection, data_privacy_compliance) hit 65 ms text and 107 ms image median time-to-label.
Lexical overlap with a reference. When you have a gold answer, reference-based metrics get you there for free. bleu_score, rouge_score, embedding_similarity, and levenshtein_similarity ship via ai-evaluation[similarity]. G-Eval earns the bill only when there’s no reference to diff against.
High-volume continuous evaluation. A judge on every span in a million-span-per-day pipeline becomes the eval budget. Use a classifier as the front line and route ambiguous cases to G-Eval for second-pass adjudication via augment=True. The cascade typically cuts cost 80-90% with no measurable drop in catch rate.
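The routing logic and the cost arithmetic behind it can be sketched as follows. Everything numeric here is invented for illustration: the 0.3 and 0.7 band edges, the sample scores, and the unit costs are assumptions, and route and cascade_cost are hypothetical helpers rather than what augment=True does internally.

```python
def route(classifier_score: float, low: float = 0.3, high: float = 0.7) -> str:
    """Send confident scores straight through; escalate the ambiguous band."""
    if classifier_score < low:
        return "fail"
    if classifier_score > high:
        return "pass"
    return "judge"


def cascade_cost(scores, classifier_cost: float, judge_cost: float) -> float:
    """Cost when every item hits the classifier and only 'judge' items escalate."""
    escalated = sum(1 for s in scores if route(s) == "judge")
    return len(scores) * classifier_cost + escalated * judge_cost


# Invented scores and unit costs, for illustration only.
scores = [0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95, 0.97, 0.99, 0.6]
cascade = cascade_cost(scores, classifier_cost=1.0, judge_cost=50.0)
judge_everything = len(scores) * 50.0
```

With two of ten items in the ambiguous band, the cascade costs 110 units against 500 for judging everything. The saving depends entirely on how much traffic lands in that band, so measure it on your own data before quoting a number.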
The deterministic LLM evaluation metrics post is the reference for the deterministic side; LLM-as-judge best practices covers judge-mode tradeoffs.
FAGI CustomLLMJudge: G-Eval as a production primitive
The gap between the G-Eval paper and a production-grade evaluator is the gap between a recipe and a contract. The recipe holds for a sprint. The contract has to hold for two years across judge swaps, prompt revisions, retrieval drift, and a 10x traffic ramp. FAGI ships CustomLLMJudge as the contract.
The class lives in fi.evals.metrics.llm_as_judges.custom_judge.metric. It is a Jinja2-templated primitive: grading_criteria carries the rubric, few_shot_examples carries calibration, model picks the judge via LiteLLM (OpenAI, Anthropic, Gemini, Bedrock, Ollama), and input keys support multi-modal payloads. Output is a MetricResult with score, passed, reason, and details. The same class powers 70+ EvalTemplate rubrics, among them Groundedness, ContextAdherence, FactualAccuracy, TaskCompletion, SummaryQuality, EvaluateFunctionCalling, and the eleven CustomerAgent* multi-turn templates.
The closed loop turns the rubric into a permanent quality signal. traceAI instruments 50+ AI surfaces across Python, TypeScript, Java, and C#. Wire the same rubric as an EvalTag on the project and the eval runs server-side after export at zero added inference latency:
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType, EvalSpanKind, EvalName, ModelChoices,
)

register(
    project_name="support_agent",
    project_type=ProjectType.OBSERVE,
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.TASK_COMPLETION,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
    ],
)
ThresholdCalibrator from fi.evals.feedback keeps the rubric anchored: it sweeps 13 thresholds between 0.3 and 0.9, computes a TP/FP/TN/FN confusion matrix against human labels, and picks the threshold that maximizes accuracy or F1. The FeedbackStore plus FeedbackRetriever pair persists corrections in ChromaDB and auto-injects the most similar past corrections as few_shot_examples on the next judge call. The FAGI Platform layers self-improving evaluators on top, retuning the rubric from production thumbs feedback. Failing scored traces flow into Error Feed, where HDBSCAN soft-clustering groups them, a Sonnet 4.5 Judge agent writes the immediate_fix, and the Platform retunes thresholds.
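The threshold sweep is simple enough to sketch in plain Python. This is an illustration of the idea, not ThresholdCalibrator's code: it assumes the 13 thresholds are evenly spaced at 0.05 steps (0.30, 0.35, ..., 0.90), and sweep_thresholds plus the sample scores and labels are invented.

```python
def sweep_thresholds(scores, labels, metric="f1"):
    """Sweep 13 thresholds from 0.30 to 0.90; return (best_threshold, best_value)."""
    best = (None, -1.0)
    for i in range(13):
        threshold = round(0.30 + 0.05 * i, 2)
        tp = fp = tn = fn = 0
        for score, label in zip(scores, labels):
            predicted = score >= threshold
            if predicted and label:
                tp += 1
            elif predicted and not label:
                fp += 1
            elif not predicted and label:
                fn += 1
            else:
                tn += 1
        if metric == "accuracy":
            value = (tp + tn) / len(labels)
        else:
            denom = 2 * tp + fp + fn
            value = 2 * tp / denom if denom else 0.0
        if value > best[1]:  # ties keep the lowest threshold
            best = (threshold, value)
    return best


# Invented judge scores and human pass/fail labels.
judge_scores = [0.2, 0.35, 0.5, 0.55, 0.7, 0.8, 0.9]
human_labels = [False, False, False, True, True, True, True]
threshold, f1 = sweep_thresholds(judge_scores, human_labels)
```

On this toy data the sweep settles on 0.55, the lowest threshold that separates the human-labelled passes from the fails.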
This is the self-improving loop the FAGI stack ships end-to-end — generate, simulate, evaluate, optimize, loop — with G-Eval as the rubric primitive every layer scores against. Competitors ship the parts. FAGI ships the loop.
Production checklist for picking the right primitive
Five questions before any new rubric:
- Is the criterion subjective? If yes, G-Eval. If no, parser or classifier.
- Is there a gold reference? If yes, lexical or embedding metric. If no, judge.
- Is the rubric your internal document? If yes, G-Eval with the document in-prompt.
- Do multiple axes share context? If yes, one multi-criterion judge. If no, separate judges aggregated downstream.
- Are you grading reasoning or final answer? Reasoning → G-Eval on the trace. Answer → whichever primitive fits the answer shape.
The five use cases above earn the cost of an LLM-as-judge call in production. Most other rubrics earn the cost of a classifier or a deterministic check. The LLM evaluation playbook is the broader six-layer framework these sit inside; the G-Eval definitive guide is the method explainer.
Related reading
- G-Eval (2026): The Definitive Guide for Production LLM Teams
- LLM Evaluation Playbook (2026)
- LLM-as-Judge Best Practices (2026)
- Deterministic LLM Evaluation Metrics (2026)
- Evaluating Tool-Calling Agents (2026)
- Multi-Turn LLM Evaluation (2026)
- Your Agent Passes Evals and Fails in Production (2026)
- LLM Observability Self-Hosting Guide (2026)