Top 5 G-Eval Use Cases (2026): Where Rubric Judges Actually Win

Five use cases where G-Eval is the right primitive: subjective rubric scoring, faithfulness on free-form text, custom-domain rubrics, multi-criterion weighted scoring, and reasoning-step evaluation.

March 7, 2026

10 min read

g-eval llm-as-judge llm-evaluation rubrics custom-evaluators 2026

Your team ships a customer-support agent. You want a score for “is this actually helpful,” another for “did the assistant hold the warm-but-professional tone the brand guide demands,” and a third for “did the reasoning justify the answer.” None of those scores live in a classifier. None reduce to embedding similarity. They are rubrics, and you need a judge model to evaluate them.

G-Eval (Liu et al. 2023) is the right primitive — LLM-as-judge scoring with chain-of-thought reasoning and a structured form-filled output. The question is no longer “should I use G-Eval,” it’s “which of my rubrics actually need a judge, and where am I burning tokens on a job a parser could do.”

This is the use-case companion to the G-Eval definitive guide. The guide covers the method, the biases, and the hardening pattern. This post covers the five places G-Eval earns the cost of a judge call, plus the cases where a cheaper primitive wins.

TL;DR: the five categories

Use case	Why G-Eval is the right primitive	What the cheaper alternative misses
Subjective rubric scoring	Property scoring, not reference matching	Embedding similarity scores closeness to a reference, not “is this warm”
Faithfulness on free-form text	No gold answer to diff against	BLEU/ROUGE need a reference; classifiers don’t read your context
Custom-domain rubrics	Grades against your policy doc	No classifier ships pre-trained on your internal style guide
Multi-criterion weighted scoring	One judge call, N axes, coordinated reasoning	Five single-criterion judges cost five times more and disagree
Reasoning-step evaluation	Grades the chain of thought, not the final answer	A correct answer with broken reasoning passes every other check

All five share a property: a regex, a parser, an embedding similarity, or a classifier can’t capture the criterion. Outside these five, the cheaper primitive wins.

Use case 1: subjective-rubric scoring

The shape. You have an output and a property you want to score the output against. The property is subjective — “warm tone,” “concise,” “actionable,” “matches our brand voice.” There’s no gold reference to diff against, no classifier pre-trained on your specific notion of “warm,” and no parser that can pattern-match “actionable.” The rubric is the only way to score it, and the rubric needs a judge that can reason about whether the output exhibits the property.

This is G-Eval’s home territory. The original paper landed on SummEval — summarization scored on coherence, consistency, fluency, and relevance — and hit Spearman 0.514 against human raters, the strongest correlation on that benchmark at the time. The recipe transfers cleanly to any subjective property: helpfulness, tone, persuasiveness, calibration, conciseness, refusal quality.

The FAGI implementation is a CustomLLMJudge with a one-paragraph rubric:

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

tone_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "support_tone",
        "model": "gpt-4o",
        "grading_criteria": (
            "Score 1.0 if the response is warm, uses first-person plural "
            "(we, our), acknowledges the customer's frustration without "
            "over-apologizing, and proposes a concrete next step. "
            "Score 0.5 if it meets two of those criteria. "
            "Score 0.0 if it is curt, third-person formal, or non-actionable."
        ),
    },
)

result = tone_judge.compute_one(CustomInput(
    question="My order shipped to the wrong address. Help.",
    answer="We're sorry that happened. Let's get this sorted...",
))

The rubric is the metric. The judge is the engine. The cascade is the optimization. Wire augment=True on evaluate() and a local heuristic prior runs first (length cap, apology phrase presence), feeding the judge as in-context evidence. The judge runs cold only on cases the deterministic floor lets through.

When to switch primitives: if your “subjective” rubric reduces to “does this contain X,” it’s a parser job. The local registry ships contains, contains_all, equals, length_between, and the 16-strong string-heuristic family — sub-millisecond, deterministic. Reach for the judge when the property is genuinely open-ended.
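
The ordering described above (cheap deterministic check first, judge only when the check cannot decide) can be sketched in plain Python. This is a minimal illustration, not the SDK's cascade: `heuristic_floor`, `score_tone`, the phrase list, and the 600-character cap are all hypothetical, and `fake_judge` stands in for a real judge call.

```python
from typing import Callable, Optional

APOLOGY_PHRASES = ("sorry", "apologize", "apologies")  # hypothetical phrase list


def heuristic_floor(answer: str, max_chars: int = 600) -> Optional[float]:
    """Return 0.0 when a cheap check already fails the response, else None."""
    if not answer.strip():
        return 0.0  # an empty reply can never be warm or actionable
    if len(answer) > max_chars:
        return 0.0  # assumed length cap, not a value from the SDK
    return None  # undecided: the judge has to read it


def score_tone(answer: str, judge: Callable[[str, dict], float]) -> float:
    """Run the deterministic floor first; call the judge only when undecided."""
    floor = heuristic_floor(answer)
    if floor is not None:
        return floor
    evidence = {
        "length": len(answer),
        "has_apology": any(p in answer.lower() for p in APOLOGY_PHRASES),
    }
    return judge(answer, evidence)  # heuristic signals ride along as evidence


# Stand-in judge so the sketch runs without a model call.
calls = []


def fake_judge(answer: str, evidence: dict) -> float:
    calls.append(evidence)
    return 1.0 if evidence["has_apology"] else 0.5


short_circuit = score_tone("", fake_judge)
judged = score_tone("We're sorry that happened. Let's get this sorted.", fake_judge)
```

Only the second call reaches the judge; the empty response is scored by the floor alone, which is the point of putting the parser in front.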

Use case 2: faithfulness on free-form text

The shape. The agent produces a free-form response grounded in retrieved context. The response should be faithful — no fabricated claims, no contradicted facts, no hallucinated quantities. There’s no gold answer to diff against because the response is generated, not retrieved verbatim. BLEU won’t catch a fabrication that happens to be high-overlap with adjacent paragraphs. Only a judge that reads both response and source and asks “does every claim trace back” catches it.

This is the second canonical G-Eval application: summarization faithfulness, RAG groundedness, citation accuracy. A CustomLLMJudge rubric scores claim-by-claim alignment between output and context:

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

faithfulness_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "rag_faithfulness",
        "model": "gpt-4o",
        "grading_criteria": (
            "Score 1.0 if every claim in the response is directly supported "
            "by an exact quote or close paraphrase from the context. "
            "Score 0.5 if claims are loosely supported but require inference "
            "the context does not license. Score 0.0 if any claim contradicts "
            "the context, fabricates a quantity, or invents a fact. List "
            "the unsupported claims in your reasoning."
        ),
    },
)

result = faithfulness_judge.compute_one(CustomInput(
    context=retrieved_passages,
    response=agent_output,
))

For high-volume RAG faithfulness, the cascade matters more than the rubric. The FAGI Groundedness cloud template runs at classifier cost via the Turing family; the custom judge runs only on bottom-decile cases the classifier flags as ambiguous:

from fi.evals import evaluate

result = evaluate(
    "groundedness",
    output=response,
    context=retrieved_passages,
    augment=True,
    model="gpt-4o",
)

augment=True routes traffic through the local NLI metric first, then escalates with the per-claim breakdown as in-context evidence. Most teams see 80-90% cost reduction versus judge-on-everything with no measurable drop in detection rate.

When to switch primitives: if you have a gold answer (closed-domain QA against a known reference), use embedding_similarity, rouge_score, or bleu_score from the local registry. G-Eval earns the bill only when there’s no reference to diff against.
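
To make the reference-based side concrete, here is a simplified unigram-overlap F1 in plain Python. It is an illustrative stand-in in the spirit of ROUGE-1, not the registry's rouge_score implementation; `unigram_f1` and the example strings are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between a candidate and a gold reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


gold = "the refund is issued within five business days"
exact = unigram_f1(gold, gold)
partial = unigram_f1("the refund is issued within ten business days", gold)
unrelated = unigram_f1("please restart your router", gold)
```

The candidate that swaps "five" for "ten" still scores 0.875 here. Overlap is cheap and useful when a gold answer exists, but a high-overlap response with one wrong quantity passes it comfortably, which is the gap the faithfulness judge covers.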

Use case 3: custom-domain rubrics

The shape. You grade outputs against an internal document — a policy, a style guide, a compliance framework, a brand voice manifesto. The rubric lives in that document. No classifier ships pre-trained on your specific policy. No embedding model knows what your style guide says about Oxford commas. The only way to score against it is to hand the document to the judge and ask “does this output comply.”

Custom-domain rubrics are where G-Eval is structurally irreplaceable. Healthcare policies, legal citation rules, financial suitability frameworks, brand-voice guides — each is bespoke per org, and the criterion is “compliance with this document,” not “match some external standard.” Inject the document into the rubric, grade per clause:

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

policy_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "brand_voice_adherence",
        "model": "claude-sonnet-4-5",
        "grading_criteria": """
            Score the response against the brand voice guide below.

            Brand voice guide:
            {{ style_guide }}

            For each guideline in the document, determine whether the response
            complies, partially complies, or violates. Score 1.0 only if all
            guidelines comply. Score 0.0 if any guideline is violated. List
            violated guideline IDs in the reasoning.
        """,
    },
)

result = policy_judge.compute_one(CustomInput(
    style_guide=brand_voice_doc,
    response=agent_output,
))

Few-shot calibration matters more for custom-domain rubrics than for generic ones — brand voice and policy compliance are judgment-heavy and a bare rubric drifts. Seed few_shot_examples with five to ten labelled cases per axis, then wire the FeedbackCollector plus FeedbackRetriever pair in fi.evals.feedback so corrections persist in ChromaFeedbackStore and auto-inject as few-shots on every new judge call:

from fi.evals.feedback import FeedbackCollector, configure_feedback
from fi.evals.feedback.store import ChromaFeedbackStore

store = ChromaFeedbackStore(
    collection_name="brand_voice_corrections",
    embedding_model="all-MiniLM-L6-v2",
)
configure_feedback(store)

FeedbackCollector(store).record(
    eval_name="brand_voice_adherence",
    inputs={"style_guide": brand_voice_doc, "response": agent_output},
    original_score=0.85,
    correct_score=0.40,
    correct_reason="Used third-person formal in a casual support context",
    tags=["voice-formality-mismatch"],
)

The next judge call on a similar response auto-pulls top-N similar corrections via vector search and merges them into few_shot_examples. Prompt-tuning by RAG’d few-shots, no fine-tuning required — the self-improving evaluator surface in the SDK.
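
The retrieval step can be pictured with a toy version in plain Python. This is a sketch of the idea only: the real store uses a sentence-embedding model and a vector database, while `embed`, `cosine`, and `retrieve_few_shots` here are hypothetical helpers built on a bag-of-words vector.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real store would use a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_few_shots(response: str, corrections: list, top_n: int = 2) -> list:
    """Return the top_n stored corrections most similar to the new response."""
    query = embed(response)
    ranked = sorted(
        corrections,
        key=lambda c: cosine(query, embed(c["response"])),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical stored corrections (response text plus the human-corrected score).
corrections = [
    {"response": "The customer is advised to resubmit the form.", "correct_score": 0.4},
    {"response": "We shipped your order today, thanks for waiting!", "correct_score": 0.9},
    {"response": "The customer must contact the billing department.", "correct_score": 0.3},
]
shots = retrieve_few_shots("The customer is required to contact support.", corrections)
```

The two third-person formal corrections rank first, so a new formal-sounding response would be judged with those examples in the prompt.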

When to switch primitives: if the “domain rubric” reduces to a checklist of regex patterns (disclosure text presence, banned-phrase list), it’s a parser job. regex, contains_all, and RegexScanner handle it sub-millisecond.
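
A checklist of that kind needs nothing beyond the standard library. The sketch below uses Python's `re` module directly rather than the registry metrics named above; the disclosure text, the banned phrases, and `check_policy` are hypothetical examples.

```python
import re

# Hypothetical policy: one required disclosure and a short banned-phrase list.
REQUIRED = [re.compile(r"not financial advice", re.IGNORECASE)]
BANNED = [
    re.compile(r"\bguaranteed returns?\b", re.IGNORECASE),
    re.compile(r"\brisk[- ]free\b", re.IGNORECASE),
]


def check_policy(text: str) -> dict:
    """Report missing required patterns and any banned patterns that matched."""
    missing = [p.pattern for p in REQUIRED if not p.search(text)]
    hits = [p.pattern for p in BANNED if p.search(text)]
    return {"passed": not missing and not hits, "missing": missing, "banned": hits}


ok = check_policy("This is not financial advice. Diversify across funds.")
bad = check_policy("Our fund offers guaranteed returns and is risk-free.")
```

The failing result names exactly which patterns tripped, so there is nothing for a judge to reason about.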

Use case 4: multi-criterion weighted scoring

The shape. Output quality rolls up from multiple axes. A code review has clarity, correctness, actionability, and severity calibration. A summary has coherence, consistency, fluency, and relevance. A support response has tone, accuracy, completeness, and conciseness. Each axis matters independently, but the team wants one rolled-up number for the dashboard.

For axes that share context — same input, same reference document, same persona definition — one multi-criterion judge wins on cost (one call instead of four) and on coordination (the judge sees the axes together and avoids the discoordination that comes from siloed judgments). CustomLLMJudge supports this natively via a structured rubric and a JSON-shaped output:

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

multi_axis_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "code_review_quality",
        "model": "claude-sonnet-4-5",
        "grading_criteria": """
            Score this LLM-generated code review comment on four axes.

            1. Clarity: is the comment specific enough that the author knows
               what to change? (file, line, before/after where applicable)
            2. Correctness: is the technical claim accurate? (no hallucinated
               lint rules, no misattributed framework conventions)
            3. Actionability: does the comment propose a concrete fix or is
               it a vague hand-wave?
            4. Severity calibration: is the comment's tone matched to the
               severity? (no "CRITICAL" on a typo, no "nit" on a security bug)

            Output JSON: {"clarity": 0-1, "correctness": 0-1,
                          "actionability": 0-1, "severity": 0-1,
                          "reason": str}
        """,
    },
)

result = multi_axis_judge.compute_one(CustomInput(
    diff=code_diff,
    review=llm_review_comment,
))

Weighted aggregation happens at the orchestration layer. A common pattern: correctness 40%, actionability 30%, clarity 20%, severity 10%. The aggregate becomes the dashboard number; per-axis scores stay available for regression diagnosis when the aggregate moves.
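
A minimal sketch of that aggregation step, assuming the judge returns the JSON shape requested in the rubric. The `aggregate` helper and the sample judge output are hypothetical; the weights are the ones from the pattern above.

```python
import json

# Weights from the pattern described above; they sum to 1.0.
WEIGHTS = {"correctness": 0.4, "actionability": 0.3, "clarity": 0.2, "severity": 0.1}


def aggregate(judge_json: str, weights: dict = WEIGHTS) -> float:
    """Parse the judge's per-axis JSON and return the weighted score."""
    scores = json.loads(judge_json)
    missing = [axis for axis in weights if axis not in scores]
    if missing:
        # Fail loudly rather than silently scoring a partial judgment.
        raise ValueError(f"judge output missing axes: {missing}")
    return sum(weights[axis] * float(scores[axis]) for axis in weights)


# Hypothetical judge output in the JSON shape requested by the rubric.
raw = (
    '{"clarity": 0.5, "correctness": 1.0, "actionability": 0.5, '
    '"severity": 1.0, "reason": "accurate but vague fix"}'
)
total = aggregate(raw)
```

For this sample the aggregate is 0.4 + 0.15 + 0.1 + 0.1 = 0.75. Keeping the parsed per-axis scores next to the aggregate is what makes a later regression traceable to one axis.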

Decision rule: shared context favors one judge; independent dimensions favor separate judges. If the four axes are “faithfulness on retrieved context,” “persona consistency across turns,” “tool-call plausibility,” and “policy adherence,” they don’t share context — one judge would have to read four different things. Run separate judges, aggregate downstream.

When to switch primitives: if the multi-axis rubric is “schema compliance, JSON validity, length bounds, regex match,” it’s four parsers stacked, not one judge with four axes. schema_compliance, is_json, length_between, and regex handle that roughly 10,000x cheaper.

Use case 5: reasoning-step evaluation

The shape. The agent gets the right number but the reasoning had a sign error that cancelled out. The agent answers a customer question correctly but the reasoning trace shows it pulled the answer from a competitor’s docs. The agent picks the right tool but the logged rationale is unrelated to the actual decision. The final answer is correct, the reasoning is broken, every other check passes.

Reasoning-step evaluation grades the chain of thought, not the final answer. The category exists only because frontier agents log their reasoning, and that trace carries signal the answer doesn’t. A CustomLLMJudge rubric scores it independently:

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

reasoning_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "reasoning_quality",
        "model": "claude-sonnet-4-5",
        "grading_criteria": """
            Score the agent's reasoning trace, not the final answer.

            1. Soundness: does each step follow from the previous?
            2. Grounding: does the reasoning cite tools, retrieved context, or
               prior steps for non-trivial claims?
            3. Efficiency: are there obvious detours, redundant steps, or
               dead-end branches?
            4. Source attribution: when the reasoning references facts, are
               those facts traceable to a tool call or retrieved document?

            Output JSON: {"soundness": 0-1, "grounding": 0-1,
                          "efficiency": 0-1, "attribution": 0-1, "reason": str}
        """,
    },
)

result = reasoning_judge.compute_one(CustomInput(
    user_query=user_message,
    reasoning_trace=agent_chain_of_thought,
    final_answer=agent_response,
))

The FAGI agent-trajectory metrics in fi.evals.metrics.agents wrap this pattern as named metrics: reasoning_quality, goal_progress, trajectory_score, step_efficiency, tool_selection_accuracy, action_safety, and task_completion. Each takes an AgentTrajectoryInput (steps, tool calls, expected goal), runs a deterministic heuristic first, and supports augment=True so the LLM judge layers in on edge cases:

from fi.evals import evaluate

trajectory_score = evaluate(
    "reasoning_quality",
    trajectory=agent_steps,
    augment=True,
    model="gpt-4o",
)

Reasoning-step evaluation pairs naturally with the “agent passes evals, fails production” pattern: when the final-answer metric reads green but production users escalate, the reasoning-step score is the diagnostic axis that surfaces the gap.
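
One way to use the two scores together is to flag runs that are right for the wrong reasons. The sketch below assumes you already have a final-answer verdict and a reasoning score per run; `ScoredRun`, `right_for_wrong_reasons`, and the 0.6 cutoff are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredRun:
    run_id: str
    answer_correct: bool    # from exact match or another final-answer metric
    reasoning_score: float  # 0-1, e.g. the mean of the judge's per-axis scores


def right_for_wrong_reasons(runs, min_reasoning: float = 0.6):
    """Return runs whose answer passed but whose reasoning scored low."""
    return [r for r in runs if r.answer_correct and r.reasoning_score < min_reasoning]


runs = [
    ScoredRun("a1", True, 0.9),
    ScoredRun("a2", True, 0.3),   # correct answer, unsound steps
    ScoredRun("a3", False, 0.2),  # already caught by the answer metric
]
flagged = [r.run_id for r in right_for_wrong_reasons(runs)]
```

Only a2 is flagged: a1 is sound, and a3 already fails the final-answer check, so the reasoning score adds no new signal there.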

When to switch primitives: if you only care whether the final answer matches a gold reference, use exact match or embedding similarity. Reasoning-step G-Eval earns the bill only when the reasoning carries information the final answer doesn’t.

When G-Eval is the wrong tool

Four guardrails before reaching for a judge model on any rubric.

Deterministic checks first. Schema validation, regex, length bounds, citation existence, JSON parsing — parser jobs. The local registry ships 16 string heuristics, three JSON heuristics, and 11 structured-output metrics. Sub-millisecond, deterministic, never drift. Put them in front; they catch the failures G-Eval was never the right tool for and save the judge bill for cases that need reasoning.

Fine-tuned classifiers second. Toxicity, prompt injection, PII, jailbreaks, bias, code injection — every safety dimension with a mature classifier outperforms G-Eval on cost, latency, and consistency. The FAGI 18-scanner guardrail stack screens at 50-70 ms p95; the four Gemma 3n LoRA adapters in Protect (toxicity, bias_detection, prompt_injection, data_privacy_compliance) hit 65 ms text and 107 ms image median time-to-label.

Lexical overlap with a reference. When you have a gold answer, reference-based metrics get you there at negligible cost. bleu_score, rouge_score, embedding_similarity, and levenshtein_similarity ship via ai-evaluation[similarity]. G-Eval earns the bill only when there’s no reference to diff against.

High-volume continuous evaluation. A judge on every span in a million-span-per-day pipeline becomes the eval budget. Use a classifier as the front line and route ambiguous cases to G-Eval for second-pass adjudication via augment=True. The cascade typically cuts cost 80-90% with no measurable drop in catch rate.
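
The cascade economics reduce to simple arithmetic. The sketch below uses placeholder prices and an assumed 10% escalation rate, not measured figures; `cascade_cost` is a hypothetical helper.

```python
def cascade_cost(spans: int, classifier_cost: float, judge_cost: float,
                 escalation_rate: float) -> dict:
    """Compare judge-on-everything against classifier-first with escalation."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    judge_only = spans * judge_cost
    # Every span pays for the classifier; only escalated spans pay for the judge.
    cascade = spans * classifier_cost + spans * escalation_rate * judge_cost
    return {
        "judge_only": judge_only,
        "cascade": cascade,
        "savings": 1 - cascade / judge_only,
    }


# Placeholder prices: $0.0002 per classifier call, $0.004 per judge call,
# with 10% of spans escalated to the judge.
daily = cascade_cost(1_000_000, 0.0002, 0.004, 0.10)
```

With these assumed inputs the bill falls from about $4,000 to about $600 per day, a saving of roughly 85%. The result is most sensitive to the escalation rate, so that is the number worth measuring on your own traffic.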

The deterministic LLM evaluation metrics post is the reference for the deterministic side; LLM-as-judge best practices covers judge-mode tradeoffs.

FAGI CustomLLMJudge: G-Eval as a production primitive

The gap between the G-Eval paper and a production-grade evaluator is the gap between a recipe and a contract. The recipe holds for a sprint. The contract has to hold for two years across judge swaps, prompt revisions, retrieval drift, and a 10x traffic ramp. FAGI ships CustomLLMJudge as the contract.

The class lives in fi.evals.metrics.llm_as_judges.custom_judge.metric. It is a Jinja2-templated primitive: grading_criteria carries the rubric, few_shot_examples carries calibration, model picks the judge via LiteLLM (OpenAI, Anthropic, Gemini, Bedrock, Ollama), and input keys support multi-modal payloads. Output is a MetricResult with score, passed, reason, and details. The same class powers 70+ EvalTemplate rubrics, including Groundedness, ContextAdherence, FactualAccuracy, TaskCompletion, SummaryQuality, EvaluateFunctionCalling, and the eleven CustomerAgent* multi-turn templates.

The closed loop turns the rubric into a permanent quality signal. traceAI instruments 50+ AI surfaces across Python, TypeScript, Java, and C#. Wire the same rubric as an EvalTag on the project and the eval runs server-side after export at zero added inference latency:

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType, EvalSpanKind, EvalName, ModelChoices,
)

register(
    project_name="support_agent",
    project_type=ProjectType.OBSERVE,
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.TASK_COMPLETION,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
    ],
)

ThresholdCalibrator from fi.evals.feedback keeps the rubric anchored: it sweeps 13 thresholds between 0.3 and 0.9, computes a TP/FP/TN/FN confusion matrix against human labels, and picks the threshold that maximizes accuracy or F1. The FeedbackStore plus FeedbackRetriever pair persists corrections in ChromaDB and auto-injects the most similar past corrections as few_shot_examples on the next judge call. The FAGI Platform layers self-improving evaluators on top, retuning the rubric from production thumbs feedback. Failing scored traces flow into Error Feed, where HDBSCAN soft-clustering groups them, a Sonnet 4.5 Judge agent writes the immediate_fix, and the Platform retunes thresholds.
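
The threshold sweep itself is easy to picture in plain Python. This is a sketch of the described behavior (13 evenly spaced thresholds from 0.3 to 0.9, F1 against human labels), not the ThresholdCalibrator interface, which this post does not show; `sweep_thresholds` and the sample data are hypothetical.

```python
def sweep_thresholds(scores, labels):
    """Return (best_threshold, best_f1) over 13 thresholds from 0.30 to 0.90."""
    thresholds = [round(0.30 + 0.05 * i, 2) for i in range(13)]
    best_t, best_f1 = thresholds[0], -1.0
    for t in thresholds:
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= t and y)
        fp = sum(1 for s, y in pairs if s >= t and not y)
        fn = sum(1 for s, y in pairs if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest winning threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical judge scores with human pass/fail labels.
judge_scores = [0.2, 0.47, 0.52, 0.63, 0.71, 0.88]
human_labels = [False, False, False, True, True, True]
best_threshold, best_f1 = sweep_thresholds(judge_scores, human_labels)
```

On this sample the sweep settles on 0.55, the lowest threshold that separates the three human-approved responses from the rest. Rerunning it whenever the judge model or rubric changes is what keeps the pass/fail line tied to human labels.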

This is the self-improving loop the FAGI stack ships end-to-end — generate, simulate, evaluate, optimize, loop — with G-Eval as the rubric primitive every layer scores against. Competitors ship the parts. FAGI ships the loop.

Production checklist for picking the right primitive

Five questions before any new rubric:

1. Is the criterion subjective? If yes, G-Eval. If no, parser or classifier.
2. Is there a gold reference? If yes, lexical or embedding metric. If no, judge.
3. Is the rubric your internal document? If yes, G-Eval with the document in-prompt.
4. Do multiple axes share context? If yes, one multi-criterion judge. If no, separate judges aggregated downstream.
5. Are you grading reasoning or final answer? Reasoning → G-Eval on the trace. Answer → whichever primitive fits the answer shape.
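
The checklist can be written down as a small routing function. This is one possible encoding: the precedence of the checks (reasoning first, then gold reference, then subjectivity) is an assumption, and `pick_primitive` is a hypothetical helper covering three of the five questions.

```python
def pick_primitive(subjective: bool, has_gold_reference: bool,
                   grades_reasoning: bool = False) -> str:
    """Map checklist answers to the cheapest primitive that fits."""
    if grades_reasoning:
        return "judge on the reasoning trace"
    if has_gold_reference:
        return "lexical or embedding metric"
    if subjective:
        return "judge (G-Eval rubric)"
    return "parser or classifier"


tone_check = pick_primitive(subjective=True, has_gold_reference=False)
closed_qa = pick_primitive(subjective=False, has_gold_reference=True)
json_check = pick_primitive(subjective=False, has_gold_reference=False)
trace_check = pick_primitive(subjective=True, has_gold_reference=True,
                             grades_reasoning=True)
```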

The five use cases above earn the cost of an LLM-as-judge call in production. Most other rubrics earn the cost of a classifier or a deterministic check. The LLM evaluation playbook is the broader six-layer framework these sit inside; the G-Eval definitive guide is the method explainer.

Frequently asked questions

What are the five G-Eval use cases that actually earn the cost?

Five categories: subjective-rubric scoring (summarization quality, tone, helpfulness), faithfulness on free-form text where there's no gold reference to diff against, custom-domain rubrics where you grade against an internal policy or style guide, multi-criterion weighted scoring where four or five axes roll up into one number, and reasoning-step evaluation where the rubric grades the chain of thought rather than the final answer. Each shares the property that a regex, a parser, an embedding similarity, or a fine-tuned classifier cannot capture the criterion. Outside these five, the cheaper primitive wins on cost, latency, and consistency.

Why does subjective-rubric scoring need G-Eval rather than embedding similarity?

Embedding similarity measures semantic closeness to a reference. Subjective rubrics measure whether the output matches a property like 'warm tone,' 'concise,' or 'cites three sources.' Two summaries can have identical embedding similarity to a gold answer and very different rubric scores because the rubric is grading a property of the output, not its distance to a reference. G-Eval handles this by turning the property into a paragraph of English and letting a judge model reason about whether the output exhibits it. Embedding similarity is for paraphrase detection; G-Eval is for property scoring.

When is G-Eval the wrong tool, and what should I use instead?

Four cases. Toxicity, PII, prompt injection, bias, and jailbreaks have fine-tuned classifiers that beat G-Eval on cost, latency, and consistency. The Future AGI 18-scanner guardrail stack covers these dimensions at 50 to 70 ms p95. Schema validation, JSON parsing, regex matching, and length checks are parser jobs and run sub-millisecond. Lexical overlap against a gold answer is BLEU, ROUGE, or embedding similarity. High-volume continuous scoring on every span needs a classifier as the front line and G-Eval only as second-pass adjudication via the augment cascade. Pick the cheapest primitive that captures the criterion.

How does multi-criterion weighted scoring work in G-Eval?

Write one rubric with multiple axes, ask the judge to score each independently in a structured JSON output, and roll the per-axis scores up into one weighted aggregate. Future AGI's CustomLLMJudge supports this natively via the grading_criteria field plus a DefaultJudgeOutput parser that enforces per-axis scoring. The advantage over running four separate single-criterion judges is one judge call instead of four, plus the judge sees the axes together and avoids the discoordination that comes from siloed judgments. The disadvantage is that one rubric drift affects every axis, so version the rubric as a single contract.

What is reasoning-step evaluation and when do I need it?

Reasoning-step evaluation grades the chain of thought, not the final answer. The agent solves a math problem and gets the right number; the reasoning had a sign error that cancelled out. The agent answers a customer question correctly; the reasoning trace shows it pulled the answer from a competitor's documentation. Reasoning-step evaluation catches these cases. G-Eval is the right primitive because the rubric grades a paragraph of reasoning, which no classifier or parser can score. The Future AGI agent-trajectory metrics in fi.evals.metrics.agents wrap this pattern as reasoning_quality, goal_progress, and trajectory_score, all augment-capable so a local heuristic runs first and the LLM judge refines on edge cases.

How does FAGI CustomLLMJudge implement G-Eval as a production primitive?

CustomLLMJudge in the ai-evaluation SDK is the production-grade G-Eval: a Jinja2-templated metric with a grading_criteria config field, optional few_shot_examples, multi-modal input keys (image_url, audio_url, input_image_url), and a pluggable LLMProvider (LiteLLM-backed, so any of OpenAI, Anthropic, Gemini, Bedrock, Ollama works). Output is a MetricResult with score, passed, reason, and details. The same rubric ports to traceAI's EvalTag for server-side post-export scoring without inline latency. Layer the classifier-first cascade via augment=True for production cost economics, and wire ThresholdCalibrator plus FeedbackRetriever to keep the rubric anchored to human labels over time.

How do I choose between writing five single-criterion judges versus one multi-criterion judge?

If the axes share context (same input, same reference document, same persona definition), write one multi-criterion judge — one judge call beats five and the judge reasons about the axes together. If the axes are independent (faithfulness on free-form text plus persona consistency across turns plus tool-call plausibility), write separate judges and aggregate their scores at the orchestration layer. The decision rule: shared context favors one judge; independent dimensions favor separate judges.

