Evaluation

What Is a Rubric in LLM Evaluation?

The written scoring criteria a judge model applies when grading an LLM output, including dimensions, score range, and anchor descriptions per level.

A rubric is the structured scoring guide a judge model uses to evaluate LLM output. It defines the dimension under evaluation, the score scale, and anchor descriptions that pin each score level to observable criteria. A good rubric reads like a marker’s grading sheet: “5 = factually correct and cites the source. 3 = partially correct, missing key detail. 1 = factually wrong or fabricated.” Without anchors, the judge interprets “1–5” however it wants per call, and scores become noise. With anchors, the same rubric reliably produces similar scores across runs.
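
To see the difference as strings rather than prose, here is a minimal sketch (plain Python, no framework assumed) of the same clarity dimension written first without and then with anchors:

# An unanchored rubric: each judge call decides for itself what 1-5 means.
UNANCHORED = "Rate the response's clarity from 1 to 5."

# An anchored rubric: every level is pinned to an observable criterion,
# so repeated runs land on similar scores for the same output.
ANCHORED = (
    "Rate the response's clarity from 1 to 5. "
    "5 = unambiguous, plain language a non-expert can follow. "
    "3 = technically correct but jargon-heavy or missing key detail. "
    "1 = ambiguous, with multiple conflicting interpretations possible."
)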

Why Rubrics Matter in Production

The rubric is the most decisive piece of an eval system, and the most under-invested in. Teams spend weeks picking a judge model and ten minutes writing the rubric, then wonder why eval scores don’t predict user complaints. The math is simple: a vague rubric fed to a strong judge produces vague scores; a precise rubric fed to a moderate judge produces precise scores. Investment in the rubric pays back faster than a model upgrade.

Concrete failures from bad rubrics:

  • A “tone evaluator” rubric that says only “rate professionalism 1–5” returns mostly 4s, so the team can’t distinguish a good response from a great one.
  • A “faithfulness” rubric that lumps together citation accuracy and stylistic fidelity produces scores that swing on writing style, masking actual hallucination.
  • A multi-dimension rubric (“rate clarity, helpfulness, and accuracy 1–5”) collapses three signals into one number, so a regression in helpfulness gets averaged out.

For 2026 agentic systems, rubrics need to handle trajectories and partial credit, not just final answers. A rubric for ReasoningQuality should score “did each reasoning step follow from the previous one” and “did the agent recover from a wrong tool call” — multi-step concepts a single sentence cannot capture. Open-source frameworks like Ragas ship terse rubrics for their headline metrics; for production, you almost always need to extend or override them.
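
As a sketch of what a trajectory-level rubric can look like, reusing the CustomEvaluation pattern from the code example later in this entry (the evaluator name and anchor wording here are illustrative, not shipped defaults):

from fi.evals import CustomEvaluation

# Illustrative trajectory rubric: partial credit for recovery behavior,
# not just the final answer.
reasoning_quality = CustomEvaluation(
    name="reasoning_quality_v1",
    rubric=(
        "Score 1-5 on the whole trajectory: "
        "5=every step follows from the previous one and any failed tool call "
        "is detected and recovered from; "
        "4=steps follow, but one recoverable detour is handled clumsily; "
        "3=final answer is correct but at least one step does not follow, or "
        "a failed tool call is silently ignored; "
        "1=steps contradict each other or the agent never recovers from a "
        "wrong tool call."
    ),
    judge_model="claude-sonnet-4",
)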

How FutureAGI Handles Rubrics

FutureAGI’s approach is to make rubrics first-class, versioned artifacts. CustomEvaluation accepts a rubric string (or a structured RubricSpec) and stores it under a name and version. Each evaluation run records which rubric version graded which trace, so you can A/B rubric updates the same way you A/B prompts. Built-in cloud-template evaluators (AnswerRelevancy, IsHelpful, Groundedness) ship with rubrics tuned and calibrated by FutureAGI; you can fork them, edit, and re-register as your own.

The fi.queues.AnnotationQueue closes the loop: humans label a sample, FutureAGI computes inter-rater agreement and judge-vs-human agreement at each rubric version, and you keep the rubric version that maximizes alignment.
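
Outside the platform, the judge-vs-human half of that comparison is easy to spot-check with scikit-learn, assuming you have exported paired labels for the same traces (the score lists below are made up for illustration):

from sklearn.metrics import cohen_kappa_score

# Paired scores for the same traces: human annotators vs. the judge model
# at one rubric version. Illustrative values only.
human_scores = [5, 3, 4, 2, 5, 3, 1, 4]
judge_scores = [5, 3, 3, 2, 4, 3, 1, 4]

# Quadratic weighting treats a 4-vs-5 disagreement as milder than 1-vs-5,
# which suits ordinal 1-5 rubric scales.
kappa = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")
print(f"judge-vs-human kappa: {kappa:.2f}")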

A real example: a legal-review team writing a contract_clarity rubric on a claude-sonnet-4 judge starts with a one-sentence rubric, ships it, and sees Cohen’s kappa against humans at 0.42 — unusable. They iterate: split into three sub-dimensions (term precision, ambiguity flags, jargon density), write 1–5 anchors for each, and stitch them with AggregatedMetric. Kappa climbs to 0.78. The Agent Command Center then routes any contract response with a sub-3 ambiguity-flag score through a fallback to a stronger model with a tighter system prompt. The whole improvement loop ran in two weeks because the rubric was a versioned, testable artifact.
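
A sketch of that decomposition, following the same CustomEvaluation pattern (sub-dimension anchors abbreviated; the aggregation is shown as a plain mean because AggregatedMetric's exact call signature isn't reproduced here):

from fi.evals import CustomEvaluation

JUDGE = "claude-sonnet-4"

# One rubric per sub-dimension instead of one collapsed clarity score.
SUB_RUBRICS = {
    "term_precision": "Score 1-5: 5=defined terms used consistently; 1=terms conflated or redefined.",
    "ambiguity_flags": "Score 1-5: 5=every ambiguous clause flagged and explained; 1=ambiguities missed.",
    "jargon_density": "Score 1-5: 5=plain language throughout; 1=dense legalese with no paraphrase.",
}

sub_evals = {
    dim: CustomEvaluation(name=f"contract_{dim}_v1", rubric=rubric, judge_model=JUDGE)
    for dim, rubric in SUB_RUBRICS.items()
}

# Illustrative aggregation only: the team in the example stitches these
# together with AggregatedMetric rather than a hand-rolled mean.
def contract_clarity(sub_scores: dict[str, float]) -> float:
    return sum(sub_scores.values()) / len(sub_scores)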

How to Measure or Detect Rubric Quality

Rubric quality is measurable:

  • Inter-rater agreement among humans applying the rubric: if humans disagree wildly, the rubric is the problem, not the judge. Target Cohen’s kappa ≥0.7 between human pairs.
  • Judge-vs-human agreement: track per-rubric, per-judge-model. If it falls below 0.7, the rubric is unclear or the judge can’t follow it.
  • Score distribution shape: a healthy rubric produces a spread, not a flat line stuck at one score (a quick distribution check follows the code example below).
  • Reason-field coherence: when judges produce nonsense reasons, the rubric is asking too much.
  • Rubric drift: re-bench the rubric against a frozen labeled set quarterly; tweak as model behavior shifts.

Minimal Python:

from fi.evals import CustomEvaluation

# Register a versioned, rubric-backed evaluator; the _v3 suffix keeps scores
# comparable across rubric revisions.
clarity = CustomEvaluation(
    name="contract_clarity_v3",
    rubric=(
        "Score 1-5: 5=unambiguous, plain language; 3=technically correct "
        "but jargon-heavy; 1=ambiguous, multiple interpretations possible."
    ),
    judge_model="claude-sonnet-4",
)
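
And the score-distribution check from the list above, assuming you can pull recent judge scores for a rubric version as a plain list (the loading step is omitted):

from collections import Counter

def distribution_report(scores: list[int]) -> None:
    """Warn when a rubric's scores collapse onto a single level."""
    counts = Counter(scores)
    mode_score, mode_count = counts.most_common(1)[0]
    mode_share = mode_count / len(scores)
    print(f"distribution: {dict(sorted(counts.items()))}")
    if mode_share > 0.8:
        print(f"warning: {mode_share:.0%} of scores are {mode_score}; "
              "the rubric may not be discriminating (the 'mostly 4s' failure above)")

distribution_report([4, 4, 5, 4, 4, 4, 4, 4, 4, 4])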

Common Mistakes

  • No anchor descriptions. A 1–5 scale without per-level anchors is a vibe check, not a rubric.
  • Multi-dimension rubric collapsed into one score. Use one rubric per dimension and aggregate explicitly.
  • Rubric written by one person, never reviewed. Have a second human apply it cold; if they disagree, the rubric is unclear.
  • Updating the rubric without versioning. You’ll lose the ability to compare scores across releases.
  • Rubric written in passive voice or marketing tone. Judges follow concrete observable criteria; “the response should be excellent” is unscorable.

Frequently Asked Questions

What is a rubric in LLM evaluation?

A rubric is the structured, written scoring criteria a judge model applies — the dimensions being scored, the score scale, and the anchor descriptions for each score level. It makes subjective evaluation reproducible.

How is a rubric different from a metric?

A rubric is the prose definition of how to score; a metric is the resulting number. The same metric (e.g. 1-5 helpfulness) can be backed by very different rubrics, which produce very different scores. The rubric is the source of truth.

How do you write a good rubric?

Spell out anchor descriptions for each score level; restrict to one dimension per rubric; calibrate against 50-200 human-labeled examples. FutureAGI's CustomEvaluation accepts a rubric string and stores it as a versioned evaluator.