Evaluation

What Are Score Models?

Evaluation models trained or prompted to assign a numerical or categorical score to another model's output, approximating human judgment at scale.

Score models are evaluation models trained or prompted to assign a numerical or categorical score to another model’s output. In LLM evaluation, they cover three families: judge models (LLMs prompted with a rubric), reward models (trained on preference data, often inherited from an RLHF/RLAIF stack), and learned metrics like BLEURT, COMET, and GPTScore. They run inside eval pipelines to grade quality, safety, and task fit when no canonical reference answer exists. FutureAGI exposes score models through fi.evals evaluators (AnswerRelevancy, Groundedness, Faithfulness, TaskCompletion, CustomEvaluation), each returning a score, label, and reason that is attached to the originating trace.

In May 2026, the practical question is not “should I use score models?” (every serious eval stack uses them) but “which model is the judge, how do I know it agrees with humans, and how do I keep it from drifting when the judge model itself ships an update?”

Why score models matter in production LLM and agent systems

Open-ended generation has no single right answer. A model summarises a meeting transcript five different ways, all useful, none matching a fixed string. Reference-based metrics like BLEU and ROUGE collapse to noise on chat, agent trajectories, RAG answers, and creative tasks. Without a way to score outputs, teams default to spot-checks and gut feel, which scales poorly and hides regressions.
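To make the failure of reference-based metrics concrete, here is a minimal sketch of a ROUGE-1-style unigram F1. It is a simplified stand-in, not the official ROUGE implementation, and the example sentences are invented for illustration. A correct paraphrase scores low while a near-copy with the wrong day scores high.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences.
reference = "the team agreed to ship the release on friday"
paraphrase = "everyone signed off on a friday launch"         # correct, reworded
near_copy = "the team agreed to ship the release on monday"   # wrong day

paraphrase_score = unigram_f1(paraphrase, reference)  # 0.25
near_copy_score = unigram_f1(near_copy, reference)    # about 0.89
print(f"paraphrase={paraphrase_score:.2f} near_copy={near_copy_score:.2f}")
```

The overlap metric rewards surface similarity rather than correctness, which is the gap a score model is meant to close.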

Engineers feel this when they cannot quantify a prompt change. They tweak the system prompt, run a few examples, declare it better, and a week later see complaints they cannot tie to the change. SREs see no signal in dashboards because there is no metric to chart. Compliance leads cannot show that “answer quality was measured”; they can only show that outputs were generated. Product teams choose between competing prompt versions without a number to back the call.

In 2026 multi-agent stacks, the problem multiplies. A planner picks a tool, the tool returns context, the synthesis model writes the response, and a critique pass adjusts it. Every step needs a score model: tool-selection accuracy, context relevance, response groundedness, critique improvement. Production symptoms worth tracking include a rising eval-fail-rate-by-cohort despite no obvious code change, score divergence between two judge models on the same outputs, and reviewer disagreement when audited samples are spot-checked.

The 2026 wrinkle: frontier judge models ship updates every few weeks. The same AnswerRelevancy evaluator backed by Claude Opus 4.7 in March returns subtly different scores when Anthropic ships a snapshot revision in April. On stable judging benchmarks like MT-Bench, Chatbot Arena, and FaithBench, even a minor judge-model snapshot update can shift mean scores by 0.05-0.12, enough to flip a release gate on a 1,000-row golden cohort. Score-model versioning is now a release-engineering concern, not a one-time pick.

How FutureAGI handles score models

FutureAGI’s approach is to make score models first-class, configurable, and traceable. AnswerRelevancy is a score model that grades how well a response addresses the query. Groundedness scores whether the response is anchored in retrieved context. Faithfulness, ContextRelevance, ContextPrecision, and TaskCompletion are evaluator templates that wrap judge prompts. For domain-specific scoring, CustomEvaluation lets a team write a judge prompt as a callable evaluator with a returned score, label, and reason.

A worked example. A sales-email-drafting agent generates outbound copy. The team builds a Dataset of 1,000 prompts and writes a CustomEvaluation with a 5-point rubric (clarity, specificity, tone, CTA, factual accuracy). They also attach AnswerRelevancy for query-fit and a length rubric for conciseness. Dataset.add_evaluation runs the score models on every row and stores the per-rubric breakdown. The release gate requires rubric ≥ 4.0, AnswerRelevancy ≥ 0.85, and conciseness pass rate ≥ 90%.
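The release gate in this example can be sketched in plain Python. This is an illustration rather than FutureAGI's API: `RowScores` and `release_gate` are hypothetical names, the four rows are made-up data, and the sketch assumes the rubric and AnswerRelevancy thresholds apply to cohort means, which the example above does not specify.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RowScores:
    rubric: float      # 1-5 score from the custom rubric judge
    relevancy: float   # 0-1 AnswerRelevancy score
    concise: bool      # pass/fail from the length rubric


def release_gate(rows, min_rubric=4.0, min_relevancy=0.85, min_concise_rate=0.90):
    """Return (passed, summary) for a scored cohort.

    Assumption: score thresholds are checked against cohort means.
    """
    summary = {
        "rubric_mean": mean(r.rubric for r in rows),
        "relevancy_mean": mean(r.relevancy for r in rows),
        "concise_rate": sum(r.concise for r in rows) / len(rows),
    }
    passed = (
        summary["rubric_mean"] >= min_rubric
        and summary["relevancy_mean"] >= min_relevancy
        and summary["concise_rate"] >= min_concise_rate
    )
    return passed, summary


# Made-up scores for four rows.
rows = [
    RowScores(4.5, 0.90, True),
    RowScores(4.0, 0.88, True),
    RowScores(3.8, 0.84, False),
    RowScores(4.4, 0.91, True),
]
passed, summary = release_gate(rows)
print(passed, summary)
```

Here the rubric and relevancy means clear their thresholds, but only 3 of 4 rows pass the conciseness check, so the gate fails. Returning the summary alongside the verdict keeps the reason for a blocked release visible.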

In production, the same evaluators run against sampled traces from traceAI-openai-agents. To control judge bias, the team pins the score model to a different family from the generator: a Claude Opus 4.7 judge for an OpenAI GPT-5.x-generated response, for example. FutureAGI treats score-model agreement as a metric in its own right: when two judges disagree above a threshold, the disagreement is logged and escalated to a jury-of-models ensemble. Unlike a Ragas-only approach that hard-codes a single judge, FutureAGI lets the team swap judges, run juries, and version the rubric so scoring stays auditable.
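The disagreement check can be sketched as a per-row comparison of two judges' scores. The function name, the 0.2 threshold, and the scores are assumptions chosen for illustration, not FutureAGI defaults.

```python
def flag_disagreements(scores_a, scores_b, threshold=0.2):
    """Return indices of rows where two judges differ by more than threshold."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both judges must score the same rows")
    return [
        i
        for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if abs(a - b) > threshold
    ]


# Made-up 0-1 scores from two judges on the same four outputs.
judge_a = [0.90, 0.40, 0.75, 0.95]
judge_b = [0.85, 0.80, 0.70, 0.60]

to_escalate = flag_disagreements(judge_a, judge_b)   # rows to send to a jury
disagreement_rate = len(to_escalate) / len(judge_a)  # chartable as a metric
print(to_escalate, disagreement_rate)
```

Rows 1 and 3 exceed the threshold and would be escalated. The disagreement rate itself is worth charting over time, since a rising rate can indicate rubric ambiguity or judge drift.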

Comparison of score-model families

| Family | What it is | Strength | Weakness |
|---|---|---|---|
| Judge LLM (rubric) | Frontier model + prose rubric | Flexible, works on open-ended tasks | Drifts with model updates; same-family bias |
| Reward model | Trained on preference data | Fast, cheap at inference | Brittle outside training distribution |
| Learned metric (BLEURT, COMET) | Fine-tuned scorer | Cheap, deterministic | Saturates fast; weak on chat/agents |
| NLI judge | Entailment classifier | Strong on factual contradictions | Limited to entailment/contradiction |
| Jury-of-models | Multiple judges, voted | Cuts single-judge bias | More expensive; needs agreement tracking |
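For the jury-of-models row, a minimal voting sketch looks like the following. The `jury_verdict` helper and the explicit "tie" outcome are illustrative choices; real juries may instead weight judges or average numeric scores.

```python
from collections import Counter


def jury_verdict(labels):
    """Return (majority label, share of judges behind it); ties are explicit."""
    if not labels:
        raise ValueError("at least one judge label is required")
    ranked = Counter(labels).most_common()
    top_label, top_votes = ranked[0]
    share = top_votes / len(labels)
    # If the two most common labels have equal votes, do not pick a winner.
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return "tie", share
    return top_label, share


verdict, agreement = jury_verdict(["pass", "pass", "fail"])
split_verdict, _ = jury_verdict(["pass", "fail"])
print(verdict, round(agreement, 2), split_verdict)
```

Returning the agreement share next to the verdict supports the "needs agreement tracking" point in the table: a 2-of-3 pass is weaker evidence than a unanimous one.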

How to measure score models

Treat score models like any production component: they need monitoring.

  • Per-evaluator score distribution: chart histogram drift week-over-week; a shift of ±0.05 in the mean is worth investigating, especially after a judge-model snapshot update.
  • Inter-judge agreement: when running a jury, log Cohen’s kappa or pairwise agreement; a value below 0.6 usually means the rubric is ambiguous.
  • Score-model latency p99: judge LLMs add latency; budget it separately from generator latency.
  • Cost per scored sample: judge-model spend can exceed generator spend on heavy eval cohorts.
  • Score-model regression on a frozen calibration set: re-run periodically on a fixed set with known scores to detect judge-model drift.
  • Judge-vs-human alignment: each quarter, sample 100 rows, have humans grade them against the same rubric, and compute kappa against the judge.
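
A minimal evaluator setup (the query and response are placeholder values):

from fi.evals import AnswerRelevancy, CustomEvaluation

# Placeholder inputs; in practice these come from a Dataset row or a sampled trace.
query = "Draft a follow-up email to a prospect who attended our webinar."
response = "Hi Sam, thanks for joining the webinar. Do you have 15 minutes on Thursday?"

# Built-in score model: how well the response addresses the query.
relevancy = AnswerRelevancy()

# Custom judge: the rubric is a prompt, so version it like code.
rubric = CustomEvaluation(
    name="sales_email_rubric",
    rubric="Score 1-5 on clarity, specificity, tone, CTA, factual accuracy.",
    judge_model="claude-opus-4.7",  # pin an exact snapshot where the provider offers one
)

# Each call returns a score, label, and reason.
r = relevancy.evaluate(input=query, output=response)
c = rubric.evaluate(input=query, output=response)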

A score model without calibration drifts; pin the judge model snapshot, version the rubric, and recheck on a known set.
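Cohen's kappa, used above for inter-judge agreement and judge-vs-human alignment, can be computed in a few lines. This is a minimal sketch for categorical labels with made-up data; for production use, an established implementation such as scikit-learn's `cohen_kappa_score` is likely the safer choice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of rows")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for ten sampled rows.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "fail"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

In this sample the raters agree on 8 of 10 rows, yet kappa is about 0.58 because chance agreement alone would be 0.52. Raw agreement overstates alignment when labels are imbalanced, which is why the checklist asks for kappa rather than a plain match rate.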

Common mistakes

  • Using one judge model for everything. Pin different judges per evaluator family and validate inter-judge agreement.
  • Self-evaluation with the same model family. A GPT-5.x judge inflates GPT-5.x generator scores; cross-family judges are cleaner.
  • No rubric versioning. A judge prompt is code: version it, diff it, and attribute regressions to changes.
  • Ignoring score-model cost. Heavy eval cohorts can run judge-model spend higher than the generator; budget and sample.
  • Trusting one number. Use a portfolio of evaluators plus periodic human spot-checks to validate the score models themselves.
  • Not pinning judge-model snapshots. “Claude Sonnet 4.6” can refer to two different revisions a month apart. Pin the snapshot or accept drift.
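One way to catch unpinned-snapshot drift is to re-score a frozen calibration set and compare means across runs. The sketch below is illustrative: `calibration_drift` is a hypothetical helper, the scores are made up, and the 0.05 tolerance mirrors the ±0.05 guideline from the monitoring list.

```python
from statistics import mean


def calibration_drift(baseline, current, tolerance=0.05):
    """Compare judge scores on the same frozen rows across two runs."""
    if len(baseline) != len(current):
        raise ValueError("both runs must score the same calibration rows")
    shift = mean(current) - mean(baseline)
    return {"mean_shift": shift, "drifted": abs(shift) > tolerance}


# Made-up scores: the same five calibration rows before and after a judge update.
baseline = [0.90, 0.80, 0.70, 0.85, 0.95]
current = [0.82, 0.74, 0.66, 0.78, 0.90]

report = calibration_drift(baseline, current)
print(report)
```

The mean drops from 0.84 to 0.78, a shift of -0.06, so the check flags drift. Because the rows and rubric are frozen, a flagged shift points at the judge model rather than at the system under test.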

Frequently Asked Questions

What are score models?

Score models are evaluation models trained or prompted to assign numerical or categorical scores to another model's output. They include judge models, reward models, and learned metrics like BLEURT and COMET.

How are score models different from reference-based metrics?

Reference-based metrics like BLEU compare output to a fixed gold answer. Score models predict quality directly, often without a reference, by using a learned or prompted scoring function, which makes them useful for open-ended generation.

How do you use score models in production?

FutureAGI exposes score models through fi.evals evaluators like AnswerRelevancy, Groundedness, and CustomEvaluation. They run as part of regression evals on a Dataset and on sampled production traces.