How are score models different from reference-based metrics?

Reference-based metrics like BLEU compare output to a fixed gold answer. Score models predict quality directly, often without a reference, by using a learned or prompted scoring function — useful for open-ended generation.

How do you use score models in production?

FutureAGI exposes score models through fi.evals evaluators like AnswerRelevancy, Groundedness, and CustomEvaluation. They run as part of regression evals on a Dataset and on sampled production traces.

What Are Score Models? Definition & FutureAGI Guide (2026)

What Are Score Models?

Score models are evaluation models trained or prompted to assign a numerical or categorical score to another model’s output. In LLM evaluation, they include judge models (LLMs prompted to grade with a rubric), reward models (trained on preference data), and learned metrics like BLEURT, COMET, and GPTScore. They run inside eval pipelines to grade quality, safety, and task fit when no canonical reference answer exists. FutureAGI exposes score models through fi.evals evaluators — AnswerRelevancy, Groundedness, CustomEvaluation — each returning a score, label, and reason.

Why It Matters in Production LLM and Agent Systems

Open-ended generation has no single right answer. A model can summarise a meeting transcript five different ways, all useful, none matching a fixed string. Reference-based metrics like BLEU and ROUGE collapse to noise on chat, agent trajectories, RAG answers, and creative tasks. Without a way to score outputs, teams default to spot-checks and gut feel, which scales poorly and hides regressions.
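To make the failure mode concrete, here is a minimal sketch of a token-overlap score. The `unigram_f1` function is a simplified stand-in for ROUGE-1 written for illustration, not the official ROUGE implementation, and the example sentences are invented. It shows a faithful paraphrase scoring far below a verbatim copy even though both convey the same decision.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a simplified stand-in for ROUGE-1 (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences: same meaning, almost no shared tokens.
reference = "the team agreed to ship the billing fix on friday"
paraphrase = "engineering will release the invoicing patch at the end of the week"
verbatim = "the team agreed to ship the billing fix on friday"

paraphrase_score = unigram_f1(paraphrase, reference)
verbatim_score = unigram_f1(verbatim, reference)
print(f"paraphrase={paraphrase_score:.2f} verbatim={verbatim_score:.2f}")
```

The paraphrase is penalised purely for word choice, which is the gap a score model is meant to close.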

Engineers feel this when they cannot quantify a prompt change. They tweak the system prompt, run a few examples, declare it better, and a week later see complaints they cannot tie to the change. SREs see no signal in dashboards because there is no metric to chart. Compliance leads cannot show that “answer quality was measured”; they can only show that outputs were generated. Product teams choose between competing prompt versions without a number to back the call.

In 2026-era multi-agent stacks, the problem multiplies. A planner picks a tool, the tool returns context, the synthesis model writes the response, and a critique pass adjusts it. Every step needs its own score model: tool-selection accuracy, context relevance, response groundedness, critique improvement. Telltale production symptoms include a rising eval-fail-rate-by-cohort despite no obvious code change, score divergence between two judge models on the same outputs, and reviewer disagreement when audited samples are spot-checked.

How FutureAGI Handles Score Models

FutureAGI’s approach is to make score models first-class, configurable, and traceable. AnswerRelevancy is a local-metric score model that grades how well a response addresses the query. Groundedness scores whether the response is anchored in retrieved context. Faithfulness, ContextRelevance, Coherence, Completeness, and IsHelpful are cloud-template evaluators that wrap judge prompts. For domain-specific scoring, CustomEvaluation lets a team package a judge prompt as a callable evaluator that returns a score, label, and reason.

A worked example: a sales-email-drafting agent generates outbound copy. The team builds a Dataset of 1,000 prompts and writes a CustomEvaluation whose rubric scores each email from 1 to 5 on five criteria (clarity, specificity, tone, CTA, factual accuracy). They also attach AnswerRelevancy for query fit and IsConcise for length. Dataset.add_evaluation runs the score models on every row and stores the per-criterion breakdown. The release gate requires a rubric score ≥ 4.0, AnswerRelevancy ≥ 0.85, and an IsConcise pass rate ≥ 90%.
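The release gate in this example can be sketched in plain Python. This is not FutureAGI's API: the row dictionaries, their keys, and the `passes_release_gate` helper are hypothetical, and aggregating by the mean across rows is an assumption, since the example does not say how per-row scores are combined. Only the three thresholds come from the text.

```python
from statistics import mean

# Thresholds taken from the worked example.
RUBRIC_MIN = 4.0
RELEVANCY_MIN = 0.85
CONCISE_PASS_RATE_MIN = 0.90


def passes_release_gate(rows: list[dict]) -> bool:
    """Return True when aggregate scores clear every threshold.

    Each row is a hypothetical dict with keys 'rubric' (1-5 float),
    'relevancy' (0-1 float), and 'concise' (bool).
    Assumption: rubric and relevancy are aggregated with a simple mean.
    """
    if not rows:
        return False  # an empty eval run should never open the gate
    rubric_avg = mean(r["rubric"] for r in rows)
    relevancy_avg = mean(r["relevancy"] for r in rows)
    concise_rate = sum(r["concise"] for r in rows) / len(rows)
    return (
        rubric_avg >= RUBRIC_MIN
        and relevancy_avg >= RELEVANCY_MIN
        and concise_rate >= CONCISE_PASS_RATE_MIN
    )


# Invented scores for illustration.
good_run = [{"rubric": 4.5, "relevancy": 0.92, "concise": True} for _ in range(10)]
weak_run = [{"rubric": 3.5, "relevancy": 0.92, "concise": True} for _ in range(10)]

print(passes_release_gate(good_run), passes_release_gate(weak_run))
```

A gate like this would typically run in CI after the evaluation step, so that a candidate which misses any single threshold blocks the release.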

In production, the same evaluators run against sampled traces from traceAI-openai-agents. To control judge bias, the team pins the score model to a different family from the generator, for example a Claude-family judge for an OpenAI-generated response. FutureAGI treats score-model agreement as a metric in its own right: when two judges disagree by more than a set threshold, the disagreement is logged and escalated to a jury-of-models ensemble. Unlike a Ragas-only approach that hard-codes a single judge, FutureAGI lets the team swap judges, run juries, and version the rubric so scoring stays auditable.
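The disagreement check described here can be sketched independently of any SDK. The `flag_disagreements` helper, the 0.2 threshold, and the score lists below are all hypothetical; the sketch assumes both judges return scores on the same 0 to 1 scale for the same ordered set of outputs.

```python
def flag_disagreements(scores_a, scores_b, threshold=0.2):
    """Return indices where two judges' scores differ by more than `threshold`.

    Assumes both lists are aligned (same outputs, same order) and on the same scale.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("judges must score the same set of outputs")
    return [
        i
        for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if abs(a - b) > threshold
    ]


# Invented scores from two judges of different model families.
judge_a = [0.90, 0.40, 0.80, 0.70]
judge_b = [0.85, 0.90, 0.75, 0.20]

flagged = flag_disagreements(judge_a, judge_b)
print(flagged)  # indices that would be escalated to a jury or a human reviewer
```

The fraction of flagged items over time is itself worth charting, since a sudden rise can point to a rubric change or a judge-model update.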

How to Measure or Detect It

Treat score models like any other production component and monitor them:

Per-evaluator score distribution: chart histogram drift week over week; a shift of ±0.05 in the mean score is worth investigating.
Inter-judge agreement: when running a jury of models, log Cohen’s kappa or pairwise agreement; a kappa below 0.6 suggests the rubric is ambiguous.
Score-model latency p99: judge LLMs add latency; budget for it separately from generator latency.
Cost per scored sample: judge-model spend can exceed generator spend on heavy eval cohorts.
Score-model regression on a frozen calibration set: re-run the judge periodically on a fixed set with known scores to detect judge-model drift.
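For the inter-judge agreement signal, Cohen's kappa can be computed directly from two judges' categorical labels. The function below is a minimal standard-library sketch with invented labels; a library routine such as scikit-learn's `cohen_kappa_score` would be a reasonable alternative in practice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each judge's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented pass/fail labels from two judges on the same eight outputs.
judge_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa={kappa:.2f}")
```

Kappa corrects raw agreement for chance, so two judges that agree on 75% of items can still land well under the 0.6 guideline when one label dominates.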

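from fi.evals import AnswerRelevancy, CustomEvaluation

# Example inputs (placeholders); replace with a real query and model response.
query = "Draft a follow-up email for a prospect who attended our demo."
response = "Hi Sam, thanks for joining the demo yesterday. Could we book 20 minutes this week?"

# Built-in score model: how well the response addresses the query.
relevancy = AnswerRelevancy()

# Custom judge: the rubric prompt is the scoring function, so version it like code.
rubric = CustomEvaluation(
    name="sales_email_rubric",
    prompt="Score 1-5 on clarity, specificity, tone, CTA, factual accuracy: {output}",
)

# Each evaluator returns a score, label, and reason.
r = relevancy.evaluate(input=query, output=response)
c = rubric.evaluate(input=query, output=response)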

A score model without calibration drifts; pin it, version it, and recheck on a known set.
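A calibration recheck can be as simple as comparing a fresh judge run against the frozen scores. The sketch below is an assumption about how such a check might look, not a FutureAGI feature: the helper names and example scores are hypothetical, and the 0.05 tolerance reuses the mean-shift guideline from the list above.

```python
from statistics import mean

MEAN_SHIFT_TOLERANCE = 0.05  # guideline from the monitoring list; tune per evaluator


def mean_absolute_error(known_scores, current_scores):
    """Average per-item gap between frozen reference scores and a fresh judge run."""
    if len(known_scores) != len(current_scores) or not known_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    return mean(abs(k - c) for k, c in zip(known_scores, current_scores))


def mean_shift(known_scores, current_scores):
    """Signed change in the mean score; negative means the judge got harsher."""
    return mean(current_scores) - mean(known_scores)


# Invented calibration data: frozen scores versus this week's judge run.
known = [0.9, 0.2, 0.7, 0.5]
current = [0.8, 0.1, 0.6, 0.4]

mae = mean_absolute_error(known, current)
shift = mean_shift(known, current)
needs_review = abs(shift) > MEAN_SHIFT_TOLERANCE
print(f"mae={mae:.3f} shift={shift:+.3f} needs_review={needs_review}")
```

Tracking both numbers helps separate a uniform bias (large shift) from noisy scoring (large error with little shift).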

Common Mistakes

Using one judge model for everything. Pin different judges per evaluator family and validate inter-judge agreement.
Self-evaluation with the same model family. A GPT-4 judge tends to inflate scores for GPT-4-generated outputs; cross-family judges are less prone to this bias.
No rubric versioning. A judge prompt is code — version it, diff it, attribute regressions to changes.
Ignoring score-model cost. Heavy eval cohorts can run judge-model spend higher than the generator; budget and sample.
Trusting one number. Use a portfolio of evaluators plus periodic human spot-checks to validate the score models themselves.
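One way to act on the cost point is deterministic trace sampling, sketched below. The `should_score` and `estimated_judge_cost` helpers, the trace ID format, and the prices are all hypothetical; the sketch assumes each trace has a stable string ID and that hashing it gives a roughly uniform spread.

```python
import hashlib


def should_score(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces for judge scoring."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


def estimated_judge_cost(n_traces: int, sample_rate: float, cost_per_sample: float) -> float:
    """Rough judge spend for a period, given a flat per-sample cost."""
    return n_traces * sample_rate * cost_per_sample


# Invented volumes and prices for illustration.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_score(t, 0.10)]
budget = estimated_judge_cost(n_traces=100_000, sample_rate=0.10, cost_per_sample=0.002)
print(len(sampled), f"{budget:.2f}")
```

Hashing the trace ID instead of calling a random generator means the same trace is always either in or out of the sample, which keeps reruns and audits reproducible.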