Evaluation

What Is a Custom Metric?

A product-specific evaluation metric that scores LLM or agent behavior not covered by built-in metrics or public benchmarks.

A custom metric is a product-specific evaluation score that measures LLM or agent behavior a built-in metric cannot express. It shows up in eval pipelines, golden-dataset runs, and sampled production traces whenever a team needs to score domain rules, private rubrics, workflow outcomes, or policy checks. In FutureAGI, custom metrics map to fi.evals.CustomEvaluation, which lets engineers turn a rule or judge prompt into a repeatable score with thresholds, alerts, and regression history.

Why Custom Metrics Matter in Production LLM and Agent Systems

Production failures rarely match public benchmark categories. A support agent may answer politely and stay grounded, yet still miss the required refund disclaimer. A healthcare assistant may cite the right policy, but omit the escalation path that compliance requires. A coding agent may complete the task, but violate an internal rule such as “never edit generated migration files.” Built-in metrics catch the general shape of quality; custom metrics catch the behavior your product contract actually promises.

Ignoring that gap creates false green releases. Developers see Groundedness and AnswerRelevancy passing, product sees users still complaining, and compliance cannot prove that a regulated clause was included. SREs see symptoms as noisy alerts: a spike in manual review, higher escalation rate for one cohort, eval-fail-rate concentrated on a route, or traces where the final answer looks acceptable but a required workflow step is missing.

The problem is sharper in 2026 agentic systems because the unit of quality is not just one answer. It is a trajectory: plan, retrieval, tool call, validation, final response. A custom metric can grade whether the right evidence was used, whether a tool sequence followed policy, or whether a handoff happened before the agent guessed. Compared with a fixed Ragas-style faithfulness score, a custom metric encodes the product-specific obligation that makes the system safe to ship.

How FutureAGI Handles Custom Metrics

FutureAGI’s approach is to treat custom metrics as versioned evaluators, not notebook-only scripts. The specific anchor is fi.evals.CustomEvaluation, the framework-eval surface for dynamically creating an evaluation from a builder or decorator. Engineers define the input fields, scoring logic, rubric, and threshold, then attach the evaluator to a Dataset through Dataset.add_evaluation().
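
A minimal sketch of that wiring. The CustomEvaluation arguments mirror the minimal example later on this page, and add_evaluation() is the attach point named above; the Dataset import path, constructor arguments, and the metric name itself are illustrative assumptions:

from fi.evals import CustomEvaluation
from fi.datasets import Dataset  # assumed import path; only Dataset.add_evaluation() is named above

# Define the evaluator from a name and a rubric (both illustrative, echoing the
# refund-disclaimer example earlier on this page).
refund_check = CustomEvaluation(
    name="refund_disclaimer_included",
    rubric="Pass only if the answer includes the required refund disclaimer.",
)

# Attach the evaluator to a dataset so every eval run scores each row against it.
dataset = Dataset("support_agent_golden")  # dataset name and constructor are illustrative
dataset.add_evaluation(refund_check)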

A real workflow: a benefits-support agent must tell users the required form, deadline, and escalation path for prior authorization. The team already runs Groundedness to check source support and ContextRelevance to check retrieval quality. They add a custom metric named prior_auth_completeness, backed by CustomEvaluation, that returns a score, label, and reason for those three required elements. Dataset rows include input, output, retrieved_context, expected_policy, cohort, and trace_id.
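
One golden-dataset row for that metric could look like the sketch below; the field names are the ones listed above, while the values are hypothetical:

# A hypothetical golden-dataset row; field names match the list above.
row = {
    "input": "Do I need prior authorization for an MRI, and what happens if it is denied?",
    "output": "File the prior authorization form before the deadline; if it is denied, escalate to the benefits team.",
    "retrieved_context": "Policy excerpt returned by retrieval for this question.",
    "expected_policy": "Required form, filing deadline, and escalation path for prior authorization.",
    "cohort": "state_ca",
    "trace_id": "trace-0001",
}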

The same metric can run against sampled production traces from traceAI-langchain. Trace fields such as llm.token_count.prompt and agent.trajectory.step help explain whether failures came from long prompts, missing retrieval, or a wrong tool step. If prior_auth_completeness drops below 0.90 on the state_ca cohort, the engineer blocks the prompt release, sends failed rows to annotation, and adds the metric to an AggregatedMetric release gate. For live traffic, the Agent Command Center can pair the score with a post-guardrail or model fallback policy, but the metric remains the auditable reason.
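
A sketch of that release gate as plain Python over per-row results, rather than any specific AggregatedMetric API; the 0.90 threshold and the state_ca cohort come from the workflow above, and the result structure here is an assumption:

from collections import defaultdict

# Illustrative per-row results; in practice they come from running
# prior_auth_completeness over the golden dataset or sampled traces.
results = [
    {"cohort": "state_ca", "score": 0.85, "trace_id": "t-01"},
    {"cohort": "state_ny", "score": 0.95, "trace_id": "t-02"},
]

def mean_score_by_cohort(rows):
    totals, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        totals[r["cohort"]] += r["score"]
        counts[r["cohort"]] += 1
    return {c: totals[c] / counts[c] for c in counts}

scores = mean_score_by_cohort(results)
if scores.get("state_ca", 1.0) < 0.90:
    # Block the prompt release and send the failing rows to annotation.
    failing = [r["trace_id"] for r in results
               if r["cohort"] == "state_ca" and r["score"] < 0.90]
    print("Release blocked for state_ca:", scores["state_ca"], failing)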

How to Measure or Detect a Custom Metric

Measure the metric and the metric quality separately:

  • fi.evals.CustomEvaluation result: track the returned score, label, and reason per row or trace.
  • Threshold pass rate: chart pass/fail share by dataset version, prompt version, model route, and user cohort.
  • Dashboard signal: alert on eval-fail-rate-by-cohort, not only the global average.
  • Trace fields: inspect llm.token_count.prompt, agent.trajectory.step, and trace_id when the score changes.
  • Human agreement: compare 50-200 reviewed examples against the metric before it gates releases (a short agreement check is sketched after the minimal example below).
  • User-feedback proxy: monitor thumbs-down rate, escalation rate, and manual-review rate after launch.

Minimal Python:

from fi.evals import CustomEvaluation

# Illustrative inputs; in practice these come from a dataset row or sampled trace.
question = "What do I need for prior authorization?"
answer = "Submit the required form before the deadline, and escalate if it is denied."
policy_doc = "Prior authorization requires the form, a filing deadline, and an escalation path."

# Define the evaluator from a name and a pass/fail rubric.
policy_fit = CustomEvaluation(
    name="prior_auth_completeness",
    rubric="Pass only if the answer states required form, deadline, and escalation path.",
)

# The result carries a score, a label, and a reason string per row.
result = policy_fit.evaluate(input=question, output=answer, context=policy_doc)
print(result.score, result.label, result.reason)
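
The human-agreement check from the measurement list above fits in a few lines as well. This sketch assumes each reviewed row pairs the metric's label with a human label; the data is illustrative:

# 50-200 reviewed rows, each pairing the metric's label with a reviewer's label.
reviewed = [
    {"metric_label": "pass", "human_label": "pass"},
    {"metric_label": "pass", "human_label": "fail"},
    {"metric_label": "fail", "human_label": "fail"},
]

agreement = sum(
    1 for r in reviewed if r["metric_label"] == r["human_label"]
) / len(reviewed)
print(f"Human agreement: {agreement:.2f} over {len(reviewed)} reviewed rows")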

Common Mistakes

  • Writing the metric after seeing the release result. Post-hoc metrics are shaped by the answer you wanted; define the gate before the experiment.
  • Mixing too many dimensions. “Quality” is not one metric. Split policy compliance, tone, grounding, and task completion, then aggregate explicitly (see the sketch after this list).
  • Skipping calibration. A custom metric needs human-labeled examples, score distribution review, and a threshold sweep before it blocks deploys.
  • Treating pass/fail as enough. Store the reason field and cohort metadata, or debugging becomes a row-by-row reread.
  • Running it only offline. Offline datasets catch known regressions; sampled production traces catch new failure modes.
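
A sketch of that explicit aggregation: each dimension keeps its own score, and the combined gate is a stated formula instead of one opaque “quality” number. The dimension names come from the list above; the weights and the hard-fail rule are illustrative choices:

# Illustrative per-dimension scores for one row, each from its own evaluator.
dimensions = {
    "policy_compliance": 1.0,
    "tone": 0.8,
    "grounding": 0.9,
    "task_completion": 1.0,
}

# Explicit aggregation: hard-fail on policy, weighted average for the rest.
weights = {"tone": 0.2, "grounding": 0.4, "task_completion": 0.4}
if dimensions["policy_compliance"] < 1.0:
    combined = 0.0  # policy violations are never averaged away
else:
    combined = sum(weights[k] * dimensions[k] for k in weights)
print(combined)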

Frequently Asked Questions

What is a custom metric?

A custom metric is a product-specific evaluation score for behavior built-in evaluators do not cover. In FutureAGI, `CustomEvaluation` turns private rubrics, rules, or workflow checks into repeatable evals.

How is a custom metric different from an evaluation metric?

An evaluation metric is any score returned by an evaluator. A custom metric is the subset you author for your domain, often when `Groundedness`, `JSONValidation`, or `TaskCompletion` are necessary but not enough.

How do you measure a custom metric?

Run `fi.evals.CustomEvaluation` on a golden dataset or sampled trace cohort, then track score distribution, pass rate, reason codes, and eval-fail-rate-by-cohort. Pair it with a threshold before using it as a release gate.