What Is a Safety Metric?

A safety metric is a numerical or categorical score that quantifies how well an AI system stays inside policy, harm, and risk boundaries. It turns vague claims like “we use safety filters” into auditable signals: violation rate, action-safety score per trajectory, PII match count, policy-compliance pass rate. Safety metrics run as eval gates before release and as continuous checks against production traces. FutureAGI exposes them through evaluators such as ContentSafety, ActionSafety, PII, and IsCompliant, each returning a score, a reason, and the matched categories.
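As a sketch of what "auditable signal" means in practice — assuming a list of per-response ContentSafety scores has already been collected; the numbers and the 0.5 cut-off are illustrative, not library defaults — the violation rate is just the fraction of scores that cross a fixed threshold:

# Hypothetical per-response ContentSafety scores from one eval run
# (0 = clearly safe, 1 = clearly unsafe); values are illustrative only.
scores = [0.02, 0.11, 0.87, 0.05, 0.63, 0.01]
THRESHOLD = 0.5  # policy-chosen cut-off, not a library default

violations = [s for s in scores if s >= THRESHOLD]
violation_rate = len(violations) / len(scores)

print(f"violation rate: {violation_rate:.1%}")  # 33.3% on this toy sample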

Why It Matters in Production LLM and Agent Systems

Without safety metrics, every safety claim is a story. A team says “we filter unsafe output” but cannot say what fraction of responses crossed the threshold last week, on which route, for which tenant. When an incident occurs, the postmortem starts from logs rather than evidence. When a model swap is proposed, the question “is the new model safer?” has no number to answer it.

Engineers feel this when they cannot quantify a regression — they see a complaint, scan logs, and guess. SREs cannot set alert thresholds because there is no series to threshold. Compliance leads cannot show a regulator that the policy ran on the relevant request; they can only show the policy document. Product teams cannot make trade-offs between under-refusal (incidents) and over-refusal (broken UX) without a precision-recall curve.

In 2026's multi-agent stacks, safety metrics matter even more. A planner, a retriever, a tool executor, a critique step, and a synthesis step each contribute to safety, and a single end-to-end “safe” label hides the failing step. Useful production symptoms — only visible if you have metrics — include rising eval-fail-rate-by-cohort, blocked-tool-call rate climbing on one route, PII match count spiking on a new prompt version, and IsCompliant pass rate dropping after a retriever update. These series are what convert “we feel something is off” into “the retriever changed and dropped compliance by 4 points on tenant X.”
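A minimal sketch of the slicing that produces a sentence like that — assuming eval verdicts have already been joined to trace metadata; the record fields and values here are illustrative:

from collections import defaultdict

# Illustrative records: one per evaluated trace, with cohort metadata attached.
results = [
    {"tenant": "X", "route": "support", "passed": False},
    {"tenant": "X", "route": "support", "passed": True},
    {"tenant": "Y", "route": "support", "passed": True},
]

fails, totals = defaultdict(int), defaultdict(int)
for r in results:
    cohort = (r["tenant"], r["route"])
    totals[cohort] += 1
    fails[cohort] += (not r["passed"])

for cohort, n in totals.items():
    print(cohort, f"fail rate {fails[cohort] / n:.1%}")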

How FutureAGI Handles Safety Metrics

FutureAGI’s approach is to expose every safety axis as a discrete evaluator with a well-defined score, then attach the score to a trace so it is auditable. ContentSafety returns a 0–1 score for unsafe-output likelihood plus matched categories (toxicity, bias, illegal advice). ActionSafety returns a 0–1 score per agent trajectory plus dangerous-action and sensitive-leak findings. PII returns a count of regulated-data matches and the categories detected. IsCompliant returns a pass/fail plus reason against a policy rubric.
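For illustration only — these are not the library's literal return types — the four axes might be recorded per trace as something like:

# Hypothetical per-trace record of the four safety axes; field names are
# illustrative, but the score semantics follow the evaluator descriptions above.
safety_record = {
    "content_safety": {"score": 0.07, "categories": []},              # 0-1, lower is safer
    "action_safety":  {"score": 0.12, "findings": []},                # 0-1 per trajectory
    "pii":            {"match_count": 0, "categories": []},           # count of regulated-data matches
    "is_compliant":   {"passed": True, "reason": "no advice given"},  # pass/fail against the rubric
}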

A worked example: a fintech support agent must answer plan questions without giving regulated investment advice or leaking another customer’s data. The team builds a Dataset of 1,200 examples and attaches ContentSafety, ActionSafety, PII, and IsCompliant. Each row gets four scores. The release gate requires ContentSafety pass rate ≥ 99.5%, zero severe ActionSafety findings, zero PII matches in tool arguments, and IsCompliant ≥ 98% on regulated-advice prompts.
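A sketch of that gate as plain Python — the thresholds are the ones stated above; the input lists are assumed to come from running the four evaluators over the 1,200-row Dataset, and the 0.5 pass cut-off for ContentSafety is an assumption for illustration:

def release_gate(content_scores, action_findings, pii_counts, compliant_passes):
    """Return True only if every safety gate from the example holds."""
    # Assumption: a ContentSafety score below 0.5 counts as a pass.
    content_pass_rate = sum(s < 0.5 for s in content_scores) / len(content_scores)
    # Assumption: compliant_passes covers only the regulated-advice prompts.
    compliant_rate = sum(compliant_passes) / len(compliant_passes)
    return (
        content_pass_rate >= 0.995                            # ContentSafety pass rate >= 99.5%
        and not any(f == "severe" for f in action_findings)   # zero severe ActionSafety findings
        and sum(pii_counts) == 0                              # zero PII matches in tool arguments
        and compliant_rate >= 0.98                            # IsCompliant >= 98%
    )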

In production, the same evaluators run against sampled traces from traceAI-langchain. The Agent Command Center applies a pre-guardrail for input sanitisation and a post-guardrail for PII redaction. When eval-fail-rate-by-cohort rises, the team uses the per-step scores to localise the failure to a planner step or a tool argument. Unlike a single Guardrails AI rule that fires once at output, FutureAGI’s safety metrics are step-level and chartable over time.
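Because the scores are step-level, localisation can be as simple as grouping fail counts by step name — a sketch over illustrative trace records, not the traceAI-langchain schema:

from collections import Counter

# Illustrative sampled traces: each step carries its own safety verdict.
steps = [
    {"trace": "t1", "step": "planner",   "passed": True},
    {"trace": "t1", "step": "tool_call", "passed": False},
    {"trace": "t2", "step": "tool_call", "passed": False},
]

fails = Counter(s["step"] for s in steps if not s["passed"])
print(fails.most_common(1))  # e.g. [('tool_call', 2)] -> start the investigation there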

How to Measure or Detect It

A safety metric is only useful when it has a threshold, an owner, and an alert. Common production patterns:

  • ContentSafety violation rate per 1,000 responses — alert when it crosses a baseline +2σ.
  • ActionSafety severe-finding count per day — alert on any non-zero severe finding.
  • PII match rate in tool arguments — alert on any match; a match here means a leak path.
  • IsCompliant pass rate by route and prompt version — track week-over-week to catch regressions early.
  • Guardrail block rate and fallback engagement — sudden changes correlate with upstream regressions.

A minimal wiring of two of these checks with fi.evals:

from fi.evals import ContentSafety, IsCompliant

# One evaluator per safety axis; each returns a score and a reason.
content = ContentSafety()
compliant = IsCompliant(policy="no-investment-advice-rubric")

# prompt and response are the request/response pair under evaluation (already in scope).
c = content.evaluate(output=response)
ic = compliant.evaluate(input=prompt, output=response)

Pair every metric with a metric-threshold so the alert is automatic, not a Slack ping you might miss.
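A sketch of the baseline-plus-2σ pattern from the first bullet above — assuming a trailing window of daily violation rates is already available; the numbers are illustrative:

import statistics

# Trailing daily ContentSafety violation rates per 1,000 responses (illustrative).
history = [1.2, 0.9, 1.4, 1.1, 1.3, 1.0, 1.2]
today = 2.8

baseline = statistics.mean(history)
sigma = statistics.stdev(history)

if today > baseline + 2 * sigma:
    # In production this would page the metric owner, not print.
    print(f"ALERT: violation rate {today} exceeds {baseline + 2 * sigma:.2f}")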

Common Mistakes

  • One safety number for the whole system. A scalar hides which axis (content, action, data, policy) failed. Track each separately.
  • No threshold. A metric that ships but never blocks a release or pages an engineer is decoration.
  • Aggregating across cohorts. Global means hide a per-tenant regression. Always slice by route, model, prompt version, tenant.
  • Self-evaluation without a separate judge. A judge model from the same family inflates safety scores; pin a different family or a deterministic check.
  • Confusing block rate with safety. A high guardrail block rate can mean attacks rose, model drifted, or the threshold was tightened — investigate, do not assume.

Frequently Asked Questions

What is a safety metric?

A safety metric is a numerical or categorical score that quantifies how well an AI system stays inside policy, harm, and risk boundaries. Examples include violation rate, action-safety score, and policy-compliance pass rate.

How is a safety metric different from an evaluation metric?

An evaluation metric is any score that measures model output quality. A safety metric is the subset focused on harm, policy, and risk — content safety, action safety, PII exposure, compliance — rather than correctness or relevance.

How do you compute a safety metric?

FutureAGI computes safety metrics via fi.evals evaluators. ContentSafety scores unsafe outputs, ActionSafety scores dangerous trajectories, PII counts regulated-data exposures, and IsCompliant runs a policy rubric. ContentSafety and ActionSafety return a 0–1 score, PII returns a match count, and IsCompliant returns a pass/fail — each with a reason.