What Is SafetyBench (Safety Benchmark)?
A multiple-choice safety benchmark scoring LLMs on recognizing unsafe content across seven harm categories in English and Chinese.
SafetyBench is a multiple-choice safety benchmark for large language models that scores how well a model recognizes unsafe content across seven categories: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy and property. It is published as a frozen test set in English and Chinese, and produces a per-category accuracy plus an aggregate safety score. Teams use SafetyBench during model selection, and FutureAGI extends those static results with production evaluators on live traces.
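To make the scoring concrete, here is a minimal sketch of how a team might re-weight the published per-category accuracies into a domain-specific aggregate. All accuracy values and weights below are hypothetical; substitute your own benchmark results and risk profile.

# Minimal sketch: re-weighting SafetyBench per-category accuracies into a
# domain-specific aggregate. Values and weights are hypothetical.
category_accuracy = {
    "offensiveness": 0.91,
    "unfairness_and_bias": 0.88,
    "physical_health": 0.93,
    "mental_health": 0.90,
    "illegal_activities": 0.95,
    "ethics_and_morality": 0.89,
    "privacy_and_property": 0.94,
}
weights = {category: 1.0 for category in category_accuracy}
weights["privacy_and_property"] = 2.0  # e.g., a fintech team up-weights privacy

aggregate = sum(
    category_accuracy[c] * weights[c] for c in category_accuracy
) / sum(weights.values())
print(f"Weighted aggregate: {aggregate:.3f}")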
Why It Matters in Production LLM and Agent Systems
A SafetyBench score on its own is necessary but never sufficient. A model can score 92% on a static multiple-choice set and still take a destructive tool action in production, because SafetyBench tests recognition of harm in a clean prompt — not behavior under retrieval, tool use, jailbreaks, or multi-turn pressure. Teams that ship based on benchmark scores alone discover this when an incident surfaces in week three.
Engineers feel this when a model that “passed safety” produces a failure on real input. SREs see guardrail-block rates rise after a model swap that looked fine on paper. Compliance leads cannot use SafetyBench results as evidence that the deployed system is safe; benchmarks score the model, not the system. Product teams notice that over-refusal (which SafetyBench rewards) breaks legitimate use cases.
In 2026, the gap between benchmark and production has widened because agents and tool use change the safety surface entirely. A planner agent can route around a model’s static safety knowledge by phrasing the unsafe step as an internal subtask. The retriever can inject malicious instructions that look like context. SafetyBench cannot measure any of this. Useful production symptoms include rising eval-fail-rate-by-cohort despite a high benchmark score, new dangerous-action patterns clustered on agentic routes, and PII matches in tool arguments that no benchmark would have caught.
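As one illustration of that last symptom, here is a minimal sketch of scanning agent tool-call arguments for PII. The regex patterns and the pii_matches helper are hypothetical and deliberately simple, standing in for whatever vetted PII detector your pipeline actually uses.

import re

# Illustrative patterns only; a real deployment needs a vetted PII detector.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_matches(tool_args: dict) -> list[tuple[str, str]]:
    """Return (argument_name, pii_type) pairs for every matching value."""
    hits = []
    for name, value in tool_args.items():
        for pii_type, pattern in PII_PATTERNS.items():
            if pattern.search(str(value)):
                hits.append((name, pii_type))
    return hits

print(pii_matches({"query": "forward this to jane.doe@example.com", "limit": 10}))
# -> [('query', 'email')]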
How FutureAGI Handles SafetyBench
FutureAGI’s approach is to treat SafetyBench as a model-selection input, then layer production evaluators that measure safety in deployed behavior. The closest mapping in FAGI is the combination of ContentSafety (unsafe-output detection), ActionSafety (risky agent trajectories), and IsCompliant (policy rubric pass/fail). These run on the actual prompts and traces your users generate, not a static multiple-choice set.
A practical workflow: a team chooses between three candidate models. They run SafetyBench externally and record per-category scores. Then they import the candidates into FutureAGI, build a Dataset of 1,500 production-style prompts (including jailbreaks, ambiguous requests, multi-turn pressure, and tool-call temptations), and run ContentSafety, ActionSafety, and IsCompliant on each. The Dataset.add_evaluation workflow stores results per row so candidates can be diffed evaluator-by-evaluator. The release gate combines a SafetyBench floor (e.g., aggregate ≥ 85) with FAGI thresholds (ContentSafety ≥ 99%, zero severe ActionSafety findings, IsCompliant ≥ 98%).
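A minimal sketch of that combined gate, assuming the scores have already been collected into a plain dictionary (the key names are illustrative, not a FutureAGI schema):

def release_gate(scores: dict) -> bool:
    """Pass only if the SafetyBench floor and every FAGI threshold hold."""
    return (
        scores["safetybench_aggregate"] >= 85.0           # benchmark floor
        and scores["content_safety_pass_rate"] >= 0.99    # ContentSafety >= 99%
        and scores["severe_action_safety_findings"] == 0  # zero severe findings
        and scores["is_compliant_pass_rate"] >= 0.98      # IsCompliant >= 98%
    )

candidate = {
    "safetybench_aggregate": 88.4,
    "content_safety_pass_rate": 0.994,
    "severe_action_safety_findings": 0,
    "is_compliant_pass_rate": 0.987,
}
print(release_gate(candidate))  # True: this candidate clears all four thresholds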
Once deployed, traceAI-openai-agents ingests live traces and the same evaluators run against sampled traffic. Unlike SafetyBench’s static recognition test, FutureAGI’s production evaluators score actual generations and trajectories, so the safety story keeps updating after launch. The next engineering action is concrete: alert, fallback, regression set, or guardrail tightening — based on per-trace evidence, not a leaderboard number.
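A minimal sketch of that per-trace triage. The evaluator names come from above, but the severity labels, record shape, and routing rules are purely illustrative assumptions:

def next_action(finding: dict) -> str:
    """Map one failed sampled trace to the next engineering action."""
    if finding["evaluator"] == "ActionSafety" and finding["severity"] == "severe":
        return "tighten guardrail"      # block the dangerous tool path now
    if finding["evaluator"] == "ContentSafety":
        return "alert"                  # page on-call with the offending trace
    if finding.get("recurring"):
        return "add to regression set"  # pin the trace as a permanent test case
    return "fallback"                   # route the affected cohort to a safer model

print(next_action({"evaluator": "ActionSafety", "severity": "severe"}))
# -> 'tighten guardrail'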
How to Measure or Detect It
Use SafetyBench results as a model-selection input, then track these signals in production:
- SafetyBench per-category accuracy — record the seven category scores; weight by the categories that matter for your domain (e.g., privacy and property for fintech).
- ContentSafety violation rate — runs on production responses; complements SafetyBench's recognition test with behavior.
- ActionSafety findings per trajectory — catches the agent failures SafetyBench cannot test.
- IsCompliant pass rate by route — translates your written policy into a measurable score.
- Eval-fail-rate-by-cohort — split by route, prompt version, and tenant to detect drift from the benchmark baseline.
from fi.evals import ContentSafety, ActionSafety, IsCompliant

# Instantiate the three production evaluators from the signal list above.
content = ContentSafety()
action = ActionSafety()
compliant = IsCompliant(policy="domain-policy-v3")

# Run all three against the same inputs (arguments elided here).
results = [e.evaluate(...) for e in (content, action, compliant)]
Track SafetyBench score and FAGI evaluator scores on the same dashboard so a model swap is judged on both.
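For the cohort signal in the list above, here is a minimal sketch of computing eval-fail-rate-by-cohort from evaluated traces. The trace-record shape (route, prompt_version, tenant, passed) is a local assumption, not a fixed FutureAGI schema.

from collections import defaultdict

def fail_rate_by_cohort(traces: list[dict]) -> dict:
    """Group evaluated traces by (route, prompt_version, tenant); return fail rates."""
    totals, fails = defaultdict(int), defaultdict(int)
    for trace in traces:
        cohort = (trace["route"], trace["prompt_version"], trace["tenant"])
        totals[cohort] += 1
        fails[cohort] += 0 if trace["passed"] else 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}

traces = [
    {"route": "/chat", "prompt_version": "v3", "tenant": "acme", "passed": True},
    {"route": "/chat", "prompt_version": "v3", "tenant": "acme", "passed": False},
]
print(fail_rate_by_cohort(traces))  # {('/chat', 'v3', 'acme'): 0.5}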
Common Mistakes
- Shipping on benchmark numbers alone. SafetyBench measures recognition in clean prompts; production needs behavior under jailbreaks, retrieval, and tool use.
- Optimizing for SafetyBench score. Models that game multiple-choice safety often over-refuse; measure refusal precision in your domain (see the sketch after this list).
- Ignoring the language split. Aggregate scores hide per-language gaps; if you serve multilingual traffic, score by language.
- Treating SafetyBench as a one-time check. Re-run on every model swap and prompt-template change, then validate on production traces.
- Confusing SafetyBench with agent safety. A high score does not predict whether an agent will take a destructive action.
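The refusal-precision sketch referenced in the second mistake: of everything the model refused, what fraction was genuinely unsafe? The record shape and labels below are assumptions, standing in for human review or a trusted evaluator.

def refusal_precision(records: list[dict]) -> float:
    """Of all refusals, the fraction that were genuinely unsafe requests."""
    refusals = [r for r in records if r["refused"]]
    if not refusals:
        return 1.0  # no refusals at all: vacuously precise
    justified = sum(1 for r in refusals if r["actually_unsafe"])
    return justified / len(refusals)

records = [
    {"refused": True, "actually_unsafe": True},    # correct refusal
    {"refused": True, "actually_unsafe": False},   # over-refusal
    {"refused": False, "actually_unsafe": False},  # normal answer, no refusal
]
print(refusal_precision(records))  # 0.5: half the refusals were justified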
Frequently Asked Questions
What is SafetyBench?
SafetyBench is a multiple-choice safety benchmark for LLMs that scores their ability to recognize unsafe content across seven categories, including illegal activities, ethics and morality, privacy and property, and unfairness and bias.
How is SafetyBench different from AgentHarm?
SafetyBench tests an LLM's safety understanding via multiple-choice questions on static prompts. AgentHarm tests whether an agent will actually carry out harmful tool-using tasks, scoring behavior in execution rather than recognition.
How do you use SafetyBench in production?
Use SafetyBench during model selection to compare candidates. Pair it with FutureAGI's ContentSafety, ActionSafety, and IsCompliant evaluators on production traces, since static benchmark scores don't capture deployed behavior.