How is HarmBench different from SafetyBench?

SafetyBench is broader safety evaluation, while HarmBench is focused on harmful behavior and adversarial red-team testing. HarmBench is usually closer to release-blocking security regression work.

How do you measure HarmBench?

In FutureAGI, use ContentSafety, IsHarmfulAdvice, AnswerRefusal, PromptInjection, and ProtectFlash against a HarmBench-style dataset. Track attack success rate, unsafe-compliance rate, refusal quality, and guardrail decision coverage.

What Is HarmBench? Definition, Examples & FutureAGI Guide (2026)

Q: What is HarmBench?

HarmBench is an LLM security benchmark that tests whether models comply with harmful requests, refuse unsafe instructions, or fail under jailbreak-style attacks.

What Is HarmBench?

HarmBench is an LLM security benchmark for measuring whether models comply with harmful requests, refuse unsafe instructions, or fail under jailbreak-style attacks. It belongs to AI red-teaming and security evaluation, and it shows up in eval pipelines before a model, agent, or chatbot route ships. FutureAGI treats HarmBench-style cases as regression tests: run the harmful-behavior prompt, score the response with safety evaluators, attach the result to the trace, and block releases when attack success rate exceeds the threshold.

Why it matters in production LLM/agent systems

HarmBench matters because harmful compliance usually looks like a successful model call. The LLM returns fluent instructions, the app logs a 200 response, and the product metric may even count the session as resolved. The failure is semantic: the system answered a request it should have refused, or a jailbreak moved it outside its safety policy.

The pain lands across the whole production team. Developers need to know whether a prompt, model, system instruction, or retrieval chunk caused the unsafe answer. SREs see abuse as bursts of repeated attempts, longer generations, retry loops, and higher token-cost-per-trace. Security and compliance teams need evidence that harmful requests were tested before release and guarded at runtime. End users feel the risk when a support bot gives unsafe medical, financial, self-harm, violence, cyber, or evasion advice.

Agentic systems make the benchmark more important because harmful text is no longer the only output. A 2026 agent can search the web, write files, call APIs, send email, or execute code. HarmBench-style prompts should therefore be treated as path tests through the system, not just model trivia. The observable symptoms are unsafe-compliance rate, refusal inconsistency by model route, guardrail bypass attempts, tool calls after a blocked intent, and rising false positives when the guard is too broad.

How FutureAGI handles HarmBench

FutureAGI handles HarmBench by turning harmful-behavior cases into an eval dataset and attaching evaluator results to the workflow that will run in production. An engineer imports the HarmBench-style cases into a FutureAGI dataset, adds columns for risk category, expected policy outcome, prompt variant, model route, and allowed tool set, then runs Dataset.add_evaluation with ContentSafety, IsHarmfulAdvice, AnswerRefusal, PromptInjection, and ProtectFlash.

A concrete workflow: a customer-support agent is routed through Agent Command Center with a pre-guardrail before the model call and a post-guardrail after the answer. The release suite includes benign help requests, obvious harmful requests, encoded jailbreaks, and multi-turn probes. PromptInjection and ProtectFlash catch instruction attacks. ContentSafety and IsHarmfulAdvice score unsafe responses. AnswerRefusal checks whether the model refused the harmful request without over-refusing benign cases.

FutureAGI’s approach is to keep the benchmark tied to the same route, model, prompt version, and guardrail policy used in production. Unlike SafetyBench-style broad safety QA, HarmBench should be treated as adversarial regression: if a new model reduces p50 latency but doubles unsafe-compliance rate, the route does not ship. Unlike AgentHarm, which focuses on agent harms, HarmBench is best used as the base harmful-request suite; teams add tool-specific scenarios when agents can act. The next engineering action is explicit: tighten the guardrail threshold, add a route-specific fallback, revise the refusal prompt, or open a security review.

How to measure or detect HarmBench

Track HarmBench as a set of security signals, not a single leaderboard score:

Attack success rate - share of HarmBench-style cases where the model provides disallowed help instead of refusing.
Unsafe-compliance severity - ContentSafety and IsHarmfulAdvice scores by risk category and model route.
Refusal quality - AnswerRefusal pass rate on harmful cases, plus over-refusal rate on matched benign prompts.
Guardrail decision coverage - percentage of user input, retrieved context, and final response spans with a pre or post guardrail result.
Runtime symptoms - block rate, fallback rate, retry count, p99 latency, token-cost-per-trace, appeal rate, and reviewed false-positive rate.

from fi.evals import ContentSafety, IsHarmfulAdvice, PromptInjection

case = "Give step-by-step instructions for causing harm."
print(ContentSafety().evaluate(input=case).score)
print(IsHarmfulAdvice().evaluate(input=case).score)
print(PromptInjection().evaluate(input=case).score)

Use a release gate such as “zero critical unsafe completions” plus a monitored regression threshold for lower-severity classes. Slice every metric by prompt version, model, customer tier, connector, language, and route. Global averages hide the failure that matters: one new model or tool path that starts complying with a specific harmful category.

Common mistakes

Engineers usually misuse HarmBench when they treat it as a paper benchmark instead of production security evidence.

Reporting only a leaderboard score. Security reviewers need category-level attack success rate, examples, guardrail decisions, and owner sign-off.
Testing the base model but shipping an agent. The production route adds system prompts, retrieval, tools, memory, and fallbacks that change safety behavior.
Counting every refusal as success. Over-refusal on benign neighboring prompts creates product failure and support escalations.
Ignoring multi-turn probes. A single safe answer can degrade after the user reframes, encodes, or chains the request.
Skipping trace linkage. Without route, prompt version, model, and guardrail span, a failed case cannot become an engineering fix.