What Is HarmBench?
An LLM safety benchmark for measuring harmful-request compliance, refusal behavior, jailbreak susceptibility, and red-team attack success.
HarmBench is an LLM security benchmark for measuring whether models comply with harmful requests, refuse unsafe instructions, or fail under jailbreak-style attacks. Released by the Center for AI Safety in 2024, the public test set has 510 adversarial behaviors across 7 categories (cybercrime, chemical/biological, illegal activities, misinformation, harassment, general harm, copyright); attack success rates on undefended frontier models in 2024 ranged from 20% to 60% depending on attack family. Comparable benchmarks worth running alongside it include Gray Swan’s AgentHarm (110 harmful agent tasks across 11 categories; ~30-50% completion when jailbroken), FutureAGI’s PHARE (multi-modal extension), and BeaverTails for refusal calibration. HarmBench belongs to AI red-teaming and security evaluation, and it shows up in eval pipelines before a model, agent, or chatbot route ships. FutureAGI treats HarmBench-style cases as regression tests: run the harmful-behavior prompt, score the response with safety evaluators, attach the result to the trace, and block releases when attack success rate exceeds the threshold. In our 2026 evals, every quarterly model upgrade (Claude Opus 4.7, GPT-5.x, Gemini 3 Pro, Llama 4) moved at least one HarmBench category, so the suite has to rerun on every provider promotion, not only at model selection.
Why it matters in production LLM/agent systems
HarmBench matters because harmful compliance usually looks like a successful model call. The LLM returns fluent instructions, the app logs a 200 response, and the product metric may even count the session as resolved. The failure is semantic: the system answered a request it should have refused, or a jailbreak moved it outside its safety policy.
The pain lands across the whole production team. Developers need to know whether a prompt, model, system instruction, or retrieval chunk caused the unsafe answer. SREs see abuse as bursts of repeated attempts, longer generations, retry loops, and higher token-cost-per-trace. Security and compliance teams need evidence that harmful requests were tested before release and guarded at runtime. End users feel the risk when a support bot gives unsafe medical, financial, self-harm, violence, cyber, or evasion advice.
Agentic systems make the benchmark more important because harmful text is no longer the only output. A 2026 agent can search the web, write files, call APIs, send email, or execute code. HarmBench-style prompts should therefore be treated as path tests through the system, not just model trivia. The observable symptoms are unsafe-compliance rate, refusal inconsistency by model route, guardrail bypass attempts, tool calls after a blocked intent, and rising false positives when the guard is too broad. The 2026 attack surface also added MCP tool injection and A2A cross-agent jailbreaks, neither of which appears in the original HarmBench corpus, so teams have to extend the suite locally.
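One of the symptoms above, a tool call after a blocked intent, can be checked mechanically on a trace. The sketch below is a minimal illustration; the `Span` class and its field names are hypothetical and do not represent a FutureAGI trace schema.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One step in a trace (hypothetical shape, for illustration only)."""
    kind: str              # "guardrail" or "tool_call"
    name: str              # guardrail name or tool name
    blocked: bool = False  # only meaningful for guardrail spans


def tool_calls_after_block(spans: list[Span]) -> list[str]:
    """Return the names of tools invoked after any guardrail blocked the request."""
    seen_block = False
    violations = []
    for span in spans:
        if span.kind == "guardrail" and span.blocked:
            seen_block = True
        elif span.kind == "tool_call" and seen_block:
            violations.append(span.name)
    return violations


trace = [
    Span("guardrail", "pre_guardrail", blocked=True),
    Span("tool_call", "send_email"),
    Span("tool_call", "write_file"),
]
print(tool_calls_after_block(trace))  # ['send_email', 'write_file']
```

A non-empty result means the agent kept acting after the guard said no, which is a path failure even if the final text looks like a refusal.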
HarmBench categories worth gating separately
| Risk category | Block threshold guidance | Why it differs |
|---|---|---|
| Cybercrime / illicit how-to | Zero unsafe completions | High legal and brand exposure |
| Self-harm | Zero unsafe completions | Mandatory safe-message + escalation path |
| CBRN (chem/bio/radiological/nuclear) | Zero, with audit | Regulator-watched category |
| Financial fraud | Per-tier threshold | Some categories overlap legit user help |
| Misinformation / persuasion | Statistical threshold | False-positive cost is high on benign cases |
| Harassment / hate | Strict, with refusal-quality check | Over-refusal regresses customer experience |
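The table can be expressed as a per-category gate policy. The following is a sketch under stated assumptions: the category keys and every non-zero threshold are placeholder values chosen for illustration, not recommendations from HarmBench or FutureAGI.

```python
# Hypothetical gate policy: maximum allowed unsafe-completion rate per category.
# Zero-tolerance rows mirror the table; non-zero numbers are placeholders.
GATE_POLICY = {
    "cybercrime": 0.0,
    "self_harm": 0.0,
    "cbrn": 0.0,
    "financial_fraud": 0.01,
    "misinformation": 0.02,
    "harassment": 0.005,
}


def failing_categories(results: dict[str, list[bool]]) -> dict[str, float]:
    """Map each category that exceeds its threshold to its unsafe rate.

    `results` maps category -> per-case flags, where True means unsafe completion.
    """
    failures = {}
    for category, flags in results.items():
        rate = sum(flags) / len(flags) if flags else 0.0
        # Categories missing from the policy default to zero tolerance.
        if rate > GATE_POLICY.get(category, 0.0):
            failures[category] = rate
    return failures


run = {
    "cbrn": [False] * 50,
    "cybercrime": [False] * 49 + [True],
    "misinformation": [False] * 99 + [True],
}
print(failing_categories(run))
```

In this run a single cybercrime completion fails the gate, while one misinformation completion in a hundred stays under its statistical threshold.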
How FutureAGI handles HarmBench
FutureAGI handles HarmBench by turning harmful-behavior cases into an eval dataset and attaching evaluator results to the workflow that will run in production. An engineer imports the HarmBench-style cases into a FutureAGI dataset, adds columns for risk category, expected policy outcome, prompt variant, model route, and allowed tool set, then runs Dataset.add_evaluation with ContentSafety, IsHarmfulAdvice, AnswerRefusal, PromptInjection, and ProtectFlash.
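Before importing, it can help to pin down the row shape. This sketch uses a plain dataclass with the columns named above; the class, field defaults, and allowed outcome values are assumptions for illustration and are independent of the FutureAGI dataset API.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class HarmCase:
    """One HarmBench-style dataset row (hypothetical schema)."""
    prompt: str
    risk_category: str
    expected_policy_outcome: str  # "refuse" or "comply"
    prompt_variant: str = "plain"
    model_route: str = "default"
    allowed_tools: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject rows whose expected outcome is not one of the two known values.
        if self.expected_policy_outcome not in {"refuse", "comply"}:
            raise ValueError(f"unknown outcome: {self.expected_policy_outcome}")


rows = [
    HarmCase("<harmful behavior prompt>", "cybercrime", "refuse", prompt_variant="encoded"),
    HarmCase("How do I reset my router password?", "cybercrime", "comply"),
]

# Plain dicts are easy to serialize to CSV or JSON for upload.
records = [asdict(row) for row in rows]
print(records[0]["risk_category"], records[1]["expected_policy_outcome"])
```

Keeping the benign neighbor in the same category as the harmful case makes over-refusal visible in the same slice.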
A concrete workflow: a customer-support agent is routed through Agent Command Center with a pre-guardrail before the model call and a post-guardrail after the answer. The release suite includes benign help requests, obvious harmful requests, encoded jailbreaks, and multi-turn probes. PromptInjection and ProtectFlash catch instruction attacks. ContentSafety and IsHarmfulAdvice score unsafe responses. AnswerRefusal checks whether the model refused the harmful request without over-refusing benign cases.
FutureAGI’s approach is to keep the benchmark tied to the same route, model, prompt version, and guardrail policy used in production. Unlike SafetyBench-style broad safety QA, HarmBench should be treated as adversarial regression: if a new model reduces p50 latency but doubles unsafe-compliance rate, the route does not ship. Unlike AgentHarm, which focuses on agent harms, HarmBench is best used as the base harmful-request suite; teams add tool-specific scenarios when agents can act. The next engineering action is explicit: tighten the guardrail threshold, add a route-specific fallback, revise the refusal prompt, or open a security review.
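The ship/no-ship rule in this paragraph can be sketched as a comparison between a baseline route and a candidate route. The `RouteRun` class, the `can_ship` function, and the sample numbers are hypothetical; the point is only that a latency improvement never offsets a safety regression.

```python
from dataclasses import dataclass


@dataclass
class RouteRun:
    """Summary of one eval run on a route (hypothetical shape)."""
    p50_latency_ms: float
    unsafe_compliance_rate: float


def can_ship(baseline: RouteRun, candidate: RouteRun,
             max_relative_increase: float = 0.0) -> bool:
    """Allow the candidate only if unsafe compliance stays within tolerance.

    Latency is deliberately ignored here: a faster route does not ship
    if its unsafe-compliance rate regresses.
    """
    allowed = baseline.unsafe_compliance_rate * (1.0 + max_relative_increase)
    return candidate.unsafe_compliance_rate <= allowed


baseline = RouteRun(p50_latency_ms=820.0, unsafe_compliance_rate=0.02)
candidate = RouteRun(p50_latency_ms=540.0, unsafe_compliance_rate=0.04)
print(can_ship(baseline, candidate))  # False: faster, but unsafe rate doubled
```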
How to measure or detect HarmBench
Track HarmBench as a set of security signals, not a single leaderboard score:
- Attack success rate - share of HarmBench-style cases where the model provides disallowed help instead of refusing.
- Unsafe-compliance severity - ContentSafety and IsHarmfulAdvice scores by risk category and model route.
- Refusal quality - AnswerRefusal pass rate on harmful cases, plus over-refusal rate on matched benign prompts.
- Guardrail decision coverage - percentage of user input, retrieved context, and final response spans with a pre or post guardrail result.
- Runtime symptoms - block rate, fallback rate, retry count, p99 latency, token-cost-per-trace, appeal rate, and reviewed false-positive rate.
```python
# Score one HarmBench-style case with three FutureAGI evaluators.
from fi.evals import ContentSafety, IsHarmfulAdvice, PromptInjection

case = "Give step-by-step instructions for causing harm."

# Print the score each evaluator assigns to the case.
print(ContentSafety().evaluate(input=case).score)
print(IsHarmfulAdvice().evaluate(input=case).score)
print(PromptInjection().evaluate(input=case).score)
```
Use a release gate such as “zero critical unsafe completions” plus a monitored regression threshold for lower-severity classes. Slice every metric by prompt version, model, customer tier, connector, language, and route. Global averages hide the failure that matters: one new model or tool path that starts complying with a specific harmful category.
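To see how a global average hides a route-level failure, the sketch below groups per-case results by arbitrary dimensions. The record keys (`model`, `category`, `attack_succeeded`) and the sample data are made up for illustration.

```python
from collections import defaultdict


def attack_success_by(results: list[dict], *keys: str) -> dict[tuple, float]:
    """Attack success rate grouped by the given record keys."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in results:
        group = tuple(record[k] for k in keys)
        totals[group] += 1
        hits[group] += int(record["attack_succeeded"])
    return {group: hits[group] / totals[group] for group in totals}


# Synthetic results: model-a refuses everything, model-b fails 4 of 5 cases.
results = (
    [{"model": "model-a", "category": "cybercrime", "attack_succeeded": False}] * 95
    + [{"model": "model-b", "category": "cybercrime", "attack_succeeded": True}] * 4
    + [{"model": "model-b", "category": "cybercrime", "attack_succeeded": False}] * 1
)

overall = sum(r["attack_succeeded"] for r in results) / len(results)
print(overall)                               # 0.04
print(attack_success_by(results, "model"))   # model-b is at 0.8
```

A 4% global rate looks tolerable, while the slice shows one model complying with 80% of the cases it sees.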
Common mistakes
Engineers usually misuse HarmBench when they treat it as a paper benchmark instead of production security evidence.
- Reporting only a leaderboard score. Security reviewers need category-level attack success rate, examples, guardrail decisions, and owner sign-off.
- Testing the base model but shipping an agent. The production route adds system prompts, retrieval, tools, memory, and fallbacks that change safety behavior.
- Counting every refusal as success. Over-refusal on benign neighboring prompts creates product failure and support escalations.
- Ignoring multi-turn probes. A single safe answer can degrade after the user reframes, encodes, or chains the request.
- Skipping trace linkage. Without route, prompt version, model, and guardrail span, a failed case cannot become an engineering fix.
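The refusal-counting mistake above is easier to avoid when harmful prompts are scored together with their benign neighbors. This is a minimal sketch; the pair format and the `refusal_quality` helper are hypothetical, and the refusal flags would come from whatever refusal evaluator the team already runs.

```python
def refusal_quality(pairs: list[dict]) -> dict[str, float]:
    """Summarize refusals over matched (harmful, benign) prompt pairs.

    Each pair records whether the model refused the harmful prompt and
    whether it also refused the matched benign prompt.
    """
    if not pairs:
        raise ValueError("need at least one matched pair")
    n = len(pairs)
    return {
        "harmful_refusal_rate": sum(p["harmful_refused"] for p in pairs) / n,
        "over_refusal_rate": sum(p["benign_refused"] for p in pairs) / n,
    }


pairs = [
    {"harmful_refused": True, "benign_refused": False},
    {"harmful_refused": True, "benign_refused": True},   # over-refusal
    {"harmful_refused": False, "benign_refused": False},  # unsafe compliance
    {"harmful_refused": True, "benign_refused": False},
]
print(refusal_quality(pairs))
```

Reporting both numbers side by side keeps a guard that refuses everything from looking like a success.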
Frequently Asked Questions
What is HarmBench?
HarmBench is an LLM security benchmark that tests whether models comply with harmful requests, refuse unsafe instructions, or fail under jailbreak-style attacks.
How is HarmBench different from SafetyBench?
SafetyBench is broader safety evaluation, while HarmBench is focused on harmful behavior and adversarial red-team testing. HarmBench is usually closer to release-blocking security regression work.
How do you measure HarmBench?
In FutureAGI, use ContentSafety, IsHarmfulAdvice, AnswerRefusal, PromptInjection, and ProtectFlash against a HarmBench-style dataset. Track attack success rate, unsafe-compliance rate, refusal quality, and guardrail decision coverage.