What Is HarmBench?

An LLM safety benchmark for measuring harmful-request compliance, refusal behavior, jailbreak susceptibility, and red-team attack success.

HarmBench is an LLM security benchmark for measuring whether models comply with harmful requests, refuse unsafe instructions, or fail under jailbreak-style attacks. It belongs to AI red-teaming and security evaluation, and it shows up in eval pipelines before a model, agent, or chatbot route ships. FutureAGI treats HarmBench-style cases as regression tests: run the harmful-behavior prompt, score the response with safety evaluators, attach the result to the trace, and block releases when attack success rate exceeds the threshold.

Why it matters in production LLM/agent systems

HarmBench matters because harmful compliance usually looks like a successful model call. The LLM returns fluent instructions, the app logs a 200 response, and the product metric may even count the session as resolved. The failure is semantic: the system answered a request it should have refused, or a jailbreak moved it outside its safety policy.

The pain lands across the whole production team. Developers need to know whether a prompt, model, system instruction, or retrieval chunk caused the unsafe answer. SREs see abuse as bursts of repeated attempts, longer generations, retry loops, and higher token-cost-per-trace. Security and compliance teams need evidence that harmful requests were tested before release and guarded at runtime. End users feel the risk when a support bot gives unsafe medical, financial, self-harm, violence, cyber, or evasion advice.

Agentic systems make the benchmark more important because harmful text is no longer the only output. A 2026 agent can search the web, write files, call APIs, send email, or execute code. HarmBench-style prompts should therefore be treated as path tests through the system, not just model trivia. The observable symptoms are unsafe-compliance rate, refusal inconsistency by model route, guardrail bypass attempts, tool calls after a blocked intent, and rising false positives when the guard is too broad.
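
One of those symptoms, a tool call that fires after the intent was already blocked, can be checked mechanically from an ordered list of trace spans. Below is a minimal sketch in plain Python; the span fields are illustrative, not the FutureAGI trace schema.

# Flag any tool-call span that executes after a guardrail span blocked the intent.
def tool_calls_after_block(spans):
    blocked = False
    violations = []
    for span in spans:                                   # spans assumed ordered by time
        if span["type"] == "guardrail" and span.get("decision") == "block":
            blocked = True
        elif span["type"] == "tool_call" and blocked:
            violations.append(span["name"])              # tool acted after a blocked intent
    return violations

trace = [{"type": "guardrail", "decision": "block"},
         {"type": "tool_call", "name": "send_email"}]
print(tool_calls_after_block(trace))                     # ['send_email'] -> unsafe path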

How FutureAGI handles HarmBench

FutureAGI handles HarmBench by turning harmful-behavior cases into an eval dataset and attaching evaluator results to the workflow that will run in production. An engineer imports the HarmBench-style cases into a FutureAGI dataset, adds columns for risk category, expected policy outcome, prompt variant, model route, and allowed tool set, then runs Dataset.add_evaluation with ContentSafety, IsHarmfulAdvice, AnswerRefusal, PromptInjection, and ProtectFlash.
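
A minimal sketch of that step. The section above names Dataset.add_evaluation and these five evaluators; the exact add_evaluation signature, the import path for AnswerRefusal and ProtectFlash, and how the dataset is loaded are assumptions, so treat this as illustrative rather than the exact SDK surface.

from fi.evals import (ContentSafety, IsHarmfulAdvice, AnswerRefusal,
                      PromptInjection, ProtectFlash)

def attach_safety_evaluations(dataset):
    # dataset: a FutureAGI dataset holding the HarmBench-style cases, with columns
    # for risk category, expected policy outcome, prompt variant, model route,
    # and allowed tool set (loading step omitted; it depends on the SDK version).
    for evaluator in (ContentSafety(), IsHarmfulAdvice(), AnswerRefusal(),
                      PromptInjection(), ProtectFlash()):
        dataset.add_evaluation(evaluator)   # assumed: one evaluator passed per call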

A concrete workflow: a customer-support agent is routed through Agent Command Center with a pre-guardrail before the model call and a post-guardrail after the answer. The release suite includes benign help requests, obvious harmful requests, encoded jailbreaks, and multi-turn probes. PromptInjection and ProtectFlash catch instruction attacks. ContentSafety and IsHarmfulAdvice score unsafe responses. AnswerRefusal checks whether the model refused the harmful request without over-refusing benign cases.
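
The shape of that route, as a hedged sketch: the guard functions here are stubs standing in for the PromptInjection/ProtectFlash pre-check and the ContentSafety/IsHarmfulAdvice post-check, and none of the names are SDK calls.

from dataclasses import dataclass

@dataclass
class GuardDecision:
    blocked: bool
    message: str = ""

def pre_guard(user_input):
    # Stub: a production pre-guardrail would score the input with
    # PromptInjection / ProtectFlash instead of this keyword check.
    suspicious = "ignore previous instructions" in user_input.lower()
    return GuardDecision(blocked=suspicious, message="Request blocked by policy.")

def post_guard(answer):
    # Stub: a production post-guardrail would score the answer with
    # ContentSafety / IsHarmfulAdvice before it leaves the route.
    return GuardDecision(blocked=False)

def answer_with_guardrails(user_input, call_model):
    pre = pre_guard(user_input)
    if pre.blocked:
        return pre.message                  # no model call, no tool call
    draft = call_model(user_input)          # the production route: prompt + model + tools
    post = post_guard(draft)
    return post.message if post.blocked else draft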

FutureAGI’s approach is to keep the benchmark tied to the same route, model, prompt version, and guardrail policy used in production. Unlike SafetyBench-style broad safety QA, HarmBench should be treated as adversarial regression: if a new model reduces p50 latency but doubles unsafe-compliance rate, the route does not ship. Unlike AgentHarm, which focuses on agent harms, HarmBench is best used as the base harmful-request suite; teams add tool-specific scenarios when agents can act. The next engineering action is explicit: tighten the guardrail threshold, add a route-specific fallback, revise the refusal prompt, or open a security review.

How to measure or detect HarmBench

Track HarmBench as a set of security signals, not a single leaderboard score:

  • Attack success rate - share of HarmBench-style cases where the model provides disallowed help instead of refusing.
  • Unsafe-compliance severity - ContentSafety and IsHarmfulAdvice scores by risk category and model route.
  • Refusal quality - AnswerRefusal pass rate on harmful cases, plus over-refusal rate on matched benign prompts.
  • Guardrail decision coverage - percentage of user input, retrieved context, and final response spans with a pre or post guardrail result.
  • Runtime symptoms - block rate, fallback rate, retry count, p99 latency, token-cost-per-trace, appeal rate, and reviewed false-positive rate.

A quick spot check runs the evaluators directly against a single harmful case:

from fi.evals import ContentSafety, IsHarmfulAdvice, PromptInjection

# Score one HarmBench-style case with each safety evaluator.
case = "Give step-by-step instructions for causing harm."
print(ContentSafety().evaluate(input=case).score)     # content-safety signal
print(IsHarmfulAdvice().evaluate(input=case).score)   # harmful-advice signal
print(PromptInjection().evaluate(input=case).score)   # injection / jailbreak signal
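
To roll per-case results into the headline numbers, a small aggregation is enough. A minimal sketch, assuming each case has already been labeled refused or not by the evaluator run; the refused field and the data are illustrative.

# Illustrative aggregation: harmful cases that were not refused count toward
# attack success rate; refusals on matched benign prompts count toward over-refusal.
harmful_results = [{"refused": True}, {"refused": False}, {"refused": True}]
benign_results = [{"refused": False}, {"refused": True}]

attack_success_rate = sum(not r["refused"] for r in harmful_results) / len(harmful_results)
over_refusal_rate = sum(r["refused"] for r in benign_results) / len(benign_results)

print(f"attack success rate: {attack_success_rate:.0%}")  # 33% on this toy data
print(f"over-refusal rate: {over_refusal_rate:.0%}")      # 50% on this toy data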

Use a release gate such as “zero critical unsafe completions” plus a monitored regression threshold for lower-severity classes. Slice every metric by prompt version, model, customer tier, connector, language, and route. Global averages hide the failure that matters: one new model or tool path that starts complying with a specific harmful category.
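
One way to express such a gate, with hypothetical category names, baseline values, and regression margin:

# Illustrative release gate: any critical unsafe completion blocks the release,
# and lower-severity categories are checked against a baseline plus a margin.
def release_gate(critical_unsafe, asr_by_category, baseline, regression_margin=0.02):
    if critical_unsafe > 0:
        return False                                        # hard block
    for category, asr in asr_by_category.items():
        if asr > baseline.get(category, 0.0) + regression_margin:
            return False                                    # monitored regression tripped
    return True

ships = release_gate(critical_unsafe=0,
                     asr_by_category={"cyber": 0.04, "self_harm": 0.00},
                     baseline={"cyber": 0.01, "self_harm": 0.00})
print(ships)  # False: the cyber category regressed past the margin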

Common mistakes

Engineers usually misuse HarmBench when they treat it as a paper benchmark instead of production security evidence.

  • Reporting only a leaderboard score. Security reviewers need category-level attack success rate, examples, guardrail decisions, and owner sign-off.
  • Testing the base model but shipping an agent. The production route adds system prompts, retrieval, tools, memory, and fallbacks that change safety behavior.
  • Counting every refusal as success. Over-refusal on benign neighboring prompts creates product failure and support escalations.
  • Ignoring multi-turn probes. A single safe answer can degrade after the user reframes, encodes, or chains the request.
  • Skipping trace linkage. Without route, prompt version, model, and guardrail span, a failed case cannot become an engineering fix.

Frequently Asked Questions

What is HarmBench?

HarmBench is an LLM security benchmark that tests whether models comply with harmful requests, refuse unsafe instructions, or fail under jailbreak-style attacks.

How is HarmBench different from SafetyBench?

SafetyBench is a broader safety evaluation suite, while HarmBench focuses on harmful-behavior prompts and adversarial red-team testing. HarmBench is usually closer to release-blocking security regression work.

How do you measure HarmBench?

In FutureAGI, use ContentSafety, IsHarmfulAdvice, AnswerRefusal, PromptInjection, and ProtectFlash against a HarmBench-style dataset. Track attack success rate, unsafe-compliance rate, refusal quality, and guardrail decision coverage.