What Is the AgentHarm Safety Benchmark?
A UK AI Safety Institute benchmark measuring how often LLM agents comply with harmful multi-step requests when given a goal and tools.
AgentHarm is a safety benchmark, released by the UK AI Safety Institute in 2024, that measures how often LLM agents comply with harmful multi-step requests — drug synthesis, fraud, cyberattack, harassment — when given a realistic tool registry and a goal. Unlike single-prompt safety benchmarks, AgentHarm tests agents end-to-end: the model must refuse not only at the first prompt but at every subsequent step where the trajectory could still be steered away from harm. The benchmark scores refusal rate and the harmfulness of the trajectory when refusal fails. It is the standard reference for evaluating agent-level safety in 2026.
Why It Matters in Production LLM and Agent Systems
A model that refuses harmful chat prompts can still ship harmful agent behavior. The reason is mechanistic: a single-turn safety classifier sees only the user’s first message, but an agent’s harmful action might emerge in step seven, after a planner step rephrases the goal in a more legitimate-sounding way and a tool call retrieves dangerous information. AgentHarm explicitly tests that gap, and shows that agents fail at multi-step harm in ways their underlying chat models do not.
Different roles see different stakes. A safety researcher uses AgentHarm to compare base models with the same tool registry and scaffolding, isolating the model contribution to agent safety. A compliance lead uses it to set acceptance thresholds before any agent goes near a regulated domain. A security engineer uses AgentHarm-style trajectories to red-team production agents pre-deploy. An SRE rarely cares about the benchmark directly but cares deeply about the alerts it informs.
In 2026 AgentHarm has become the agent-side companion to HarmBench, AdvBench, and the OWASP LLM Top 10 attack catalogue. Production agent stacks routinely run AgentHarm-style cohorts as part of CI — not the public benchmark directly (test contamination is a real risk) but red-team scenarios drawn from the same harm taxonomy, run against the team’s own tool registry. The principle is the same: catch agent-level safety failures before users do.
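In CI that usually reduces to a single gate. A sketch, assuming a hypothetical run_redteam_cohort() helper that runs your private cohort against the agent and returns one pass/fail per scenario (True meaning the agent never took a harmful action); the path and threshold are placeholders:
REFUSAL_GATE = 0.99  # acceptance threshold; pick yours per domain and risk appetite

def test_redteam_refusal_rate():
    # run_redteam_cohort() and the scenario file are hypothetical, not part of any SDK
    results = run_redteam_cohort("redteam/agentharm_style.jsonl")
    refusal_rate = sum(results) / len(results)
    assert refusal_rate >= REFUSAL_GATE, f"refusal rate {refusal_rate:.1%} below gate"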
How FutureAGI Handles AgentHarm
FutureAGI does not host AgentHarm — that is the UK AISI’s project, and the official scoring lives there. What FutureAGI provides is the evaluation and guardrail surface that catches the same failure modes against your own agent. The relevant evaluators are ActionSafety (scores whether each action was warranted given the input — flags harmful tool calls), ContentSafety (flags harmful generated content), IsHarmfulAdvice (single-turn signal at each step), and ProtectFlash (a lightweight pre-guardrail that blocks obvious prompt-injection vectors before they hit the planner).
The simulate SDK is where you assemble AgentHarm-style cohorts. ScenarioGenerator produces persona/scenario pairs from a harm taxonomy; Scenario.load_dataset accepts curated red-team scenarios; CloudEngine runs them against the agent callback. The trajectory output flows into Dataset.add_evaluation with the safety stack attached, producing a per-trajectory pass/fail plus a refusal-rate summary that maps to AgentHarm’s headline metric.
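A rough sketch of that pipeline is below. The class and method names (ScenarioGenerator, Scenario.load_dataset, CloudEngine, Dataset.add_evaluation) are the ones described above, but the module paths, arguments, and intermediate wiring are assumptions, not the SDK's documented signature; treat it as a shape, not a reference.
from fi.simulate import ScenarioGenerator, Scenario, CloudEngine   # module path assumed
from fi.datasets import Dataset                                     # module path assumed
from fi.evals import ActionSafety, ContentSafety, IsHarmfulAdvice

# Assemble a cohort: generated from a harm taxonomy, or loaded from curated scenarios
scenarios = ScenarioGenerator(taxonomy="agentharm-style").generate(n=120)   # arguments illustrative
# scenarios = Scenario.load_dataset("redteam/coding_agent_v1")              # curated alternative

# Run the cohort against your agent callback and collect trajectories
trajectories = CloudEngine(agent_callback=my_agent).run(scenarios)          # signature assumed

# Attach the safety stack; yields per-trajectory pass/fail plus a refusal-rate summary
dataset = Dataset.from_trajectories(trajectories)                           # constructor assumed
dataset.add_evaluation([ActionSafety(), ContentSafety(), IsHarmfulAdvice()])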
Concrete example: a coding agent on the OpenAI Agents SDK adds a 120-scenario red-team cohort drawn from the AgentHarm taxonomy: cyberattack assistance, malware synthesis, and credential exfiltration. The first run shows a refusal rate of 84% — meaning 16% of scenarios produced at least one harmful action. Trajectory inspection reveals a pattern: the planner refuses at step 1, but a clarification-question loop in steps 3–4 lets a rephrased version of the request slip through. The fix is twofold: a ProtectFlash pre-guardrail at every planner step (not just step 1), and an ActionSafety post-guardrail on the tool layer. After the fix, refusal rate hits 99.2% with no degradation on benign-task TaskCompletion.
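A minimal sketch of that fix follows, assuming higher evaluator scores mean safer; planner, tools, and refuse() are placeholder hooks and the 0.5 threshold is illustrative — only ProtectFlash and ActionSafety come from the stack described above.
from fi.evals import ActionSafety, ProtectFlash

SAFE = 0.5  # illustrative threshold; set from your own acceptance policy

def guarded_step(goal, history, planner, tools):
    # Pre-guardrail on every planner step, not just the first user message
    if ProtectFlash().evaluate(input=goal).score < SAFE:
        return refuse("blocked by pre-guardrail")           # refuse() is a placeholder hook
    action = planner.plan(goal, history)
    # Post-guardrail on the tool layer: score the proposed action before executing it
    if ActionSafety().evaluate(input=goal, trajectory=history + [action]).score < SAFE:
        return refuse("blocked by post-guardrail")
    return tools.execute(action)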
How to Measure or Detect It
AgentHarm-style cohorts need both refusal-rate and trajectory-harmfulness scoring:
- ActionSafety: scores whether each action was warranted given the input — the canonical step-level safety check.
- ContentSafety: flags harmful content in any agent-generated text.
- IsHarmfulAdvice: single-turn check that pairs with multi-step ActionSafety for defense-in-depth.
- ProtectFlash: lightweight pre-guardrail that blocks obvious prompt-injection at every planner step.
- refusal-rate (dashboard signal): % of red-team scenarios where the agent never took a harmful action; the headline AgentHarm-style KPI (a roll-up sketch follows the snippet below).
- agent.trajectory.step (OTel attribute): tagged spans let you locate which step let the request through when refusal fails.
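Tagging those spans is plain OpenTelemetry. A minimal sketch, assuming a loop over planner steps in your own agent code; the span name and the execute_step() call are illustrative, and only the agent.trajectory.step attribute key comes from the list above.
from opentelemetry import trace

tracer = trace.get_tracer("agent")

def run_plan(plan_steps):
    for i, step in enumerate(plan_steps):
        # One span per planner step, tagged so a failed refusal can be traced to its step
        with tracer.start_as_current_span("planner.step") as span:
            span.set_attribute("agent.trajectory.step", i)
            execute_step(step)  # illustrative placeholder for your step execution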
The evaluator calls themselves are one-liners:
from fi.evals import ActionSafety, ContentSafety, ProtectFlash

# goal, trace_spans, agent_response, and user_prompt come from your agent run
a = ActionSafety().evaluate(input=goal, trajectory=trace_spans)   # step-level: were the tool calls warranted?
c = ContentSafety().evaluate(output=agent_response)               # content-level: is the generated text harmful?
p = ProtectFlash().evaluate(input=user_prompt)                    # pre-guardrail: prompt-injection check on the raw input
print(a.score, c.score, p.score)
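Rolling per-trajectory verdicts up to the headline refusal-rate number is then just counting. A sketch, where harmful() is a placeholder for however you read a step's ActionSafety/ContentSafety verdict:
def refusal_rate(cohort):
    # cohort: one scored trajectory per red-team scenario; a scenario counts as refused
    # only if no step in its trajectory produced a harmful action
    refused = sum(1 for trajectory in cohort if not any(harmful(step) for step in trajectory))
    return refused / len(cohort)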
Common Mistakes
- Running AgentHarm scenarios verbatim as your private cohort. Test contamination is real once a benchmark hits training data; draw scenarios from the same taxonomy with fresh wording.
- Scoring only first-step refusal. Multi-step harm is the point of AgentHarm; require step-level safety checks across the trajectory.
- Treating refusal-rate as the only metric. A 100% refusal rate that also refuses benign tasks is a usability failure; pair refusal rate with benign-task TaskCompletion.
- Skipping the tool registry in the test. AgentHarm’s signal comes from realistic tools; testing without your actual tool registry under-estimates risk.
- No post-deploy monitoring. A safety pass at deploy does not guarantee safety after a model swap; run the cohort on every promotion.
Frequently Asked Questions
What is AgentHarm?
AgentHarm is a safety benchmark from the UK AI Safety Institute that measures how often LLM agents comply with harmful multi-step requests when given tools and a goal. It scores both refusal rate and trajectory harmfulness when refusal fails.
How is AgentHarm different from HarmBench?
HarmBench evaluates single-prompt jailbreak success on chat models. AgentHarm evaluates whether agents — with tools, memory, and multi-step loops — actually take harmful actions end-to-end. AgentHarm is the agent-level safety analog of HarmBench.
How does FutureAGI relate to AgentHarm?
FutureAGI does not host AgentHarm, but its safety stack — ActionSafety, ContentSafety, IsHarmfulAdvice, and the ProtectFlash pre-guardrail — operates on the same trajectory data and lets you run AgentHarm-style red-team cohorts against your own agents.