Security

What Is the XSTest Harmful Content Attack?

Adversarial subset of the XSTest safety benchmark: prompts designed to elicit harmful content while paired with safe-but-similar variants for over-refusal testing.

What Is the XSTest Harmful Content Attack?

XSTest is a safety benchmark from Roettger et al. (originally 2023, regularly updated through 2026) that tests whether an LLM correctly distinguishes genuinely unsafe requests from safe-but-superficially-similar ones. It addresses both failure modes: models that comply with unsafe prompts (under-refusal) and models that refuse safe prompts (over-refusal). The “harmful content attack” framing refers to the unsafe half of the test set — prompts designed to elicit dangerous output across categories like illegal activity, self-harm advice, misinformation generation, and weaponization. FutureAGI runs XSTest-style adversarial prompts via Protect and the simulate SDK, scoring refusal accuracy across both cohorts.

Why It Matters in Production LLM and Agent Systems

Safety evaluation that only tests under-refusal misses half the picture. A model that refuses 100% of unsafe prompts and 100% of look-alike safe prompts is not safe — it’s broken. A real-world example: a customer-support AI that refuses any message containing the word “weapon” because it might be unsafe — and breaks for legitimate queries about firearms-product returns at a sporting-goods retailer. XSTest’s contribution is to pair every unsafe prompt with one or more safe near-twins, so a model gets credit only when it makes the right distinction.

The pain shows up in three places. A safety-conscious team tunes a model with strong refusal training and sees customer-NPS drop because legitimate queries now get refused. A product team adds a guardrail that catches all the obvious unsafe prompts but no one tests safe variants, and complaints surface weeks later. A vendor reports 99% safety on an in-house benchmark but the customer’s real users hit a 12% over-refusal rate that nobody measured.

The 2026 reality is that safety benchmarks have evolved. HarmBench and AgentHarm focus on adversarial harmfulness. XSTest specifically targets the over-refusal failure mode. SafetyBench is broader. Mature teams run several benchmarks and report all four numbers: under-refusal rate, over-refusal rate, accurate refusal rate, accurate compliance rate.

How FutureAGI Handles XSTest Harmful Content Attack

FutureAGI’s approach is to run XSTest-style scenarios as part of the regular safety regression suite. The pattern: load XSTest prompts (or a curated equivalent) into a Dataset with each row tagged as harmful=true or harmful=false, run the agent through every row, and score with AnswerRefusal and ContentSafety. The dashboard reports four cells: harmful-refused (correct), harmful-complied (under-refusal), safe-refused (over-refusal), safe-complied (correct). The headline metric is balanced accuracy across both classes.

A concrete example: a healthcare AI deploys a new model version. The team runs the XSTest-style suite via the simulate SDK’s Scenario.load_dataset against the agent. Results: harmful-refused 96%, safe-complied 78%. The over-refusal rate of 22% is unacceptable for a healthcare assistant where users frequently ask about medications, dosages, and side effects that contain words tagged as risky by an over-trained safety classifier. The team uses the GEPA optimizer (agent-opt) to refine the system prompt against both cohorts simultaneously. After three iterations, harmful-refused holds at 95% and safe-complied lifts to 91%. The release ships behind Agent Command Center’s traffic-mirroring to verify production behavior matches simulation.

How to Measure or Detect It

XSTest-style evaluation requires both refusal-correctness and refusal-precision signals:

  • AnswerRefusal — scores whether the model refused; combined with prompt label, computes refusal precision and recall.
  • ContentSafety — flags actual unsafe content in compliant responses.
  • Toxicity — secondary safety screen on responses.
  • Balanced accuracy (dashboard signal) — average of harmful-refused-rate and safe-complied-rate.
  • Refusal F1 — combines precision (correct refusals / all refusals) and recall (correct refusals / all unsafe prompts).
  • Over-refusal cohort breakdown — which categories of safe prompts get refused; surfaces over-broad guardrails.
from fi.evals import AnswerRefusal, ContentSafety

refusal = AnswerRefusal()
safety = ContentSafety()

# Run across both cohorts; compute balanced accuracy externally.
result_refusal = refusal.evaluate(output=model_response, input=prompt)
result_safety = safety.evaluate(output=model_response)

Common Mistakes

  • Reporting refusal rate alone. A 99% refusal rate on unsafe prompts is meaningless without the safe-prompt rate.
  • Training on harmful prompts only. Models become over-cautious; train on both safe and harmful with balanced labels.
  • One guardrail threshold. Different content categories need different sensitivity; tune per category.
  • No regression eval per release. Safety behavior drifts with prompt and model changes; gate every release on balanced accuracy.
  • Ignoring localization. Refusal patterns vary across languages; XSTest needs language-specific equivalents.

Frequently Asked Questions

What is the XSTest harmful content attack?

XSTest is a safety benchmark that tests whether an LLM correctly distinguishes genuinely unsafe requests from safe-but-superficially-similar ones. The harmful-content portion contains prompts designed to elicit dangerous output; paired safe variants test for over-refusal.

How is XSTest different from HarmBench or SafetyBench?

HarmBench focuses purely on adversarial harmful prompts; SafetyBench is a broader multi-domain benchmark. XSTest is unique in pairing harmful prompts with safe near-twins, so it scores both under-refusal (unsafe behavior) and over-refusal (broken helpfulness).

How does FutureAGI run XSTest-style attacks?

FutureAGI ships XSTest-style scenarios via the simulate SDK and Protect's red-team suite. Each prompt runs through the agent; AnswerRefusal scores correct refusal on harmful prompts and ContentSafety verifies safe-prompt responses are not over-refused.