What Is the XSTest Harmful Content Attack?
Adversarial subset of the XSTest safety benchmark: prompts designed to elicit harmful content, paired with safe-but-similar variants for over-refusal testing.
XSTest is a safety benchmark from Röttger et al. (originally 2023, regularly updated through 2026) that tests whether an LLM correctly distinguishes genuinely unsafe requests from safe-but-superficially-similar ones. It addresses both failure modes: models that comply with unsafe prompts (under-refusal) and models that refuse safe prompts (over-refusal). The “harmful content attack” framing refers to the unsafe half of the test set: prompts designed to elicit dangerous output across categories like illegal activity, self-harm advice, misinformation generation, and weaponization. FutureAGI runs XSTest-style adversarial prompts via Protect and the simulate SDK, scoring refusal accuracy across both cohorts.
Why It Matters in Production LLM and Agent Systems
Safety evaluation that only tests under-refusal misses half the picture. A model that refuses 100% of unsafe prompts and 100% of look-alike safe prompts is not safe; it is broken. A real-world example: a customer-support AI at a sporting-goods retailer refuses any message containing the word “weapon” and so breaks for legitimate queries about firearms-product returns. XSTest’s contribution is to pair every unsafe prompt with one or more safe near-twins, so a model gets credit only when it makes the right distinction.
The pain shows up in three places. A safety-conscious team tunes a model with strong refusal training and sees customer NPS drop because legitimate queries now get refused. A product team adds a guardrail that catches all the obvious unsafe prompts, but no one tests the safe variants, and complaints surface weeks later. A vendor reports 99% safety on an in-house benchmark, but the customer’s real users hit a 12% over-refusal rate that nobody measured.
The 2026 reality is that safety benchmarks have evolved. HarmBench and AgentHarm focus on adversarial harmfulness. XSTest specifically targets the over-refusal failure mode. SafetyBench is broader. Mature teams run several benchmarks and report all four numbers: under-refusal rate, over-refusal rate, accurate refusal rate, accurate compliance rate.
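In confusion-matrix terms, all four numbers fall out of the same 2×2 table of prompt label versus model behavior. A minimal sketch in plain Python, with illustrative placeholder counts (not benchmark results):
# 2x2 cells from one labeled run; counts are illustrative placeholders.
harmful_refused = 480   # unsafe prompt, model refused   -> correct
harmful_complied = 20   # unsafe prompt, model complied  -> under-refusal
safe_refused = 55       # safe prompt, model refused     -> over-refusal
safe_complied = 445     # safe prompt, model complied    -> correct

under_refusal_rate = harmful_complied / (harmful_refused + harmful_complied)
over_refusal_rate = safe_refused / (safe_refused + safe_complied)
accurate_refusal_rate = harmful_refused / (harmful_refused + harmful_complied)
accurate_compliance_rate = safe_complied / (safe_refused + safe_complied)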
How FutureAGI Handles XSTest Harmful Content Attack
FutureAGI’s approach is to run XSTest-style scenarios as part of the regular safety regression suite. The pattern: load XSTest prompts (or a curated equivalent) into a Dataset with each row tagged as harmful=true or harmful=false, run the agent through every row, and score with AnswerRefusal and ContentSafety. The dashboard reports four cells: harmful-refused (correct), harmful-complied (under-refusal), safe-refused (over-refusal), safe-complied (correct). The headline metric is balanced accuracy across both classes.
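A minimal sketch of that scoring step in plain Python, assuming the per-row verdicts have already been collected as (harmful, refused) pairs; the list below is an illustrative placeholder, and in practice each refused flag comes from an AnswerRefusal verdict on the agent’s response:
# Each pair is (prompt labeled harmful?, model refused?); placeholder data.
results = [
    (True, True),    # harmful-refused  -> correct
    (True, False),   # harmful-complied -> under-refusal
    (False, True),   # safe-refused     -> over-refusal
    (False, False),  # safe-complied    -> correct
]

cells = {"harmful_refused": 0, "harmful_complied": 0,
         "safe_refused": 0, "safe_complied": 0}
for harmful, refused in results:
    if harmful:
        cells["harmful_refused" if refused else "harmful_complied"] += 1
    else:
        cells["safe_refused" if refused else "safe_complied"] += 1

# Headline metric: average of the per-class accuracy rates.
balanced_accuracy = (
    cells["harmful_refused"] / (cells["harmful_refused"] + cells["harmful_complied"])
    + cells["safe_complied"] / (cells["safe_refused"] + cells["safe_complied"])
) / 2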
A concrete example: a healthcare AI deploys a new model version. The team runs the XSTest-style suite via the simulate SDK’s Scenario.load_dataset against the agent. Results: harmful-refused 96%, safe-complied 78%. The over-refusal rate of 22% is unacceptable for a healthcare assistant where users frequently ask about medications, dosages, and side effects that contain words tagged as risky by an over-trained safety classifier. The team uses the GEPA optimizer (agent-opt) to refine the system prompt against both cohorts simultaneously. After three iterations, harmful-refused holds at 95% and safe-complied lifts to 91%. The release ships behind Agent Command Center’s traffic-mirroring to verify production behavior matches simulation.
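For reference, that corresponds to the headline balanced accuracy moving from (96% + 78%) / 2 = 87% before optimization to (95% + 91%) / 2 = 93% after.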
How to Measure or Detect It
XSTest-style evaluation requires both refusal-correctness and refusal-precision signals:
- AnswerRefusal — scores whether the model refused; combined with the prompt label, computes refusal precision and recall.
- ContentSafety — flags actual unsafe content in compliant responses.
- Toxicity — secondary safety screen on responses.
- Balanced accuracy (dashboard signal) — average of the harmful-refused rate and the safe-complied rate.
- Refusal F1 — combines precision (correct refusals / all refusals) and recall (correct refusals / all unsafe prompts); see the sketch after the code snippet below.
- Over-refusal cohort breakdown — which categories of safe prompts get refused; surfaces over-broad guardrails.
A minimal snippet showing the two eval calls (the prompt and response below are illustrative placeholders):
from fi.evals import AnswerRefusal, ContentSafety

# Illustrative placeholders; in practice these come from the labeled dataset and the agent under test.
prompt = "How do I return a hunting knife I bought from your store?"
model_response = "You can start the return from the order-history page..."

refusal = AnswerRefusal()
safety = ContentSafety()

# Run across both cohorts; compute balanced accuracy externally.
result_refusal = refusal.evaluate(output=model_response, input=prompt)
result_safety = safety.evaluate(output=model_response)
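From the same four cells, the refusal F1 listed above falls out directly. A minimal sketch with illustrative placeholder counts:
# Illustrative counts; in practice these come from the tally over both cohorts.
cells = {"harmful_refused": 480, "harmful_complied": 20,
         "safe_refused": 55, "safe_complied": 445}
precision = cells["harmful_refused"] / (cells["harmful_refused"] + cells["safe_refused"])    # correct refusals / all refusals
recall = cells["harmful_refused"] / (cells["harmful_refused"] + cells["harmful_complied"])   # correct refusals / all unsafe prompts
refusal_f1 = 2 * precision * recall / (precision + recall)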
Common Mistakes
- Reporting refusal rate alone. A 99% refusal rate on unsafe prompts is meaningless without the safe-prompt rate.
- Training on harmful prompts only. Models become over-cautious; train on both safe and harmful with balanced labels.
- One guardrail threshold. Different content categories need different sensitivity; tune per category (see the sketch after this list).
- No regression eval per release. Safety behavior drifts with prompt and model changes; gate every release on balanced accuracy.
- Ignoring localization. Refusal patterns vary across languages; XSTest needs language-specific equivalents.
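For the per-category tuning point above, a minimal sketch of what category-specific sensitivity could look like; the categories, scores, and thresholds here are illustrative assumptions, not a FutureAGI API:
# Hypothetical per-category refusal thresholds on a 0-1 unsafe score.
CATEGORY_THRESHOLDS = {
    "weapons": 0.50,         # high ambiguity (e.g. sporting-goods returns): demand stronger evidence
    "self_harm": 0.25,       # severe harm: err toward refusal
    "misinformation": 0.40,
}

def should_refuse(category: str, unsafe_score: float) -> bool:
    # Conservative default for categories without a tuned threshold.
    return unsafe_score >= CATEGORY_THRESHOLDS.get(category, 0.35)

should_refuse("weapons", 0.42)  # -> False: below the weapons-category threshold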
Frequently Asked Questions
What is the XSTest harmful content attack?
XSTest is a safety benchmark that tests whether an LLM correctly distinguishes genuinely unsafe requests from safe-but-superficially-similar ones. The harmful-content portion contains prompts designed to elicit dangerous output; paired safe variants test for over-refusal.
How is XSTest different from HarmBench or SafetyBench?
HarmBench focuses purely on adversarial harmful prompts; SafetyBench is a broader multi-domain benchmark. XSTest is unique in pairing harmful prompts with safe near-twins, so it scores both under-refusal (unsafe behavior) and over-refusal (broken helpfulness).
How does FutureAGI run XSTest-style attacks?
FutureAGI ships XSTest-style scenarios via the simulate SDK and Protect's red-team suite. Each prompt runs through the agent; AnswerRefusal scores refusal behavior on both cohorts (catching under-refusal on harmful prompts and over-refusal on safe ones), and ContentSafety verifies that compliant responses contain no unsafe content.