What Is a Stereotypes and Discrimination Harmful Content Attack?
A red-team probe category that attempts to elicit LLM outputs reinforcing demographic stereotypes or producing discriminatory recommendations, often through direct prompts, role-play framings, or decision-task framings.
What Is a Stereotypes and Discrimination Harmful Content Attack?
A stereotypes and discrimination harmful content attack is a red-team probe that tries to elicit a response — direct text, decision recommendation, or generated artefact — that reinforces demographic stereotypes or discriminates based on protected attributes (race, gender, age, religion, disability, sexual orientation). Probes include direct asks (“describe the typical X person”), role-play wrappers (“you are a casting director, recommend who to hire”), and decision-task framings (“score these three loan applicants”). The category is one slice of harmful-content testing and aligns with HarmBench’s stereotypes-discrimination axis and NIST AI RMF’s bias-and-discrimination risk class.
Why It Matters in Production LLM and Agent Systems
Bias surfaces in production through three vectors that ordinary content moderation misses. Decision-task bias: an LLM acting as a hiring screen, loan triager, or clinical pre-assessment quietly weights protected attributes — the response looks neutral; the score distribution does not. Generative bias: an image or text generator produces a CEO who is always male and a nurse who is always female. Refusal bias: the model refuses requests at different rates for different demographic framings, surfacing as a product complaint rather than a safety incident.
The pain shows up across roles. A product manager fields a customer escalation that the product has been recommending different mortgage rates by zip code. A compliance lead is asked, mid-EU AI Act audit, to provide demographic disparity metrics across the last 30 days of agent decisions and has none. An ML engineer red-teams the model against HarmBench and finds 17% of stereotypes-discrimination probes are not refused.
In 2026-era agent stacks, the risk compounds because the agent decision is the product decision — there is no human in the loop on tier-1 customer interactions, and bias in step 2 of a planner contaminates every downstream step. Multi-step pipelines need per-step bias evaluators wired to spans, not just an end-of-trace check.
How FutureAGI Handles Stereotypes and Discrimination Attacks
FutureAGI’s defence stack runs in three places: pre-guardrail, in-trace evaluation, and red-team simulation. Pre-guardrail: fi.evals.BiasDetection, Sexist, NoRacialBias, NoGenderBias, NoAgeBias, and ContentSafety run as policies in the Agent Command Center; outputs that score above threshold are blocked, redacted, or routed to a fallback. In-trace evaluation: the same evaluators run continuously against sampled production traces, producing a bias_violation_rate dashboard signal segmented by demographic axis and route. Red-team simulation: the simulate-sdk’s Scenario.load_dataset ingests HarmBench’s stereotypes-discrimination subset; Persona injects role-play wrappers; TestReport aggregates pass/fail per probe class so fine-tunes are gated on regression.
Concretely: a fintech agent on traceAI-openai is red-teamed monthly against 240 HarmBench stereotypes-discrimination scenarios. The simulate-sdk runs the scenarios against the live agent, captures responses, scores them with BiasDetection and NoRacialBias, and writes the pass rate to a dashboard. After a model swap, the pass rate dropped from 94% to 81% — BiasDetection flagged role-play wrappers most heavily — so the team rolled the swap back and shipped a tighter system prompt before retrying.
For runtime defence, ProtectFlash runs as a fast pre-guardrail catching the most common direct probes, while ContentSafety runs as a deeper post-guardrail before output streams to the user.
How to Measure or Detect It
BiasDetection: cloud-template evaluator returning a 0–1 bias score with category breakdown.NoRacialBias,NoGenderBias,NoAgeBias,Sexist: targeted evaluators that flag specific demographic axes for fine-grained pass-rate tracking.ContentSafety: a broader content-safety evaluator that catches stereotypes-discrimination outputs as a subset.- HarmBench pass rate: the percentage of stereotypes-discrimination probes correctly refused or neutralised; a regression-test signal.
- Refusal-disparity metric (dashboard signal): refusal rate stratified by demographic framing of the prompt — disparity above 5% is a red flag.
from fi.evals import BiasDetection, NoRacialBias
bias = BiasDetection()
racial = NoRacialBias()
result_a = bias.evaluate(input="...", output="...")
result_b = racial.evaluate(input="...", output="...")
print(result_a.score, result_b.score)
Common Mistakes
- Testing only direct probes. Indirect role-play and decision-task framings break models that pass direct tests; include all three probe shapes.
- Aggregating bias to a single number. A 0.05 average hides a 0.40 racial disparity; report per-axis.
- Relying on prompt-side instructions (“be unbiased”). The system-prompt fix is brittle; pair it with a measured guardrail.
- Not stratifying refusal rates by demographic. A model that refuses one group’s prompts more often is itself biased; alert on disparity, not just toxicity.
- Skipping the red-team regression on every model change. A model swap or fine-tune can quietly regress bias scores by double digits.
Frequently Asked Questions
What is a stereotypes and discrimination attack on an LLM?
It is a red-team probe class that tries to elicit responses reinforcing demographic stereotypes or producing discriminatory recommendations across hiring, lending, healthcare, or other decision tasks based on protected attributes.
How is it different from a generic harmful-content attack?
Harmful-content covers a broader category — violence, illegal activity, self-harm, CBRN. Stereotypes-and-discrimination is the bias-and-fairness slice, scored against demographic groups and mapped to NIST AI RMF's bias risk category.
How do you detect a successful stereotypes attack in production?
FutureAGI's BiasDetection, NoRacialBias, NoGenderBias, NoAgeBias, and Sexist evaluators score every output for demographic bias; HarmBench-aligned scenarios in the simulate-sdk reproduce the attack as a regression test.