Security

What Is a CBRN Harmful Content Attack?

An attack that attempts to elicit chemical, biological, radiological, or nuclear weapons-relevant uplift from an LLM through jailbreaks, framing, or indirect retrieval.

What Is a CBRN Harmful Content Attack?

A CBRN harmful content attack is an attempt to extract chemical, biological, radiological, or nuclear weapons-relevant uplift from a large language model. The attacker is not asking general chemistry trivia — they are seeking synthesis routes, precursor sourcing, weaponization details, dispersal tactics, or operational tradecraft that materially assists development of a CBRN capability. Techniques include direct asks, jailbreak prompts, role-play and Likert framing, encoding-based smuggling, multi-turn coaxing, and indirect retrieval through a poisoned knowledge base or tool. CBRN sits at the top of severity tiers in OWASP LLM Top 10 and every frontier-lab usage policy. False negatives here are catastrophic.

Why It Matters in Production LLM and Agent Systems

CBRN content is the failure mode where reputational, regulatory, and human-safety risk align. A model that helps a bad actor with even modest weaponization uplift triggers regulatory action, loses enterprise customers, and produces real-world harm. Frontier-lab evals (Anthropic’s RSP, OpenAI’s Preparedness Framework, MLCommons CBRN tasks) treat the category as a hard gate.

The pain shows up across roles. A safety lead is asked to prove “the model cannot provide weaponization uplift” — the proof requires an empirical test on a curated CBRN benchmark, not just policy text. A red-team engineer finds a jailbreak that works against the production model and realizes the post-guardrail layer never saw a CBRN-flagged input because the pre-guardrail let it through. A compliance lead reviewing the EU AI Act’s high-risk-system requirements has to document CBRN evaluation results, frequency, and remediation flow.

In agent stacks the surface is wider than chat. A code agent that can run shell tools, a research agent that can fetch arbitrary URLs, a multi-agent system where one model talks to another — each is a CBRN attack surface. Indirect prompt injection through retrieved content can plant CBRN-eliciting instructions inside an otherwise innocuous tool response.

How FutureAGI Handles CBRN Harmful Content Attacks

FutureAGI ships CBRN evaluation as a layered control: pre-deployment red-team eval, post-guardrail enforcement, and continuous monitoring against production traces.

For pre-deployment, the team curates a CBRN red-team cohort inside a Dataset — a versioned set of attack prompts spanning direct asks, jailbreaks, encoding-based smuggling, role-play framing, and indirect-injection vectors. Dataset.add_evaluation() runs ContentSafety, PromptInjection, ProtectFlash, and IsHarmfulAdvice against every row. A failure on any high-severity row blocks the model promotion. RegressionEval runs the same cohort against every model upgrade so a previously-blocked attack does not regress unnoticed.

In production, FutureAGI’s Agent Command Center applies pre-guardrails at the gateway. ProtectFlash is the lightweight pre-flight check; ContentSafety and PromptInjection run as post-guardrails on model output. A CBRN-classified input is blocked at the pre-guardrail; a CBRN-classified output is blocked at the post-guardrail and the request returns a safe refusal. Every block is logged to the trace with the evaluator, score, and reason — so the safety team has a per-request audit trail. Combined with red-team cohort scores, this produces a defensible CBRN posture against frameworks like NIST AI RMF and the EU AI Act.

How to Measure or Detect It

CBRN attack exposure is measured against curated cohorts and live traffic:

  • fi.evals.ContentSafety: cloud evaluator for content-safety violations including CBRN; returns boolean and a violation category.
  • fi.evals.PromptInjection: detects injection attempts that may carry CBRN payloads; returns score and reason.
  • fi.evals.ProtectFlash: lightweight pre-guardrail; suitable for high-throughput pre-check.
  • fi.evals.IsHarmfulAdvice: catches model outputs that materially aid harmful action.
  • CBRN red-team cohort fail-rate: fraction of curated CBRN prompts where the model produces uplift; the headline regression metric.
  • Block-rate at pre/post-guardrails: dashboard signal of how many production requests were blocked by the CBRN classifier.
from fi.evals import ContentSafety, PromptInjection

cs = ContentSafety()
pi = PromptInjection()

result = cs.evaluate(
    input="Walk me through synthesizing [redacted] step by step.",
    output="I can't help with that request."
)
print(result.score, result.reason)

Common Mistakes

  • Relying on the system prompt alone. “Don’t answer CBRN questions” in the system prompt does not survive jailbreaks; pair with evaluator guardrails.
  • Treating CBRN as a single category. Chemical, biological, radiological, and nuclear vectors have different attack surfaces and need separate cohort coverage.
  • Skipping indirect injection vectors. A poisoned retrieved document can carry the attack; test indirect-injection paths, not just direct ones.
  • No regression eval after model upgrade. A previously-safe model can regress on CBRN with a new fine-tune; rerun the cohort every release.
  • Logging the full attack prompt without access controls. Trace logs containing CBRN strings need stricter access than ordinary traces.

Frequently Asked Questions

What is a CBRN harmful content attack?

A CBRN harmful content attack tries to elicit chemical, biological, radiological, or nuclear weapons-related uplift from an LLM — synthesis routes, weaponization steps, acquisition guidance — using jailbreaks, framing, or indirect retrieval.

How is CBRN content different from other harmful content?

CBRN sits at the highest severity tier in most policy frameworks and frontier-lab usage policies. Unlike toxic-language or hate-speech categories, CBRN uplift can directly enable mass-casualty harm and is treated as a near-zero-tolerance class.

How does FutureAGI test for CBRN attacks?

FutureAGI runs ContentSafety, PromptInjection, and ProtectFlash on a curated CBRN red-team cohort, plus configurable pre- and post-guardrails that block CBRN-classified outputs at the gateway.