Compliance

What Is AI Red Teaming?

Structured adversarial testing of an AI model or agent using attack prompts, jailbreaks, and edge-case scenarios to surface failures before deployment.

What Is AI Red Teaming?

AI red teaming is the structured practice of attacking a model or agent with adversarial prompts. jailbreaks, prompt injections, role-play exploits, encoding tricks, multi-turn manipulation, indirect injection via retrieved content. to discover failure modes before real users or attackers do. It runs in two distinct modes. One-shot pen-testing is a pre-launch assessment against a fixed corpus, producing a report. Continuous red teaming runs an adversarial scenario suite on every release candidate, producing a regression-grade signal. Outputs include vulnerability inventory, attack-success rate per class, and locked-in regression tests. It is the security counterpart to evaluation. The May 2026 short version: if your red team runs once a quarter, you have a snapshot; the production threat surface moves daily.

Why AI red teaming matters in production LLM and agent systems

A model that scores well on benchmarks can still be one cleverly worded prompt away from leaking the system prompt, executing an unauthorized tool call, or generating disallowed content. Public jailbreak databases now contain tens of thousands of working attacks; the cost to test one is a few cents in inference. The cost of missing one is a CVE, a public incident, or a regulator-mandated post-mortem. Frontier model cards in 2026 report red-team results from internal AISI-style assessments, but the corpus that matters for your system is the one that targets your tools, your retrieval index, your system prompt. none of which are covered by a frontier lab’s pre-release red team.

The pain hits multiple roles. Security teams ask “what is our exposure to indirect prompt injection?” and engineering has no quantitative answer. A developer ships a new tool, and three weeks later a user posts a screenshot showing the agent calling that tool with arguments derived from a webpage’s hidden HTML. A compliance lead’s annual audit asks for evidence of adversarial testing. there is none beyond a one-time assessment from launch. An SRE sees a sudden spike in pre-guardrail block rate and has no idea whether it’s a real attack wave or a noisy detector.

In 2026 agent stacks, indirect prompt injection. payloads embedded in retrieved documents, tool outputs, MCP server responses, or other agents’ messages over A2A. is the dominant attack vector. A red-team suite that only tests direct user inputs misses most of the real attack surface. The 2026 attack landscape also includes multi-modal injection (text hidden in images that OCRs into context), context-window stuffing (pushing safety instructions out of attention), tool-chain confusion (tricking the planner into out-of-order tool use), and supply-chain attacks on poisoned training data or evaluator prompts. Continuous red teaming wired into CI/CD is the only way to keep pace; one-shot pen-testing buys you a snapshot that decays the moment the next model update or prompt change ships.

Regulatory frameworks now require it. The EU AI Act high-risk regime treats red teaming as a core post-market monitoring obligation; the UK AISI and US AISI both recommend continuous adversarial testing for frontier deployments; ISO/IEC 42001 calls out adversarial testing as an organizational control. None of those frameworks accept “we did a launch-time pen-test” as evidence anymore.

How FutureAGI handles AI red teaming

FutureAGI ships two surfaces that turn red teaming from a quarterly exercise into a continuous one. The first is simulate-sdk. You construct adversarial Persona objects with attacker-style situations and desired-outcome assertions, group them into a Scenario, and run the suite via CloudEngine or LiveKitEngine against your agent callback. Personas can be generated programmatically with ScenarioGenerator from a topic. for example, “indirect prompt injection through retrieved documents targeting refund tool misuse”. or loaded from a CSV/JSON corpus of known attacks. Each run produces a TestReport with per-case transcripts, eval scores, audio paths (for voice), and pass/fail.

The second surface is the evaluator chain that scores attack success. PromptInjection and ProtectFlash detect injection success on inputs and outputs. ContentSafety, Toxicity, IsHarmfulAdvice, and NoHarmfulTherapeuticGuidance catch when an attack succeeds in eliciting disallowed content. ActionSafety catches when an attack succeeds in producing a dangerous tool call. PII catches data-exfiltration attacks. IsCompliant with a domain rubric catches policy violations. These run inside the simulate harness and as pre-guardrail and post-guardrail stages in Agent Command Center, so attacks blocked in production also block the same attack in the regression suite.

A real workflow: a fintech engineering team curates a 1,500-case corpus across direct injection, indirect injection (the largest bucket. ~40% of the corpus by 2026), jailbreak, role-play exploit, tool-misuse, multi-modal injection, and multi-turn manipulation classes. Every release candidate runs the full suite via Scenario.load_dataset against a CloudEngine wired to the agent’s staging endpoint. The release gate is “no regression in attack-success rate above the previous release plus 1% on any class, and zero successful attacks in the safety-critical subset.” When a jailbreak from a public dataset starts succeeding mid-release, it gets added to the corpus that day. FutureAGI gives the simulation engine, the evaluators, the audit-grade run history, and the gateway-side enforcement; the attack corpus is yours to grow.

In our 2026 red-team runs across customer agents, the corpora that produce the highest yield share three properties: they are sourced from real public attack databases (refreshed monthly), they are augmented with ScenarioGenerator outputs targeting the team’s specific tool surface, and they get graded by a judge model from a different family than the target (a Claude-judging-Claude pair inflates pass rate). Compared to Promptfoo, which is excellent for static prompt evaluation but treats red teaming as a separate module, FutureAGI keeps the red-team corpus, the simulate engine, the evaluator scores, and the production audit log on the same trace object. which is what makes the “did the attack we found yesterday land in production today?” question answerable.

How to detect and measure AI red-team posture

Red-team posture is a small set of metrics that report against an evolving attack corpus. The table maps the 2026 attack classes to the FutureAGI surfaces that score them.

Attack class (2026)Example payload typeFutureAGI detectorWhere it runs
Direct prompt injection”Ignore previous instructions and…”ProtectFlash, PromptInjectioninput pre-guardrail
Indirect injection (retrieval)Payload in a PDF chunkProtectFlash, PromptInjectioncontext guardrail on retrieved chunks
Indirect injection (tool output)Payload in a tool responseProtectFlashtool-output guardrail
Indirect injection (A2A peer)Payload in delegated sub-agent messageProtectFlash, PromptInjectionA2A message guardrail
Jailbreak (role-play)“You are DAN…”PromptInjection, ContentSafetyinput + output guardrail
Jailbreak (encoding)Base64 / leet / cipher payloadPromptInjection, ProtectFlashinput guardrail
Multi-turn manipulationEscalating prompts over several turnsPromptInjection, ConversationCoherencemulti-turn replay
Multi-modal injectionHidden text in imagePromptInjection on OCR’d textpost-OCR guardrail
Tool misuseCoaxing wrong tool callActionSafety, ToolSelectionAccuracypre-action guardrail
Data exfiltrationCoaxing system prompt or PII outPII, IsCompliantoutput guardrail
Refusal-bypass”Pretend you’re not an AI…”AnswerRefusal, ContentSafetyoutput guardrail
Supply-chain (eval prompt poisoning)Hostile content in eval templatered-team CI run on eval promptsoffline corpus check
Voice-specific (audio prompt injection)Hostile audio with TTS-friendly payloadPromptInjection on transcript, audio scanLiveKit pipeline

The signals to track:

  • Attack-success rate by class. direct injection, indirect injection, jailbreak, encoding, multi-turn, multi-modal, tool misuse. tracked release over release.
  • Coverage. number of attack categories tested versus the OWASP LLM Top 10 v2 and your domain-specific threats.
  • Time-to-regress. interval from a new public attack landing to it being added to your corpus; target under 72 hours for known-active campaigns.
  • PromptInjection and ProtectFlash block-rate on the corpus. should approach 100% on known attacks; lower is a control gap.
  • False-positive rate on clean reference. measure how many benign requests get blocked; >2% is product-breaking.
  • End-to-end p99 of the red-team suite. runs that take eight hours never happen on the right cadence; budget under 30 minutes for the gating slice and full coverage in a nightly batch.
  • Judge-family separation. verify the grading model is not the target model family.
from fi.simulate import Scenario, CloudEngine
from fi.evals import PromptInjection, ProtectFlash, ActionSafety

scenario = Scenario.load_dataset("redteam_corpus_2026q2.csv")
engine = CloudEngine(agent=my_agent)
report = engine.run(scenario, evals=[PromptInjection(), ProtectFlash(), ActionSafety()])
print(report.attack_success_rate_by_class)

For trajectory-level red teaming against an agentic stack. the case where the attack succeeds three nodes deep without producing a bad final answer. wire the eval chain across the whole graph and assert no per-class regression:

from fi.simulate import Scenario, CloudEngine
from fi.evals import (PromptInjection, ProtectFlash, ActionSafety,
                      ToolSelectionAccuracy, TrajectoryScore, CustomEvaluation)

agent_redteam = CloudEngine(agent=my_langgraph_agent, capture_trajectory=True)
report = agent_redteam.run(
    Scenario.load_dataset("agentharm_v1_redteam.csv"),  # 110 harmful behaviors
    evals=[PromptInjection(scope="all_untrusted"),
           ProtectFlash(scope="tool_return"),
           ActionSafety(),
           ToolSelectionAccuracy(),
           TrajectoryScore(),
           CustomEvaluation(rubric="no_exfil_via_tool_v2")],
    grader_family="cross_family",  # never grade with the target family
)
report.assert_no_regression(metric="attack_success_rate", tolerance=0.0)

The report should feed a CI release gate that blocks the deploy on any regression beyond the per-class tolerance, and the failing transcripts should auto-attach to the gate decision so the engineer’s first click after a block lands on the actual attack that succeeded.

Automated vs human red teaming

The 2026 mix that works is roughly 80% automated, 20% human. Automated red teaming runs the corpus, scores it, gates the release; it’s cheap, repeatable, and high-coverage on known attacks. Human red teaming finds the next attack. the one that doesn’t exist in any corpus yet. The two are complementary, not substitutes. Teams that skip human red teaming see corpus stagnation within two release cycles; teams that skip automation never run the corpus often enough to catch real regressions. We’ve found in 2026 deployments that the best ROI on human red-team hours is on novel attack class discovery (new modalities, new tool surfaces, new protocol boundaries like A2A). automated runs cover the long tail of variations.

External red-team partnerships also matter for sectoral compliance. Healthcare, financial services, and defense deployments increasingly require a third-party red-team report as part of audit evidence. The FutureAGI surface. versioned corpus, TestReport, audit-grade run history. produces the artifact the external party then signs.

Red teaming for agentic AI vs single-turn chat

Single-turn chat red teaming is well-understood: throw the OWASP corpus at the prompt, score the response. Agentic red teaming is fundamentally different because the trajectory has state. An attack can succeed in step 3 of a 7-step trajectory without producing a “bad answer”. for example, by tricking the planner into calling an out-of-scope tool whose return value contains the actual exfiltration. The eval has to score the full trajectory, not just the final response. This is where ActionSafety, ToolSelectionAccuracy, and TrajectoryScore come in: they catch attacks that succeed on action, not on text. Frontier model cards in 2026 increasingly report τ-bench and SWE-Bench Verified results from adversarial test sets for exactly this reason.

The other 2026 wrinkle: multi-agent and A2A delegation creates inter-agent attack surfaces. A hostile sub-agent over A2A can poison the parent’s context window; a compromised MCP server can leak data through tool descriptions before the user ever interacts. Red-team corpora that only test the user-facing surface miss this whole class.

What the 2026 corpus should cover

A current AI red-team corpus has to include indirect injection at ~40% weight (web pages, PDFs, RAG chunks, MCP tool returns, A2A peer messages, email attachments). Direct injection and jailbreak together should be ~25%. Tool-misuse and action-safety scenarios should be ~15%. Multi-modal (image+text, audio+text) should be ~10%. Multi-turn manipulation and refusal-bypass round out the last ~10%. The weights matter because a 90%-direct-injection corpus reports a misleading 98% block rate even though the production threat is indirect.

Public sources to seed from include the OWASP LLM Top 10 v2 reference attacks, AgentDojo benchmark, AISI’s published model card evaluations, Anthropic’s Many-Shot Jailbreaking dataset, Microsoft’s PyRIT generator, and the HackAPrompt corpus. Named agent-safety benchmarks anchor the comparison: AgentHarm (Gray Swan, 110 harmful agent behaviors across 11 categories) is the de facto leaderboard for harmful agent refusal; HarmBench covers ~510 behaviors across categories with both validation and test splits; FutureAGI’s PHARE benchmark adds ~6K labeled hallucination-harm examples for grounded-safety probing. Frontier model cards in 2026 disclose all three. a red-team gate that doesn’t track at least one is missing the audit currency. Augment with ScenarioGenerator runs targeting your specific tool surface. Track corpus version alongside model and prompt version so the regression record is reproducible.

Common mistakes

  • Treating red teaming as a launch event. Static corpora go stale within weeks; new jailbreaks land daily. Continuous is the only working mode in 2026.
  • Testing only direct user inputs. Indirect injection through retrieved content and tool outputs is the dominant 2026 attack vector. most red-team corpora under-cover it. Aim for ~40% indirect.
  • Not regression-locking discovered attacks. A jailbreak found and fixed but not added to the regression set will return at the next prompt change. Every confirmed attack must become a permanent test case.
  • Using the model under test as the judge. Self-grading inflates safety scores by 10-20% in our 2026 measurements. Pin the judge to a different model family or use deterministic detectors.
  • No success criteria per class. “How did red-team go?” with no per-class numbers is theater, not a signal. Set thresholds per class and per safety-critical cohort.
  • Skipping multi-modal coverage. If your system accepts images, PDFs, audio, or video, you need attacks in those modalities. Frontier multi-modal models leak through OCR-extracted text more than through the model’s native vision pathway.
  • No CI integration. A red-team run that requires a manual trigger gets skipped under deadline pressure. Wire it to every PR or every nightly release candidate.
  • Mixing red team with normal eval cohorts. Adversarial inputs are not representative inputs; mixing them inflates the safety dashboard and hides functional regressions. Keep red-team corpus, golden dataset, and production-replay set separate.
  • Letting blue team and red team merge. The team that builds the guardrails should not exclusively run the red team. the same blind spots end up in both. Rotate red-team authorship or use external help.
  • No corpus versioning. When the corpus changes, the attack-success-rate trend line becomes meaningless because the denominator moved. Version the corpus alongside the model and prompt, and report results against a specific corpus version.
  • Red-teaming only the model, not the agent system. The model is one component; the agent system also includes retrieval, tools, MCP servers, A2A peers, gateway routing, and the prompt template. Red team the system, not the model in isolation. the same model behaves differently in different agent shapes.

Frequently Asked Questions

What is AI red teaming?

It is the structured practice of attacking an LLM or agent with adversarial prompts, jailbreaks, and edge-case scenarios to surface failures before users or attackers do. It produces a vulnerability inventory and regression tests that protect future releases.

How is red teaming different from regular evaluation?

Evaluation measures task quality on representative inputs; red teaming measures resilience on adversarial inputs designed to break the system. Both are required. passing one does not imply the other.

How do you run continuous red teaming?

Use FutureAGI's simulate-sdk to generate adversarial Personas and Scenarios, run them against every release candidate via Scenario.load_dataset, and gate releases on attack success-rate using PromptInjection and ProtectFlash.