How is threat modeling for AI different from AI red teaming?

Threat modeling is an architecture review that predicts attack paths before or during design. AI red teaming actively tests those paths with adversarial prompts, poisoned context, tools, and scenarios.

What Is Threat Modeling for AI? FutureAGI Guide (2026)

Q: What is threat modeling for AI?

Threat modeling for AI maps how LLM applications, RAG systems, and agents can be attacked. It identifies trust boundaries, attacker goals, likely failure modes, and measurable controls before production.

Q: How do you measure threat modeling for AI?

Measure it by linking each modeled threat to FutureAGI evaluators such as `PromptInjection`, `ProtectFlash`, and `ToolSelectionAccuracy`, then tracking eval-fail-rate-by-cohort and trace fields such as `agent.trajectory.step`.

What Is Threat Modeling for AI?

Threat modeling for AI is the security practice of mapping how an LLM, RAG pipeline, or agent can be attacked before it reaches production. It is an AI security review method for eval pipelines, traces, gateways, tools, memory, and datasets, where untrusted text can become model instructions or tool actions. A good threat model names trust boundaries, attacker goals, likely failure modes, measurable controls, and regression tests. FutureAGI connects those risks to evaluators such as PromptInjection, ProtectFlash, and ToolSelectionAccuracy.

Why it matters in production LLM/agent systems

Threat modeling is what prevents a harmless-looking architecture diagram from hiding the real incident path. A support agent may read a poisoned help-center article, treat it as higher-priority guidance, call a billing export tool, and place customer data into a ticket. A coding assistant may accept repository text as trusted instructions and run a shell command that was embedded in documentation. If those paths were never modeled, the first useful evidence arrives after the model has already acted.

The pain is shared. Developers see traces where the planner chose a tool that the user never requested. SREs see p99 latency rise because guardrails retry or block late in the request. Security teams see prompt leakage, PII leak, excessive agency, cross-session memory bleed, and suspicious llm.token_count.prompt growth. Compliance teams need proof that controls exist for each data class and tool permission, not just a spreadsheet that says “reviewed.”

Agentic systems make the review harder because every step creates a new trust boundary: user input, retrieved chunks, tool output, memory write, planner state, browser page, MCP connector, and final response. Unlike a generic STRIDE pass, AI threat modeling has to account for instruction hierarchy, stochastic outputs, retrieval context, and tool authority. The output should not be a static risk register; it should become eval cases, guardrail rules, trace queries, and regression thresholds.

How FutureAGI handles threat modeling for AI

FutureAGI handles threat modeling as a map from threat to measurable control. The eval:* anchor for this entry is the evaluator layer: PromptInjection checks whether untrusted text is trying to override instructions, ProtectFlash provides a lightweight prompt-injection screen for guardrail placement, and ToolSelectionAccuracy verifies that an agent chose the expected tool for the task. Teams often pair those with ActionSafety when the agent has write-capable tools.

Consider a financial-support agent using traceAI-langchain, a retrieval step, account lookup, refund creation, and case-note writing. The engineer models four attack paths: hostile retrieved content, user-requested data exfiltration, unsafe refund action, and memory contamination. In Agent Command Center, the route support-agent-secure runs ProtectFlash as a pre-guardrail before retrieved chunks enter the planner, then runs a post-guardrail before the response leaves the system. The trace keeps agent.trajectory.step, tool.name, tool.output, prompt version, and route decision around each step.

FutureAGI’s approach is evidence-first: the threat model is only useful when each threat has a trace query, evaluator, threshold, and owner. If PromptInjection flags a chunk from a partner wiki, the engineer quarantines that source, adds the trace to a regression dataset, and blocks release until the failing cohort clears. If ToolSelectionAccuracy drops on refund scenarios, the next action is permission tightening, a fallback path, or human approval before any write tool executes.

How to measure or detect it

Measure an AI threat model by checking whether each risk has a live signal and a response path:

PromptInjection — returns a prompt-injection risk signal for user input, retrieved content, and tool output; track fail rate by source and route.
ProtectFlash — provides a fast prompt-injection check for pre-guardrail placement when latency matters.
ToolSelectionAccuracy — checks whether the agent selected the expected tool; slice failures by agent.trajectory.step.
Trace signals — inspect llm.token_count.prompt, tool.output, prompt version, route name, memory write events, and guardrail decision.
Dashboard signals — watch eval-fail-rate-by-cohort, guardrail-block-rate, escalation-rate, token-cost-per-trace, and p99 latency after controls are added.

from fi.evals import PromptInjection, ProtectFlash

prompt_risk = PromptInjection().evaluate(input=external_text)
fast_risk = ProtectFlash().evaluate(input=external_text)
if prompt_risk.score >= 0.8 or fast_risk.score >= 0.8:
    decision = "block"

The goal is not a perfect score. It is traceable coverage: every modeled threat should connect to an evaluator, log field, guardrail action, owner, and regression test.

Common mistakes

AI threat models fail when they describe concerns but never bind them to runtime evidence. Watch for these patterns:

Modeling only the prompt. Retrieval, memory, tool responses, browser pages, and uploaded files can all carry instructions into the model context.
Treating OWASP LLM Top 10 as the model. OWASP is a checklist; your tool permissions and data flows create the real attack graph.
Skipping negative tool cases. A valid agent should sometimes refuse a tool, choose read-only access, or escalate before taking a write action.
Logging before classification. Raw traces can store secrets or PII unless redaction runs before durable storage.
Using one control for every threat. Prompt injection, excessive agency, PII leak, and denial-of-service need different thresholds and response actions.