What Is Threat Modeling for AI?

Threat modeling for AI is the structured practice of enumerating attack surfaces and adversarial scenarios specific to ML and LLM systems before deployment. It extends classical STRIDE-style threat modeling with LLM-specific threats: direct and indirect prompt injection, data poisoning, model extraction, jailbreaks, training-data leakage, excessive agency in autonomous agents, and tool-call abuse. The deliverable is a prioritized risk register that maps each threat to a concrete mitigation — guardrails, eval suites, red-team tests, monitoring rules. FutureAGI sits in the detection-and-evaluation layer, turning threat-model entries into runtime checks.
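
The register itself need not be elaborate; a minimal sketch of what one entry might look like, with hypothetical field names rather than any prescribed schema:

from dataclasses import dataclass

@dataclass
class ThreatEntry:
    threat: str       # the adversarial scenario, e.g. indirect injection via RAG
    severity: str     # prioritization bucket: "high", "medium", "low"
    mitigation: str   # guardrail, eval suite, red-team test, or monitoring rule
    detection: str    # the runtime check or eval that makes the threat observable

register = [
    ThreatEntry(
        threat="indirect prompt injection via retrieved documents",
        severity="high",
        mitigation="scan retrieved chunks before they enter the prompt",
        detection="PromptInjection evaluator on every retrieval",
    ),
]
print(f"{len(register)} threat(s) documented")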

Why It Matters in Production LLM and Agent Systems

Most LLM teams ship with classical security checks (auth, rate limits, input length caps) and assume that is enough. It is not. The OWASP LLM Top 10 catalogs threats traditional appsec misses entirely: indirect injection through retrieved documents, model-extraction via repeated API queries, supply-chain poisoning through fine-tuning data, sensitive-information disclosure through inference-time leakage. A team that has never threat-modeled its LLM stack does not know which of these applies.

Security engineers feel this when an external red-team writes a report listing five attack paths nobody had documented. Compliance leads feel it during regulatory readiness reviews — EU AI Act and SOC2 both ask for documented threat assessments. Application engineers feel it when a customer reports the agent leaked another customer’s data through cross-session memory, and the post-mortem reveals the threat was modelable but unmodeled.

For 2026 agent stacks the surface multiplies. Every tool the agent can call is an attack vector for indirect injection. Every retrieved document is a potential poison source. Every memory store is a potential cross-session leak. Every model-context-protocol server is a third-party trust boundary. Threat modeling agentic systems means enumerating all of these and tying each to a guardrail or eval; otherwise the agent ships with an attack surface nobody has written down.
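
One way to keep that enumeration honest is to derive candidate register entries directly from the agent's configuration. A rough sketch, assuming a hypothetical config dictionary rather than any particular agent framework:

# Hypothetical agent configuration; a real stack would load this from its framework.
agent_surfaces = {
    "tools": ["search_web", "send_payment"],
    "retrieval_sources": ["kb_articles", "customer_uploads"],
    "memory_stores": ["session_memory"],
    "mcp_servers": ["third_party_crm"],
}

# Default threat category per surface type, following the enumeration above.
threat_by_surface = {
    "tools": "indirect injection / tool-call abuse",
    "retrieval_sources": "data poisoning / indirect injection",
    "memory_stores": "cross-session data leakage",
    "mcp_servers": "third-party trust boundary",
}

for surface, items in agent_surfaces.items():
    for item in items:
        print(f"{item}: {threat_by_surface[surface]} -> needs a guardrail or eval")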

How FutureAGI Handles Threat Modeling for AI

FutureAGI does not produce the threat-model document — that is a security-team activity using frameworks like MITRE ATLAS, OWASP LLM Top 10, or NIST AI RMF as a starting taxonomy. FutureAGI is the detection-and-evaluation layer that operationalizes the model: each threat in the register maps to a runtime check or a regression eval. The relevant fi.evals surfaces are PromptInjection (direct and indirect injection signals on every input), ProtectFlash (lightweight injection detection at the gateway boundary), ContentSafety and Toxicity (output-side filtering), IsCompliant and DataPrivacyCompliance (regulatory checks), and PII (sensitive-information disclosure).

A real workflow: a financial-services team threat-models a customer-facing agent. The register includes (a) prompt injection via uploaded documents, (b) PII leakage through cross-session memory, (c) excessive agency on payment tools, (d) indirect injection through retrieved knowledge-base entries. Each maps to a FutureAGI control: (a) PromptInjection evaluator on every retrieved chunk before it enters the prompt; (b) PII evaluator on every memory write; (c) Agent Command Center pre-guardrail policy that blocks payment-tool calls without human-in-the-loop confirmation; (d) ProtectFlash at the gateway plus a weekly AI red-teaming simulation suite via LiveKitEngine. Each control writes its outcome to traceAI spans, so audits trace from threat to detection in one query.
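
Control (c) is worth spelling out, since excessive agency is the threat teams most often leave implicit. A simplified sketch of the pre-call gate, using hypothetical function and policy names rather than the actual Agent Command Center API:

# Hypothetical policy: payment tools require human-in-the-loop confirmation.
SENSITIVE_TOOLS = {"send_payment", "issue_refund"}

def allow_tool_call(tool_name: str, human_confirmed: bool) -> bool:
    """Block sensitive tool calls unless a human confirmed this specific action."""
    if tool_name in SENSITIVE_TOOLS and not human_confirmed:
        return False  # agent pauses and requests confirmation instead of acting
    return True

print(allow_tool_call("send_payment", human_confirmed=False))  # False: blocked
print(allow_tool_call("search_web", human_confirmed=False))    # True: allowed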

Unlike a generic LLM firewall that ships a fixed ruleset, FutureAGI’s controls are evaluator-graded and regression-tested — the threat model becomes living infrastructure rather than a one-time PDF.

How to Measure or Detect It

Measure each threat’s coverage and detection rate independently:

  • PromptInjection: returns whether an input contains injection signals; the canonical detection metric for direct and indirect attacks.
  • ProtectFlash: lightweight injection check at the gateway; pair with PromptInjection for layered defense.
  • PII + DataPrivacyCompliance: detection rate for sensitive-information disclosure across inputs and outputs.
  • Red-team eval-fail-rate: percentage of simulated attacks that bypass guardrails — the canonical robustness metric.
  • Threat-coverage rate: percentage of threat-model entries with at least one production control or eval; should be 100% before deployment.

Minimal Python:

from fi.evals import PromptInjection, ProtectFlash

# Evaluators from the threat register: layered injection detection.
injection = PromptInjection()   # direct and indirect injection signals
flash = ProtectFlash()          # lightweight check at the gateway boundary

# Placeholder inputs; in production these come from the request and the retriever.
user_input = "Summarize my account activity for last month."
retrieved_chunks = ["...retrieved knowledge-base passages..."]

result_a = injection.evaluate(input=user_input, context=retrieved_chunks)
result_b = flash.evaluate(input=user_input)
print(result_a.score, result_b.score)
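
Threat-coverage rate is the easiest metric to automate. A minimal sketch, assuming the register is kept as a list of entries whose controls field names the evaluators or guardrails mapped to each threat:

# Hypothetical register format: each entry names its threat and mapped controls.
register = [
    {"threat": "direct prompt injection", "controls": ["PromptInjection", "ProtectFlash"]},
    {"threat": "PII leakage via cross-session memory", "controls": ["PII"]},
    {"threat": "model extraction via repeated API queries", "controls": []},  # uncovered
]

covered = sum(1 for entry in register if entry["controls"])
coverage_rate = covered / len(register)
print(f"threat-coverage rate: {coverage_rate:.0%}")  # should read 100% before deployment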

Common Mistakes

  • Treating prompt injection as a single threat. Direct and indirect injection have different attack surfaces and need different controls; model and detect them separately.
  • Skipping the supply-chain section. Fine-tuning data, RAG corpus, MCP servers, and tool APIs are all third-party trust boundaries that need explicit threat-model entries.
  • Threat-modeling once. Every new tool, model, or integration changes the surface; rerun the model on each meaningful architecture change.
  • No mapping from threat to detection. A threat without a runtime check or eval is undetectable; tie every register entry to a FutureAGI evaluator or guardrail.
  • Confusing red-teaming with threat modeling. Threat modeling enumerates the surface; red-teaming validates whether known threats are actually catchable. Both are needed.

Frequently Asked Questions

What is threat modeling for AI?

Threat modeling for AI is the structured practice of enumerating attack surfaces and adversarial scenarios for ML and LLM systems before deployment, then mapping each threat to a prioritized mitigation.

How is AI threat modeling different from traditional STRIDE?

STRIDE covers spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of privilege. AI threat modeling adds LLM-specific threats: prompt injection, data poisoning, model extraction, jailbreaks, and excessive agency.

How do you operationalize threat modeling outputs?

FutureAGI converts threat-model entries into runtime guardrails and eval suites — PromptInjection and ProtectFlash for injection threats, plus red-team scenarios run via LiveKitEngine simulations on a regression schedule.