Security

What Is Agent Hijacking?

A failure mode where hostile context or attacker instructions take control of an AI agent's plan, tools, or memory.

What Is Agent Hijacking?

Agent hijacking is an AI security failure where hostile instructions or untrusted context take control of an agent’s plan, tools, or memory. It is an agent security failure mode that appears in eval pipelines, production traces, RAG retrieval, tool calls, and gateway guardrails. The practical signal is not just a bad answer; it is an unsafe next action. FutureAGI measures the risk with ActionSafety, ToolSelectionAccuracy, and trace evidence for each agent.trajectory.step.

Why it matters in production LLM/agent systems

Agent hijacking turns a model instruction error into attacker-directed work. A support agent can be pushed from “summarize this ticket” into exporting customer data. A browser agent can read a poisoned page and begin following the page’s hidden checklist. A coding agent can accept hostile repository text as task authority, then run a tool or edit a file the user never asked for.

Two failures usually define the incident. Control-plane takeover happens when the planner starts treating attacker text as a higher-priority instruction than the system policy or user request. Unsafe tool execution happens when that takeover reaches a write tool, external API, email sender, payment flow, ticketing system, memory store, or MCP server. Prompt injection may start the chain, but hijacking is the moment the agent’s control path changes owner.

The pain is visible across teams. Developers see strange tool choices, unexplained planner jumps, and traces where the final answer sounds normal while an earlier step already acted. SREs see abnormal tool-call fan-out, longer trajectories, p99 latency spikes from retries, and token-cost-per-trace increases. Security teams need proof of source text, tool arguments, approval state, and policy verdict. Users feel it as actions they did not request.

This matters more in 2026 multi-step pipelines because agents read, remember, retrieve, browse, delegate, and execute. Every read boundary can become an instruction boundary. Every write boundary can become blast radius.

How FutureAGI handles agent hijacking

FutureAGI anchors agent-hijacking review on eval:ActionSafety: the ActionSafety evaluator checks whether a proposed agent action is safe for the user intent, policy context, and current trajectory. Teams usually pair it with ToolSelectionAccuracy for wrong-tool choice and PromptInjection for hostile instructions entering through prompts, RAG chunks, tool output, browser pages, or MCP responses.

A real workflow looks like this. A LangChain account-support agent is instrumented with traceAI-langchain. The user asks for a summary of open invoices, but a retrieved note says to ignore policy and email a full account export. The trace records agent.trajectory.step, tool.name, tool.arguments, source chunk id, prompt version, and route. Before the proposed send_email or crm.export call executes, Agent Command Center applies a post-guardrail that runs ActionSafety. If the action conflicts with the user’s intent or policy, the route blocks execution, emits a guardrail decision on the trace, and returns a fallback that asks for confirmation or human review.

FutureAGI’s approach is action-level, not answer-level. Unlike Ragas faithfulness-style checks, which mostly ask whether generated text matches retrieved context, agent hijacking needs evidence about authority, tool choice, arguments, policy approval, and side effects. The engineer’s next move is concrete: add the blocked trace to a regression dataset, replay sibling traces with the same source corpus, tighten tool permissions, and fail release candidates when unsafe-action rate or post-guardrail block rate exceeds the reviewed threshold.

How to measure or detect it

Measure hijacking where control changes, not only where text is generated:

  • ActionSafety — returns a safety judgment for a proposed action against user intent, policy, and trajectory context.
  • ToolSelectionAccuracy — catches cases where the agent chose a powerful or irrelevant tool even if the final response looked acceptable.
  • PromptInjection — flags hostile instructions entering through user prompts, retrieved content, tool output, and browser context.
  • Trace fields — inspect agent.trajectory.step, tool.name, tool.arguments, source chunk id, prompt version, approval state, and route decision.
  • Dashboard signals — track unsafe-action rate, post-guardrail-block-rate, privileged-tool-call rate, tool-call fan-out, and escalation rate by cohort.
from fi.evals import ActionSafety

evaluator = ActionSafety()
result = evaluator.evaluate(
    input="User asks only for a summary of open invoices.",
    output='{"tool": "send_email", "args": {"attachment": "account_export.csv"}}'
)
print(result.score, result.reason)

Alert on sharp deltas by connector, customer, prompt version, and model route. One hijacked write action can matter more than hundreds of blocked low-risk chat attempts.

Common mistakes

Most agent hijacking incidents come from treating control as a prompt preference instead of an enforceable boundary.

  • Scanning only the user message. Hijacking often enters through retrieved pages, files, memory, tool output, or MCP server responses.
  • Letting content rewrite policy. Untrusted text can inform the answer, but it must not define tool permissions or approval rules.
  • Scoring only the final answer. A polite answer can hide an unsafe tool call that already executed two steps earlier.
  • Giving memory writes low scrutiny. Poisoned memory can persist the hijack across future sessions and users.
  • Blocking without preserving trace evidence. Incident review needs source span, tool arguments, evaluator result, route, and policy verdict.

Frequently Asked Questions

What is agent hijacking?

Agent hijacking is an AI security failure where hostile context takes control of an agent's plan, tools, or memory and pushes it toward attacker-chosen actions.

How is agent hijacking different from prompt injection?

Prompt injection is often the attack vector. Agent hijacking is the resulting control failure: the agent follows hostile instructions, chooses unsafe tools, or mutates memory outside the user's intent.

How do you measure agent hijacking?

Use FutureAGI's ActionSafety on proposed actions, ToolSelectionAccuracy on tool choice, and PromptInjection at context boundaries. Track post-guardrail blocks by route, tool, and trace.