Compliance

What Are AI Guardrails?

Runtime rules that inspect LLM and agent traffic, then allow, block, redact, route, or escalate requests that violate configured policies.

What Are AI Guardrails?

AI guardrails are runtime policies that sit on top of LLM and agent traffic and decide what passes through. Each guardrail combines a detector — a fi.evals evaluator like PromptInjection or PII — with an action: allow, block, redact, route to a smaller model, fall back to a canned response, or escalate to a human reviewer. Guardrails run at well-defined boundaries: before the model call, after the response, around tool outputs, and over retrieved context. In FutureAGI, they are configured as pre-guardrail and post-guardrail chains inside Agent Command Center.

Why It Matters in Production LLM and Agent Systems

A model that is safe in evals is not automatically safe in production. The traffic that hits a live system carries adversarial prompts, hostile retrieved context, and tool outputs that drift weekly. Without guardrails, a prompt-injection payload pasted into chat reaches the planner. A retrieved chunk from a poisoned web page becomes context. A tool returns PII that gets copied verbatim into the final answer.

The pain shows up across roles. An ML engineer ships a new prompt and sees a 4% spike in policy violations the next day. An SRE watches p99 latency double after a noisy detector is added without a budget. A compliance lead is asked, mid-audit, “show me the request, the policy, the detector, the action, and the reviewer for this blocked event” and has nothing to surface. End users either hit false positives that look broken or hit silent leaks that look fine.

In 2026 agent stacks, the pressure compounds. A single user request can fan out into a planner, three tool calls, an MCP server hop, and a critique pass. A single moderation endpoint at the final response is too late — the harmful instruction already entered the loop at step two. Guardrails have to live at every transition where untrusted text becomes model context or where model output becomes an external action.

How FutureAGI Handles AI Guardrails

FutureAGI’s approach is to treat guardrails as composable runtime policies wired into Agent Command Center routes. Each route — say support-refund-agent — declares a pre-guardrail chain that runs before the upstream LLM call and a post-guardrail chain that runs after the response. Detectors come from the fi.evals library: ProtectFlash for low-latency prompt-injection screening, PromptInjection for deeper checks on suspicious content, PII for personal-data detection, ContentSafety for harmful output, and JSONValidation for schema enforcement.

Concretely: an engineer attaches a pre-guardrail chain of ProtectFlash plus PII to the route. Incoming user text and retrieved chunks (instrumented with traceAI-langchain) flow through the chain. If PII fires on a retrieved chunk, the gateway redacts the matching span, logs the source URL and policy version, and continues. If ProtectFlash fires on the user prompt, the gateway routes to a fallback response and emits an audit event with the request ID, the evaluator score, and the action taken. Post-response, ContentSafety and IsHarmfulAdvice run before the answer leaves the gateway.

Compared with a single-shot moderation filter at the chat-input edge — the pattern most LLM Guard-style libraries default to — this catches the RAG, browser, email, and tool-output cases where the user’s first message looked harmless. We’ve found that block-rate alone is a misleading metric; engineers should pair it with reviewer-sampled false-positive rate and p99 added latency per route.

How to Measure or Detect It

Treat guardrails as a runtime control system, not a one-time test:

  • ProtectFlash block-rate — low-latency prompt-injection screening on pre-guardrail paths.
  • PromptInjection failure rate — deeper signal for suspicious prompts, retrieved chunks, and tool outputs.
  • PII and ContentSafety fire-rate — privacy and safety failures per 1K requests, sliced by route, model, and prompt version.
  • Operational cost — added p99 latency, token-cost-per-trace, fallback rate, human-escalation rate.
  • Evidence quality — every block carries request ID, policy version, evaluator result, route action, reviewer outcome.
from fi.evals import ProtectFlash, PromptInjection, PII

checks = [ProtectFlash(), PromptInjection(), PII()]
results = [c.evaluate(input=request_text) for c in checks]
if any(r.score == "Failed" for r in results):
    decision = "block"

Also measure the negative space. A guardrail blocking 12% of traffic is more likely broken than vigilant. Sample blocks weekly, compute false-positive rate, and replay against a golden dataset of known prompt-injection, PII, and harmful-content cases.

Common Mistakes

  • Scanning only chat input. RAG chunks, browser text, email bodies, and tool outputs carry the risky payload more often than the first user message.
  • Blocking without preserving evidence. Incident review needs source, trace ID, evaluator result, policy version, and route action — log it the moment you block.
  • Letting write tools run before checks. Pre-guardrails should fire before external actions, not after the tool has already mutated state.
  • Ignoring false positives. A noisy guardrail is bypassed by product teams within a sprint, even if its security intent is correct.
  • Treating one moderation endpoint as a guardrail system. A real guardrail layer needs route policy, detector chains, ordered actions, and audit records — not one API call.

Frequently Asked Questions

What are AI guardrails?

AI guardrails are runtime policies for LLM and agent traffic. Each combines a detector (e.g., PromptInjection, PII, ContentSafety) with an action — block, redact, route, log, or escalate — applied before or after the model call.

How are AI guardrails different from an AI firewall?

A guardrail is one runtime check. An AI firewall is the gateway-level control plane that chains many guardrails, detectors, routes, and audit logs into a coordinated policy boundary.

How do you measure AI guardrails?

Track block-rate, false-positive rate, p99 added latency, and audit completeness per route. FutureAGI evaluators like ProtectFlash, PromptInjection, PII, and ContentSafety produce the underlying signal.