Guides

The Ultimate Guide to LLM Guardrails (2026)

A senior-engineer guide to LLM guardrails: placement, the 9 open-weight and 4 API backends, latency budgets, ensembles, and the precision/recall split that actually catches harm.

·
14 min read
llm-guardrails ai-safety prompt-injection protect ai-evaluation 2026
Editorial cover image for The Ultimate Guide to LLM Guardrails
Table of Contents

Most teams that ask “which guardrail model should I pick” are asking the wrong question. The model is the easy part. The harder questions are where the rail sits, how fast it decides, what it blocks, what it lets through, and how you measure the answer when the two failure modes look nothing alike.

A guardrail is a runtime control, not an eval. It enforces policy on live traffic in milliseconds, and it fails two ways: too loose (harmful content passes) and too strict (legitimate requests get blocked). A single F1 on a mixed test set hides that asymmetry, which is how teams ship guardrails that look fine in the lab and block one in fifteen real customers in production. This guide walks placement, the vendor landscape, the latency budget, the precision/recall split, and the architecture that makes inline enforcement land at scale.

TL;DR

DecisionThe right framing
PlacementFour rails (input, output, retrieval, tool-call), not three
BackendOpen-weight inline for the hot path, API for adjudication, ensemble for highest stakes
ScoringPrecision AND recall on adversarial AND benign sets, per category. Never a single F1
LatencySub-100ms p50 input, sub-150ms p50 output, or the guardrail becomes the product’s latency story
ArchitectureGateway plugin for org policy, SDK for app-specific rails, both shipping the same audit trail

The vendor landscape is crowded and the marketing pages all look the same. The differences show up in three places: how each model handles non-English traffic, how it handles benign edge cases, and how the system around it exposes per-tenant policy and audit. Pick on those axes.

Guardrails are runtime controls, not evals

Engineers who came up through ML score everything with an F1 and ship the model with the highest number. That works for an offline benchmark, not for a runtime control. An eval scores aggregate behavior on a labeled set; a guardrail makes a binary call on every request, where a wrong call is a customer-visible failure. An eval can wait minutes; a guardrail has milliseconds. An eval fails one way (the score drops); a guardrail fails two ways: a missed positive becomes an incident, a false positive becomes a churn metric.

Guardrails are policy enforcers with three real questions: what gets blocked, what passes, and how fast does the decision land. The eval is a tool for tuning those answers, not a substitute for asking them. Mature stacks share rubrics. The ai-evaluation SDK ships the same adapters as offline templates and as runtime classifiers, so production policy and the regression suite stay in sync. The scoring is still different, and pretending otherwise is the first mistake.

The four placement points

Most guides list three rails (input, output, retrieval). Three is the floor, not the ceiling.

Input rail. Runs on the user message before the model call. Catches jailbreaks, prompt injection, PII probes, malicious URLs, code-injection payloads, secrets leakage, topic violations. First line of defense, lowest latency budget.

Output rail. Runs on the model response before the gateway returns it. Catches toxic content, hallucinated medical or legal advice, leaked secrets in completions, off-topic responses, and the policy violations the model produced despite the input being clean. Higher budget than input because the user is already waiting.

Retrieval rail. Runs on retrieved chunks before they enter the prompt. Catches poisoned documents: the malicious instructions an attacker hid inside a crawled webpage, a support article, or a vector store entry that the model would otherwise treat as trusted context. Skipping this rail is how indirect prompt injection lands. RAG without a retrieval rail is open by default.

Tool-call rail. Runs on the arguments the model wants to pass to a tool before the tool executes. Catches secret exfiltration via tool args, SQL injection (the model emitting DROP TABLE), destructive shell commands, and prompt-injection success patterns that only show up at the tool boundary. Most “the agent did something it shouldn’t” incidents trace to a missing or weak tool-call rail.

Production stacks run all four with per-rail policies: same engine, different rubrics, different latency budgets, different blast radii. A pure input rail misses indirect injection from retrieved content; a pure output rail misses the tool call that already executed.

The vendor landscape: open-weight

The open-weight guardrail families are the inline-enforcement workhorses. The five that show up in production:

  • Llama Guard 3 (Meta, 8B / 1B). Strongest single-model option on multi-category content moderation, clean MLCommons-aligned taxonomy. The 8B is the quality pick, the 1B is the latency pick. Weak on prompt injection relative to Granite. LLAMAGUARD_3_8B, LLAMAGUARD_3_1B.
  • Qwen3-Guard (Alibaba, 8B/4B/0.6B). Multilingual coverage across 119 languages. Right pick when traffic is not English-dominant; 0.6B is the cheapest classifier in the stack and works well as a fast pre-filter. QWEN3GUARD_8B, QWEN3GUARD_4B, QWEN3GUARD_0.6B.
  • Granite Guardian (IBM, 8B / 5B). Leads on prompt-injection and hallucination categories. The 5B is the balanced default for budget-constrained deployments. GRANITE_GUARDIAN_8B, GRANITE_GUARDIAN_5B.
  • WildGuard (AI2, 7B). Highest precision on benign sets in our internal scoring: the model that over-blocks least on real customer support traffic. Trained on the WildGuardMix corpus (87K adversarial and benign). WILDGUARD_7B.
  • ShieldGemma (Google, 2B). Latency play when 8B is too expensive for the p50 budget. Pairs well as a fast pre-filter ahead of Llama Guard or Granite. SHIELDGEMMA_2B.

No single open-weight model is the answer. Llama Guard misses prompt injection patterns Granite catches; Granite over-blocks benign queries WildGuard handles cleanly; Qwen3-Guard outscores all of them on Hindi or Vietnamese. The right pattern is two models with non-overlapping strengths, ensembled with ANY for high-stakes categories, single-model for routine traffic.

The vendor landscape: API and commercial

  • OpenAI Moderation, Azure Content Safety. Hyperscaler endpoints, free or near-free, managed updates, network hop. Right for prototypes and as the second model in an ensemble. Not enough on their own for a regulated workload. OPENAI_MODERATION, AZURE_CONTENT_SAFETY.
  • Lakera Guard. Commercial inline classifier with prompt-injection and PII focus. Strong dashboards, clean API. Closed evaluation set is the limiting factor for teams that want to score it on their own corpus.
  • Guardrails AI. Open-source Python library plus a hub of community validators. Right pick when you want validator composition without writing the orchestration layer; validator quality is uneven (audit before adopting).
  • NeMo Guardrails (NVIDIA). Rail system built around Colang, a DSL for conversation flow. Good for complex multi-turn policy logic, heavier than a single classifier call. Right pick when the policy is the thing and your team is willing to learn a DSL.
  • Future AGI Turing Flash and Turing Safety. Future AGI’s hosted classifiers, exposed via the SDK alongside the open-weight models. Same Guardrails API, different backend. TURING_FLASH, TURING_SAFETY.

The commercial layer earns its keep on three axes: policy management (per-tenant rules without code changes), audit (the SOC 2 reviewer asks what blocked the output and why, and the answer is in response headers), and deployment (a gateway that swaps backends without redeploying applications). The classifier is one piece of a system.

Latency is a feature

A guardrail at 400ms p50 becomes the latency story for the whole product. The team disables it, the postmortem cites “operational impact,” the model never goes back on.

The budget that works:

  • Input rail: under 100ms p50, under 150ms p99.
  • Output rail: under 150ms p50, under 250ms p99.
  • Retrieval rail: budget folds into the retrieval call itself; under 80ms p50.
  • Tool-call rail: under 100ms p50, since it blocks an outbound side effect.

Three architecture moves keep the numbers honest. First, scanners before classifiers: the 8 sub-10ms scanners in the SDK (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) catch the deterministic patterns regex would have caught anyway, at single-digit ms. Don’t spend ML budget on patterns regex would catch. Second, small fast classifier ahead of large: ShieldGemma 2B or Qwen3-Guard 0.6B as a pre-filter, larger model only on the chunks the small model flagged. Third, parallel ensemble, not sequential: when you stack two models, run them concurrently and fail-fast on the first block. The latency is the slowest model, not the sum.

Future AGI Protect runs at 65ms text and 107ms image median time-to-label per arXiv 2510.13351. That is the budget that makes inline enforcement on every request economical. Above 200ms p50, the rail loses political support inside the product team.

The precision/recall split

The most common guardrail engineering mistake: scoring with a single F1 on a mixed test set.

A guardrail fails two ways. Too loose, harmful content passes (low recall on the adversarial set). Too strict, legitimate requests get blocked (low precision on the benign set). A mixed F1 averages the two failures into one number that hides which one your model has.

The right scoring matrix:

SetMetricBar
Adversarial (jailbreaks, prompt injection, harmful)Recall per category0.95+ for safety-critical, 0.90+ for routine
Benign (real customer traffic)Precision per category0.99+ on the benign set, never below 0.97
BothF1 per categoryReported, not optimized for

Why per category? Because aggregate scores hide the categories where the model collapses. A classifier with 0.96 average recall might score 0.85 on prompt-injection-via-encoding while scoring 0.99 on toxicity. The aggregate looks fine. The encoding attacks land.

Two test sets you need to maintain, both of them yours, not the vendor’s:

  • Adversarial set. 500 to 2,000 examples across your categories: jailbreaks from JailbreakBench, prompt injections from PromptInject and Garak, domain-specific attacks you collected from production. Score recall per category.
  • Benign set. 2,000 to 10,000 real customer queries, sampled from production traffic, manually verified as legitimate. Score precision per category. This is the set vendors don’t ship and that determines whether your model is usable.

The ai-evaluation SDK ships the templates and the test-case primitives to wire both sets into CI. The gate fails the build when recall drops on adversarial, when precision drops on benign, or when either drops by more than 2 percentage points from the prior model version. Two thresholds, two sets, not one number.

The eval suite for guardrails

A guardrail is a runtime control, but it still gets evaluated, just with a different rubric. The eval suite that catches what the runtime can’t surface:

from fi.evals import Guardrails, RailType, AggregationStrategy
from fi.evals.guardrails import LLAMAGUARD_3_8B, GRANITE_GUARDIAN_8B

guard = Guardrails(rails=[{
    "rail_type": RailType.INPUT,
    "models": [LLAMAGUARD_3_8B, GRANITE_GUARDIAN_8B],
    "strategy": AggregationStrategy.ANY,
}])

# Adversarial set: expect block (track recall per category)
for case in adversarial_cases:
    if not guard.check_input(case.input).blocked:
        recall_failures.append((case.id, case.category))

# Benign set: expect allow (track precision per category)
for case in benign_cases:
    if guard.check_input(case.input).blocked:
        precision_failures.append((case.id, case.category))

Gate CI below threshold on either set. Re-run when you change the model, threshold, aggregation strategy, or prompt. The gate is the only place where a vendor swap surfaces as a real engineering decision instead of a config change someone forgot to test. For high-volume continuous scoring against production traffic, classifier-backed evaluators on the Future AGI Platform beat per-call LLM-judge economics, which is what makes “score every block in production” affordable.

The 4 aggregation strategies

When you stack multiple guardrails on the same input, the SDK exposes four ways to combine verdicts. ANY blocks if any model says unsafe (highest recall, more false positives; right for PII, jailbreak, CSAM). ALL blocks only if every model says unsafe (highest precision, more false negatives; right for borderline categories where over-blocking damages the product). MAJORITY blocks if most models say unsafe (balanced; right default when you have three or more models with comparable quality). WEIGHTED blocks if the weighted score exceeds threshold (tunable per category; right when one model is meaningfully more trusted on a particular axis).

Available as AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED. The strategy interacts with the scoring matrix above: ANY raises recall and drops precision, ALL does the inverse. Pick the strategy from your precision/recall numbers, not from a default.

Per-tenant policy, per-rail config

Multi-tenant products need per-tenant guardrail policies. Healthcare customers need HIPAA-grade PII rules; coding-assistant customers need aggressive secrets detection; children’s products need aggressive content moderation. One policy does not fit all tenants.

Three patterns. Per-tenant rail configs in the SDK: the Guardrails config is per-instance, instantiated per tenant or per request. Per-tenant pipeline mode in Protect: the gateway plugin runs adapters parallel (fail-fast concurrent) or sequential (early-rejection short-circuit) per tenant, plus per-tenant fail_open, per-check confidence threshold (default 0.8), and per-check action (block, warn, mask, log). Per-VK policy at the Agent Command Center: virtual keys carry policy bundles (rails, backends, aggregation, budgets) and application code never touches policy logic. Hot-swap a policy and every request after the swap honors the new rules without redeploying downstream.

Two architectures, both shipping

Inline guardrails live in one of two patterns, and most production stacks run both. SDK in-process runs in the application process via the ai-evaluation SDK. Pro: no extra network hop, no extra deployment surface. Con: tied to the application’s release cycle, harder to enforce org-wide policy, swaps require code changes. Gateway plugin runs in the API gateway between the application and the model provider. Pro: enforced on every request regardless of application code, swap-able without redeploying applications, central policy management, audit headers land in the trace tree. Con: extra deployment surface, slightly higher latency from the gateway hop.

Future AGI’s Agent Command Center is the gateway side: 33 guardrail scanners (18 built-in plus 15 external), 5-level hierarchical budgets (org / team / user / key / tag), per-tenant policy pipelines, and audit headers (x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-fallback-used, x-agentcc-routing-strategy, x-agentcc-guardrail-triggered) on every response. Run SDK for app-specific custom rails (domain restrictions, your own regex set), gateway for org-wide policy (PII, prompt injection, CSAM). The split lets each layer evolve at its own cadence.

Future AGI Protect: the runtime ML guardrail

Protect is Future AGI’s runtime guardrail model family. Four fine-tuned Gemma 3n LoRA adapters cover toxicity (content moderation), bias_detection (sexism, discrimination, stereotypes), prompt_injection (jailbreak, adversarial manipulation, system-prompt extraction), and data_privacy_compliance (PII detection, GDPR / HIPAA violations). A separate Protect Flash binary classifier handles the cheapest fast-path harmful/safe check.

Real numbers from the arXiv paper: median time-to-label of 65 ms text and 107 ms image. Native multi-modal (text, image, audio in a single model family).

Honest architecture note: Protect’s adapter weights are closed. The ML hop runs from api.futureagi.com/sdk/api/v1/eval/; the agentcc-gateway Go plugin self-hosts in your VPC and carries deterministic regex and lexicon fallbacks (18 PII entity types, 6 prompt-injection pattern categories) for zero-AI-credit usage and low-latency local rejection. The gateway is yours; the ML hop is hosted. Enterprise customers can deploy private vLLM instances under license; the weights themselves stay closed. If air-gapped open-weight guardrails are the requirement, the SDK ships nine of them.

The same adapters run offline as evaluation rubrics, so the production policy and the offline eval rubric stay in sync. Score what you block.

Streaming and the output rail

You cannot wait for the full output before scoring. Token-prefix scoring runs the check every N tokens and aborts the stream on failure (right default for PII and secrets). Post-chunk scoring runs per delta-chunk (right default for toxicity and off-topic). The SDK’s GuardrailProtectWrapper and the gateway’s StreamGuardrailChecker both support check_interval inspection with stop (cut the stream) or disclaimer (append warning) failure actions. A DROP TABLE payload or a script tag mid-stream gets caught before the downstream parser sees it.

Audit and compliance

Compliance audits ask “what blocked this output and why,” and your guardrail has to answer in milliseconds and log the answer for forensics. Three things to log per request: which rails ran (input / output / retrieval / tool-call and the order), which models scored (backend ID, version, score, threshold), and which verdict won (allowed / blocked / redacted with the aggregation strategy and the contributing scores).

The Agent Command Center ships the audit headers on every response, so the trail lives in the trace tree via traceAI (Apache 2.0, 50+ AI surfaces across Python, TypeScript, Java, C#, with a GUARDRAIL span kind and pluggable semantic conventions). The SOC 2 reviewer reads the trace, not a screenshot. The platform is SOC 2 Type II, HIPAA, GDPR, and CCPA certified.

Common guardrail mistakes

  • One backend everywhere. Llama Guard for content moderation, Granite for prompt injection, Qwen3-Guard for non-English. One model is a starting point, not a stack.
  • No scanners. Spending ML budget on patterns regex would catch in 2ms.
  • No retrieval or tool-call rail. RAG without a retrieval guardrail is open to indirect prompt injection; agents without a tool-call rail are open to side-effect exploits.
  • Latency unbudgeted. A guardrail at 400ms p50 becomes the latency story for the whole product and someone files a ticket to disable it.
  • One F1 on a mixed set. Score precision on benign and recall on adversarial, per category. The number you optimize against is the number you fail against in production.
  • Same policy for every tenant. Multi-tenant products need per-tenant rules.
  • No audit trail. Compliance asks “what blocked this” and you cannot answer.

How Future AGI ships the guardrail stack

Three components, one loop.

  • Future AGI Protect is the runtime ML guardrail. Four Gemma 3n LoRA adapters plus Protect Flash. 65ms text / 107ms image median time-to-label per arXiv 2510.13351. Closed weights, hosted at api.futureagi.com; the agentcc-gateway plugin self-hosts and carries deterministic regex and lexicon fallbacks for zero-AI-credit usage. Two-layer architecture, per-tenant pipeline mode.
  • ai-evaluation SDK (Apache 2.0). from fi.evals import Evaluator, Protect, Guardrails. 13 guardrail backends (9 open-weight: LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B; 4 API: OPENAI_MODERATION, AZURE_CONTENT_SAFETY, TURING_FLASH, TURING_SAFETY). 8 sub-10ms Scanners. RailType.INPUT/OUTPUT/RETRIEVAL plus AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED. The 60+ EvalTemplate classes reuse as offline rubrics, so the production policy and the regression suite share weights.
  • Agent Command Center is the deployment surface. 33 guardrail scanners (18 built-in plus 15 external). 6 native provider adapters plus OpenAI-compatible presets (20+ providers). 5-level hierarchical budgets. Per-tenant policy pipelines. Audit headers on every response. A 17 MB Go binary that self-hosts in your VPC with the OpenAI-compatible base URL https://gateway.futureagi.com/v1.

The loop: SDK scores in CI, gateway enforces in production, Protect runs the inline ML, the same adapters reuse offline, audit lands in the trace tree. The classifier is one piece; the system is the answer.

Frequently asked questions

What is an LLM guardrail, exactly?
An LLM guardrail is a runtime control that sits on the request path and decides, in milliseconds, whether to allow, block, redact, or rewrite a piece of content before it reaches the model or the user. It is not an eval. An eval scores behavior offline against a labeled set; a guardrail enforces policy online against live traffic. They share rubrics in mature stacks, but the failure modes differ. An eval that misses a positive is a regression to fix in the next sprint. A guardrail that misses a positive is an incident report by lunch.
Where in the request path do guardrails belong?
Four placement points, not three. Input rail runs on the user message before the model call (jailbreaks, prompt injection, PII probes). Output rail runs on the model response before the gateway returns it (toxic content, leaked secrets, off-topic generation). Retrieval rail runs on retrieved chunks before they enter the prompt (indirect prompt injection through poisoned documents). Tool-call rail runs on the arguments the model wants to pass to a tool before the tool executes (secret exfiltration, SQL injection, destructive actions). Production stacks run all four with per-rail policies; the tool-call rail is the one teams skip and regret.
Llama Guard vs Qwen3-Guard vs Granite Guardian vs WildGuard vs ShieldGemma. Which one do I pick?
None of them in isolation. Llama Guard 3 (8B and 1B) is strongest on multi-category content moderation and ships with a clean taxonomy. Qwen3-Guard covers 119 languages and is the right pick for non-English traffic. Granite Guardian (IBM, 8B and 5B) leads on prompt-injection and hallucination categories. WildGuard 7B from AI2 has the highest precision on benign sets in our internal scoring. ShieldGemma 2B is the latency play when 8B is too expensive. Pick two with non-overlapping strengths, ensemble them with ANY for high-stakes categories, single-model for routine.
What is the latency budget for an inline guardrail?
Sub-100ms p50 for the input rail, sub-150ms p50 for the output rail. Above 200ms p50, the guardrail becomes the latency story for the whole product and someone files a ticket to disable it. Future AGI Protect runs at 65ms text and 107ms image median time-to-label per arXiv 2510.13351, which is the budget that makes per-request inline enforcement economical instead of aspirational. Heuristic regex and lexicon fallbacks in the gateway plugin run in single-digit milliseconds and absorb the cases where the ML hop is unavailable or unbudgeted.
Why is a single F1 the wrong score for a guardrail?
Because a guardrail fails two different ways and a single F1 hides which. Too loose, harmful content passes (low recall on adversarial). Too strict, legitimate requests get blocked (low precision on benign, which the F1 won't surface if your test set is adversarial-heavy). The right scoring is precision AND recall, measured separately on an adversarial set and a benign set, per category. A model with 0.97 recall on jailbreaks and 0.99 precision on benign customer support is the bar. A model with 0.92 F1 on a mixed set could be the same model — or one that blocks one in fifteen real users.
Do I need a commercial guardrail like Lakera or Guardrails AI if I have Llama Guard?
Open-weight guardrail models give you the inline classifier. They do not give you the policy engine, the per-tenant config surface, the audit trail, the multi-modal coverage, or the streaming-aware decision loop. Commercial products like Lakera, Guardrails AI, and NeMo Guardrails wrap the classifier with that surface. Future AGI's split is two layers — Protect is the ML guardrail (closed weights, hosted at api.futureagi.com), Agent Command Center is the gateway with 33 scanners, per-tenant policies, and the audit headers that show up in your trace tree. You can run the open-weight model alone for a prototype; for production you want the layer above it.
How does Future AGI ship guardrails end to end?
Three components, one loop. Future AGI Protect is the runtime ML guardrail — four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus a Protect Flash binary classifier, 65ms text and 107ms image, weights closed, served from api.futureagi.com with deterministic regex and lexicon fallbacks running locally in the agentcc-gateway plugin. The ai-evaluation SDK (Apache 2.0) ships 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners, and the RailType / AggregationStrategy primitives that let you stack and route them. Agent Command Center is the deployment surface — 33 scanners on the gateway path (18 built-in plus 15 external), per-tenant pipelines, 5-level budgets, and audit headers on every response.
Related Articles
View all
The Comprehensive Guide to LLM Security (2026)
Guides

LLM security is four layers — input, output, retrieval, tool-call. Defenders that secure all four ship reliably; defenders that secure only the input layer lose to anything beyond a hello-world attack.

NVJK Kartik
NVJK Kartik ·
17 min