Research / Guardrails & Safety
Guardrails & Safety

Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

We introduce Protect, a natively multi-modal guardrailing model operating across text, image, and audio, achieving state-of-the-art performance across toxicity, sexism, data privacy, and prompt injection.

Karthik Avinash, Nikhil Pareek, Rishav Hada | | Future AGI Research

Abstract

The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability. We introduce Protect, a natively multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs for enterprise-grade deployment.

Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities.

Key Results

Protect achieves state-of-the-art performance across all four safety dimensions on text modality, surpassing existing open and proprietary models including WildGuard, LlamaGuard-4, and GPT-4.1:

  • Leads in accuracy for every safety category
  • Delivers strong Failed-class F1 scores critical for identifying safety violations
  • Performs on par with GPT-4.1 while exceeding it in several metrics including Prompt Injection detection
  • Median Time-to-Label of just 65ms for text and 107ms for image, enabling real-time safety decisions

Methodology

Multi-Modal Training Pipeline

Our system employs a teacher-assisted relabeling pipeline using Gemini-2.5-Pro to systematically improve label quality. Initial audits revealed many public safety dataset annotations were based on keyword tagging rather than contextual assessment, leading to high false positive rates.

For each data point, the teacher model generates a ‘thinking’ process and an ‘explanation’ for its classification, along with a final proposed label. The teacher model disagreed with approximately 21% of original labels, underscoring the significance of the relabeling effort.

Audio Synthesis and Augmentation

To address the scarcity of public labeled audio safety datasets, we synthesized a large-scale audio dataset using CosyVoice 2.0 TTS, seeded with ~200 reference speaker prompts covering diverse accents, speaking rates, and emotional tones. Background noise augmentation at varying SNR levels ensures robustness to real-world acoustic conditions.

Training Variants

We experimented with four output format variants:

  1. Vanilla Assistant - Classification label only
  2. Thinking Assistant - Reasoning process + label
  3. Explanation Assistant - Label + justification (selected for deployment)
  4. Comprehensive Assistant - Full reasoning pipeline

The Explanation Assistant variant offers the best balance of performance and explainability for production deployment.

Conclusion

Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems. By unifying text, image, and audio modalities under a common fine-tuning and annotation framework, Protect delivers broad coverage across four key safety categories while maintaining low latency suitable for real-time enterprise applications.

guardrails multi-modal safety toxicity prompt injection enterprise

Try Future AGI

Put this research into practice. Start for free.

Get started free