AI Evaluations

Company News

Protect: Trustworthy AI Guardrails for Enterprises

Protect: Trustworthy AI Guardrails for Enterprises

Protect: Trustworthy AI Guardrails for Enterprises

Protect: Trustworthy AI Guardrails for Enterprises

Protect: Trustworthy AI Guardrails for Enterprises

Protect: Trustworthy AI Guardrails for Enterprises

Protect: Trustworthy AI Guardrails for Enterprises

Last Updated

Oct 21, 2025

Oct 21, 2025

Oct 21, 2025

Oct 21, 2025

Oct 21, 2025

Oct 21, 2025

Oct 21, 2025

Oct 21, 2025

By

Rishav Hada
Rishav Hada
Rishav Hada

Time to read

14 mins

Table of Contents

TABLE OF CONTENTS

  1. Introduction

As Artificial Intelligence rapidly moves from labs to boardrooms, one question looms large - how do we keep large language models (LLMs) safe, reliable, and compliant in the real world?

From financial chatbots to healthcare assistants, enterprises rely on AI systems to handle sensitive data and high-stakes decisions. But without proper guardrails, these models can hallucinate, leak private information, or even be tricked into unsafe actions through prompt injection attacks.

That’s where Future AGI’s Protect comes in - a next-generation, multi-modal guardrailing framework built to make enterprise AI safe, explainable, and production-ready.

📃 Read the full research paper here.


  1. What Are AI Guardrails and Why Current Ones Fall Short

Guardrails act as the safety filters for AI systems. They monitor what goes in (user prompts) and what comes out (AI responses), ensuring compliance with company policies and regulatory standards.

However, most existing guardrails share three major weaknesses:

  • Text-only focus - They can’t handle images or audio, even though modern enterprises use voice assistants, visual search, and document understanding tools daily.

  • Lack of explainability - They flag issues but rarely explain why, which limits trust and makes auditing difficult.

  • Slow and fragmented systems - Chaining multiple external safety checks adds latency and complexity.

In industries like finance, healthcare, and public services, where regulations are strict and decisions have real-world impact, these gaps make legacy guardrails unfit for deployment.


  1. Introducing Protect: A Multi-Modal Guardrailing Stack

Protect is a first-of-its-kind, enterprise-grade guardrailing system designed to work seamlessly across text, image, and audio inputs - the full range of today’s multi-modal AI.

At its core, Protect combines three key innovations:

  1. Multi-Modal Safety Intelligence
    Protect doesn’t just scan text. It can analyze spoken conversations, screenshots, memes, or visual content to detect risks like toxicity, sexism, data leaks, or prompt injection attempts.

  2. Teacher-Assisted Annotation Pipeline
    A “teacher” model helps generate smarter, context-aware safety labels by reasoning about why something might be unsafe. This improves accuracy and interpretability - a big step up from basic keyword filters, enriching the dataset quality.

  3. Lightweight, Real-Time Performance
    Protect uses Low-Rank Adaptation (LoRA) fine-tuning to stay fast and efficient, making it ideal for on-device or cloud deployments with minimal latency.

In short, Protect gives enterprises a unified safety layer for all forms of AI interaction - from a chatbot message to a call center recording to a product image.


  1. Four Critical Safety Dimensions Your Business Needs

Protect focuses on four areas that matter most to businesses:

  1. Toxicity Detection Catches hate speech, harassment, and offensive language before it reaches your customers. This protects your brand reputation and creates safer customer interactions.

  2. Gender Bias Prevention Identifies and blocks sexist content or gender discrimination. This is crucial for maintaining inclusive workplace communications and customer-facing content.

  3. Data Privacy Protection Prevents accidental exposure of sensitive information like credit card numbers, social security numbers, medical records, or personal addresses. This helps you stay compliant with GDPR, CCPA, and other privacy regulations.

  4. Prompt Injection Defense Stops attackers from manipulating your AI systems through clever prompts designed to bypass safety rules. Think of it as protection against AI "hacking" attempts.


  1. Inside the Dataset: Teaching AI What “Unsafe” Means

A guardrail is only as good as the data it learns from. Protect’s team curated one of the most diverse multi-modal safety datasets to date, spanning:

  • Text datasets: (from sources like WildGuardTest, ToxicChat, and ToxiGen)

  • Image datasets: (including Hateful Memes, VizWiz-Priv, and graphical violence collections)

  • A large-scale, custom-synthesized Audio dataset

Each data point was carefully categorized under four safety dimensions: Toxicity, Sexism, Data Privacy, and Prompt Injection.

To create a unique audio dataset, our team synthesized the existing text samples using a sophisticated process. By systematically varying accents, emotions, and speaking rates, and adding realistic background noise, we built a dataset that teaches Protect to recognize crucial risk factors in a speaker’s tone - like sarcasm or anger - that plain text transcripts would miss. The diagram below illustrates our end-to-end pipeline for this process:


A key part of this pipeline was generating very specific commands to control the speech synthesis. Here are a few examples that show how we controlled the accent, emotion, and style for each audio clip:


  1. Smarter Labeling Through “Teacher-Assisted” Learning

Traditional safety datasets rely on keyword tagging - a method that often misclassified nuanced or context-dependent content. Protect fixes this using a teacher-assisted relabeling pipeline:

  • The teacher model first explains its reasoning (“thinking trace”) before suggesting a label (Passed/Failed).

  • Human reviewers then validated these automated suggestions through iterative audits, ensuring the final labels were high-quality, consistent, and accurate.

  • The result: fewer false positives and a dataset enriched with rationales for why something was unsafe.

This approach cut labeling disagreements by over 20%, meaning Protect’s training data became cleaner, more consistent, and better suited for regulated enterprise contexts where transparency is critical.


  1. Training the Guardrail: Four Specialized Safety Adapters

Rather than using one giant model for everything, Protect employs four small, specialized adapters, each fine-tuned for a specific safety task (toxicity, sexism, privacy, and prompt injection).

These adapters were trained under different configurations - some focusing purely on classification (“Vanilla”), while others generated reasoning or explanations before giving a verdict (“Thinking” and “Explanation” variants). Here’s what those different output formats look like in practice for a single prompt injection attempt:


Our results showed that different adapter styles excelled at different tasks. While the simple 'Vanilla' adapter was most effective for clear-cut violations like Prompt Injection, the 'Explanation' variant proved superior for nuanced categories like Sexism. This highlights the importance of explainability, which is critical for audit trails and building trust in enterprise settings.


  1. Results: Beating the Best (Even GPT-4.1)

Protect’s performance on benchmark tests was exceptional. When we compared its ability to catch critical safety violations against other models, the results were clear:


Protect proved to be highly competitive with proprietary giants like GPT-4.1 and even outperformed them in critical enterprise scenarios. It was significantly better at catching prompt injection and privacy violations-two of the most difficult enterprise safety challenges.

And it did all this while staying lightweight enough for real-time deployment. Protect’s average decision latency was around 67 ms for text and 109 ms for images, making it one of the fastest guardrails for production workloads.


  1. Why This Matters for Enterprises

AI guardrails aren’t just a security feature anymore, they’re a business requirement. As companies move toward AI-powered automation, voice agents, and data-driven insights, the need for trust, explainability, and auditability grows exponentially.

Protect delivers on all three fronts:

  • Native multi-modal coverage with text, image, and audio in one stack

  • Real-time safety checks without slowing applications

  • Transparent explanations for every decision

  • Open-source LoRA adapters trained exclusively on our text dataset, providing a transparent benchmark for text modality safety.

This makes Protect especially relevant for regulated industries like finance, healthcare, government, and education, where a single misstep in data or compliance can have massive consequences.


  1. . Real Business Applications

  • Customer Service Centers Monitor voice calls and chat messages in real-time to ensure quality interactions and catch potential PR disasters before they escalate.

  • Content Moderation Platforms Automatically screen user-generated content across text, images, and video for social media platforms, forums, or review sites.

  • Healthcare Communications Protect patient privacy by automatically detecting and redacting PHI (Protected Health Information) in transcriptions, chat logs, and documentation.

  • Financial Services Prevent accidental disclosure of account numbers, SSNs, and other sensitive financial data in customer communications.

  • HR and Workplace Tools Maintain inclusive workplace communications by detecting and flagging biased or discriminatory language in emails, chat systems, and documents.


Conclusion

As AI becomes more sophisticated and integrated into business operations, safety can't be an afterthought. Protect represents the next generation of AI guardrails- comprehensive, multimodal, and built for real-world enterprise deployment.

The system's combination of multi-modal coverage, low latency, high accuracy, and explainable decisions makes it ideal for regulated industries where both performance and auditability matter.

With businesses facing increasing regulatory scrutiny around AI systems, having robust guardrails isn't just about avoiding problems - it's about building trust with customers and demonstrating responsible AI deployment. Our team is committed to the ongoing improvement of Protect, continuously expanding its knowledge base to better handle complex and nuanced content.


Ready to Secure Your AI Systems?

Protect your business with enterprise-grade AI safety. The text-based models are available as open source, allowing your team to evaluate and integrate them into your existing systems.

Learn more about Protect and access the models from HuggingFace. See how multi-modal guard railing can give you confidence in your AI deployments.

Get started or Contact our team for enterprise deployment support, custom training, or integration consulting. Let's build safer AI systems together.

Table of Contents

Table of Contents

Table of Contents

Rishav Hada is an Applied Scientist at Future AGI, specializing in AI evaluation and observability. Previously at Microsoft Research, he built frameworks for generative AI evaluation and multilingual language technologies. His research, funded by Twitter and Meta, has been published in top AI conferences and earned the Best Paper Award at FAccT’24.

Rishav Hada is an Applied Scientist at Future AGI, specializing in AI evaluation and observability. Previously at Microsoft Research, he built frameworks for generative AI evaluation and multilingual language technologies. His research, funded by Twitter and Meta, has been published in top AI conferences and earned the Best Paper Award at FAccT’24.

Rishav Hada is an Applied Scientist at Future AGI, specializing in AI evaluation and observability. Previously at Microsoft Research, he built frameworks for generative AI evaluation and multilingual language technologies. His research, funded by Twitter and Meta, has been published in top AI conferences and earned the Best Paper Award at FAccT’24.

Related Articles

Related Articles

future agi background
Background image

Ready to deploy Accurate AI?

Book a Demo
Background image

Ready to deploy Accurate AI?

Book a Demo
Background image

Ready to deploy Accurate AI?

Book a Demo
Background image

Ready to deploy Accurate AI?

Book a Demo
Background image

Ready to deploy Accurate AI?

Book a Demo
Background image

Ready to deploy Accurate AI?

Book a Demo
Background image

Ready to deploy Accurate AI?

Book a Demo
Background image

Ready to deploy Accurate AI?

Book a Demo