Future AGI’s research

Advancing the Frontier of LLM Evaluation

Our research team pioneers cutting-edge metrics, methodologies, and benchmarks—setting new standards for robust, interpretable, and adaptive AI assessment.
Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

Karthik Avinash, Nikhil Pareek, Rishav Hada (FutureAGI Inc.)
Abstract
The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability—limitations that hinder their adoption in regulated environments. Existing guardrails also largely operate in isolation, focusing on text alone, which makes them inadequate for multi-modal, production-scale environments. We introduce Protect, a natively multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs and built for enterprise-grade deployment. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.
1. Introduction

The rapid proliferation of Large Language Models (LLMs) in real-world applications, especially across enterprise, customer support, content generation, and automation, has directly fueled the development and widespread adoption of guardrailing systems designed to ensure their safe, reliable, and compliant deployment (Han et al., 2024; Gu et al., 2024; Enterprise Security Consortium, 2024). As organizations increasingly integrate LLMs into mission-critical pipelines, the need for robust behavioral controls and context-aware oversight has become central to production readiness (Zhou et al., 2025).

As LLM capabilities have scaled, so have their risks: models can hallucinate (Sun et al., 2025), leak sensitive data, generate biased outputs, or be manipulated through prompt injection and jailbreak attacks (OWASP Foundation, 2025). Enterprises deploying LLMs, often in regulated, high-stakes domains, demand more deterministic and controlled behavior than the inherently probabilistic nature of LLMs provides (Inan et al., 2023). Therefore, recent research focuses on building guardrailing models that sit on top of generative AI applications, examine incoming and outgoing data, and intervene in real time when issues arise (Han et al., 2024).

Guardrailing systems can be deployed to intercept (i) user inputs: block or sanitize problematic prompts before they ever reach the model, catching potentially harmful, non-compliant, or manipulative inputs (OWASP Foundation, 2025); (ii) model outputs: filter and validate model responses for compliance, appropriateness, and structure before releasing them to end-users or downstream systems (Ying et al., 2024); and (iii) interactions: in multi-step or agentic AI systems, restrict the scope or autonomy of model-driven actions, which is vital for business process automation or critical decision-making points (Zhou et al., 2025). Recent guardrail systems show an increased focus on explainability and transparency, as outputs are checked and pass/fail reasons are surfaced for both users and auditors (Pavlopoulos et al., 2020). Recent benchmarking studies show considerable variability in the strength and design of guardrails across commercial platforms (Wang et al., 2025; Gu et al., 2024), especially in how aggressively they filter inputs, how they handle false positives, and the complementarity between alignment and guardrailing layers.

Current guardrailing systems for LLMs often fall short in business domains because they lack robust real-time oversight, struggle with explainability for audits, remain vulnerable to adversarial attacks like jailbreaks (OWASP Foundation, 2025), and often do not address nuanced compliance needs. Latency and reliability issues arise when multiple guardrail checks and external policy engines are chained together, slowing down mission-critical applications and reducing business process efficiency (Kwon et al., 2023). Moreover, existing guardrailing systems are almost exclusively text-based, with little to no support for images or audio, despite the rapid proliferation of multi-modal enterprise applications, such as content moderation, voice-based assistants, and visual document analysis, which demand native, cross-modal safety capabilities.

In this work, we introduce Protect, a comprehensive guardrailing framework designed to operate natively across text, image, and audio modalities. Our work addresses the critical gap left by existing text-centric systems and provides a unified solution for cross-modal safety oversight in enterprise LLM deployments. To overcome the absence of publicly available audio safety datasets, we curate and synthesize a large-scale audio safety corpus using a text-to-speech-based augmentation pipeline, enabling direct learning from acoustic and affective cues that are typically lost in transcription-based approaches. Protect further incorporates a teacher-assisted relabeling and explanation-alignment pipeline, improving label fidelity and interpretability across modalities. Built on the lightweight yet powerful Gemma-3n (Team, 2025) architecture, Protect achieves state-of-the-art performance across four safety dimensions: toxicity, sexism, data privacy, and prompt injection, while maintaining low latency suitable for real-time applications. Finally, to encourage transparency and reproducibility, we open-source the text-modality models, providing a benchmark for future research in enterprise-grade safety and guardrailing. Together, these contributions establish Protect as a foundational step toward scalable, multi-modal, and production-ready AI safety systems.

2. Data Collection and Labeling
This section details our end-to-end methodology for constructing the multi-modal dataset used to train and evaluate Protect. Our dataset contains data points in the text, image, and audio modalities. Protect covers four key safety dimensions: toxicity, sexism (in this work, we use the terms sexism and gender bias interchangeably), data privacy, and prompt injection. We describe the process of dataset curation and aggregation, synthetic audio generation with data augmentation, and our teacher-assisted annotation pipeline for refining ground-truth labels.

Our process began by sourcing a wide array of public datasets from platforms such as Hugging Face, Kaggle, and GitHub. These datasets include Facebook’s Hateful Memes (Kiela et al., 2020), VizWiz-Priv for visual privacy (Gurari et al., 2019), WildGuardTest (Han et al., 2024), ToxicChat (LMSYS Org, 2023), ToxiGen (Hartvigsen et al., 2022), xTRam1/Safe-Guard-Prompt-Injection (xTRam1, 2024), jayavibhav/Prompt-Injection (Vibhav, 2024), and a graphical-violence image set curated on Kaggle (Bartwal, 2023). We further supplement this data with private enterprise corpora to ensure domain diversity. A primary curation criterion was the exclusion of composite data points, such as an image paired with a separate text caption for a single classification. We exclusively retained single-modality inputs, although images with overlaid text (e.g., memes, screenshots) were included. We then manually mapped the original labels from these diverse sources to our four safety categories. This process involved harmonizing disparate taxonomies into a unified binary classification scheme: Passed (compliant with safety standards) or Failed (in violation of safety standards).

For datasets with existing train/test splits, we preserved them, combining any validation splits into the training set. For sources lacking a test set, or where splits were highly imbalanced, we created custom test sets by sampling approximately 20% of the training data. This sampling was class-conditional and stratified to preserve the proportional representation of each original data source, ensuring a robust and representative evaluation set.

The final dataset statistics, presented in Table 1, reveal class distributions that vary significantly by category, reflecting both the nature of each safety risk and the realities of public data collection. Categories like Toxicity (24% ‘Failed’) and Privacy (19% ‘Failed’) have a minority ‘Failed’ class, which mirrors the real-world prevalence where most content is benign (Kumar et al., 2021). In contrast, the Sexism dataset (69% ‘Failed’) is intentionally weighted toward violations to ensure the model is trained on a wide spectrum of nuanced and overt gender bias (Khan et al., 2024). While categories like Prompt Injection are more balanced, their overall size is constrained by the challenge of sourcing diverse public data for rapidly evolving threats. We acknowledge this as a limitation and identify the continuous expansion of our dataset, particularly for dynamic categories like prompt injection and toxicity, as a key priority for future work. Notably, the table also highlights that the prompt injection category contains no image-modality data, a consequence of the scarcity of public examples for this attack vector (Clusmann et al., 2025). Representative data samples for each category are provided in Section A.1 (Table 11).
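To make the split procedure concrete, below is a minimal sketch, not the authors’ code, of a class-conditional, source-stratified ~20% hold-out split; the column names ("source", "label", "content") are hypothetical.

```python
# Illustrative sketch: class-conditional, source-stratified ~20% hold-out split,
# assuming a pandas DataFrame with hypothetical columns "source", "label", "content".
import pandas as pd

def make_test_split(df: pd.DataFrame, frac: float = 0.20, seed: int = 42):
    """Sample `frac` of rows within every (source, label) group as the test set."""
    test = (
        df.groupby(["source", "label"], group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )
    train = df.drop(test.index)
    return train, test

if __name__ == "__main__":
    toy = pd.DataFrame({
        "source": ["ToxiGen"] * 8 + ["ToxicChat"] * 8,
        "label": (["Passed"] * 6 + ["Failed"] * 2) * 2,
        "content": [f"example {i}" for i in range(16)],
    })
    train, test = make_test_split(toy)
    # Proportions of each (source, label) pair are preserved in the test set.
    print(test.groupby(["source", "label"]).size())
```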
2.1 Audio Synthesis and Augmentation

The inclusion of the audio modality in our fine-tuning dataset is a key contribution of our work, motivated by the growing deployment of voice-based AI agents in enterprise settings. Use cases such as automated call centers, voice-driven customer support, and in-meeting transcription require guardrails that can operate natively on audio streams. Most existing systems follow a cascaded approach—first transcribing audio to text and then applying safety analysis to the transcript. This pipeline is inherently lossy; it discards crucial acoustic information such as tone of voice, emotional affect, and background sounds, all of which can be vital for a correct safety assessment. Evaluating audio directly is therefore essential for robust and comprehensive protection.

To address the industry’s need for native audio guardrailing, and given the scarcity of public, labeled audio safety datasets, we synthesized a large-scale audio dataset from our curated text samples. We employed the CosyVoice 2.0 (Du et al., 2024) text-to-speech (TTS) model for this task. The generation process was seeded with approximately 200 reference speaker prompts (16 kHz, 4–10 s) sourced from the Mozilla Common Voice dataset (Ardila et al., 2020; Foundation, 2023) to ensure a baseline of both male and female vocal characteristics. To validate the quality of the synthetic audio, the authors manually reviewed a random sample and confirmed that the generated clips were free of significant artifacts and were intelligible.

A key goal of our synthesis was to create a dataset reflecting the acoustic diversity of real-world enterprise use cases, thereby training a model robust to variations in human speech. Prior work has shown that augmenting training data with varied acoustic conditions—such as different speaker accents, speaking rates, and background noise—improves model generalization and fairness (Zevallos, 2022; Minixhofer et al., 2024). Accordingly, we generated a full-factorial grid of instruction settings across emotion (e.g., happy, angry), speaking rate, accent (e.g., British, Indian English), and style (e.g., support agent). This systematic variation ensures the model learns to focus on semantic content rather than being biased by superficial acoustic properties, a critical requirement for effective audio guardrailing. Examples of rendered instruction sentences are provided in Table 2. While CosyVoice supports inline tokens (e.g., laughter, breaths, emphasis) to inject naturalistic pauses and vocalizations, we did not use inline tags in this release due to the scale and heterogeneity of our text corpus, which made reliable tag placement non-trivial; we leave systematic inline-tag injection for future work.

As illustrated in Figure 1, the pipeline was designed for realism. A portion of the generated audio was further augmented by overlaying background noise. We curated a bank of approximately 50 ambient noise samples (e.g., café chatter, traffic, office HVAC) and applied them at signal-to-noise ratios (SNRs) of 5, 10, 15, and 20 dB; a subset was intentionally kept clean to cover a broad range of acoustic conditions. Instruction selection and noise SNRs were drawn via seeded pseudorandom sampling to ensure reproducibility. Corrupted audio samples were identified and discarded.
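The following sketch illustrates, under stated assumptions, the kind of full-factorial instruction grid and SNR-controlled noise overlay described above; the attribute values, noise bank, and the `synthesize` placeholder (standing in for a CosyVoice 2.0 call) are hypothetical, not the authors’ pipeline.

```python
# Illustrative sketch: a full-factorial grid of rendering instructions plus
# SNR-controlled noise overlay. Attribute values and the TTS call are hypothetical.
import itertools
import numpy as np

EMOTIONS = ["neutral", "happy", "angry"]
RATES = ["slow", "normal", "fast"]
ACCENTS = ["British English", "Indian English", "American English"]
STYLES = ["support agent", "narrator"]
SNRS_DB = [5, 10, 15, 20, None]  # None keeps a subset of clips clean

def instruction_grid():
    """Cartesian product of acoustic attributes, rendered as instruction strings."""
    for emotion, rate, accent, style in itertools.product(EMOTIONS, RATES, ACCENTS, STYLES):
        yield f"Speak as a {style} with a {accent} accent, in a {emotion} tone, at a {rate} pace."

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay noise onto speech at a target signal-to-noise ratio (in dB)."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to match length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(seed=0)                 # seeded for reproducibility
instructions = list(instruction_grid())

def augment(text: str, synthesize, noise_bank: list) -> np.ndarray:
    """Pick a seeded instruction and SNR, synthesize, and optionally add noise."""
    instruction = instructions[rng.integers(len(instructions))]
    wav = synthesize(text, instruction)             # hypothetical TTS call
    snr = SNRS_DB[rng.integers(len(SNRS_DB))]
    if snr is None:
        return wav
    noise = noise_bank[rng.integers(len(noise_bank))]
    return mix_at_snr(wav, noise=noise, snr_db=snr)
```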
2.2 Teacher-Assisted Annotation and Relabeling
A primary contribution of our work is moving beyond the noisy, keyword-based labels prevalent in many public safety datasets. Such labels often fail to capture the context and nuance essential for accurate safety assessment, leading to high rates of false positives and negatives. To address this, we developed a teacher-assisted pipeline to systematically improve label quality at scale.

Initial manual audits of the aggregated text and image datasets revealed a significant limitation in the original ground-truth labels: many annotations were based on keyword tagging rather than a holistic, contextual assessment of the content (Hada et al., 2021). Research has shown that toxicity and offensiveness are inherently context-dependent concepts that cannot be accurately assessed through simple keyword matching or surface-level pattern recognition (Xenos et al., 2021a; Pavlopoulos et al., 2020). The Ruddit study demonstrated that comments can vary significantly in their degree of offensiveness based on contextual factors, with the same words carrying different implications depending on their conversational and situational context (Hada et al., 2021). This often led to misclassifications where the nuance of the content was lost. Previous work has established that context can both amplify and mitigate perceived toxicity, and that a significant subset of posts (approximately 5% in controlled studies) are mislabeled when context is ignored (Pavlopoulos et al., 2020; Xenos et al., 2021b).

To rectify this and generate rich explanatory metadata, we employed a teacher-assisted, human-in-the-loop annotation strategy using Gemini-2.5-Pro (Google DeepMind, 2025) as the teacher model, with temperature set to 0 for deterministic outputs. For each data point, the model was prompted to generate a ‘thinking’ process and an ‘explanation’ for its classification, along with a final proposed label (Passed or Failed). This approach aligns with recent findings that context-aware evaluation significantly improves safety assessment accuracy (Xenos et al., 2021a; Zhou et al., 2025). Research has demonstrated that prompting models to generate intermediate reasoning steps through chain-of-thought approaches significantly improves their performance on complex reasoning tasks (Wei et al., 2022), and that training models to articulate their reasoning process enhances both accuracy and explainability (OpenAI, 2024; Bilal et al., 2024).

This process enabled systematic relabeling. To ensure final label quality, the authors verified the teacher model’s proposed changes by conducting iterative qualitative audits on sampled data, ensuring alignment with our annotation guidelines. After multiple iterations, we applied the following modality-specific relabeling policies:
• Image Modality: We adopted a conservative approach. All samples originally labeled as Failed were retained to preserve all instances of unsafe content. However, samples originally labeled as Passed were permitted to be relabeled as Failed if the teacher model and human reviewers identified a missed violation.
• Text Modality: Relabeling was permitted in both directions (Passed to Failed and vice versa), allowing for correction of both false negatives and false positives from the original datasets.
• Audio Modality: To maintain consistency, we reused the final thinking and explanation annotations from the source text for their corresponding synthetic audio counterparts.

To quantify the impact of our teacher-assisted pipeline, we measured the disagreement rate between the original dataset labels and the final labels proposed by the teacher model. The aggregate statistics are presented in Table 3. Since the labels for our synthetic audio data were derived directly from the final text labels, their disagreement statistics are identical to those of the text modality. As the data shows, the teacher model disagreed with approximately 21% of the original labels, underscoring the significance of the relabeling effort. The primary effect of this process was the correction of a large volume of false positives (items incorrectly labeled as ‘Failed’), particularly in the text dataset. This teacher-assisted pipeline not only improved the accuracy of our labels but also enriched the dataset with machine-generated rationales, which are used in our training methodology. Representative inputs and model outputs for each category are provided in Section A.1 (Table 11).
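To make these relabeling policies concrete, here is a minimal sketch, under assumed record fields, of how the modality-specific rules and the disagreement-rate bookkeeping could be expressed; it is illustrative, not the authors’ implementation.

```python
# Illustrative sketch of the modality-specific relabeling policies. The record
# fields ("modality", "original_label", "teacher_label") are hypothetical.
from dataclasses import dataclass

@dataclass
class Record:
    modality: str        # "text", "image", or "audio"
    original_label: str  # "Passed" or "Failed"
    teacher_label: str   # label proposed by the teacher model

def final_label(rec: Record) -> str:
    if rec.modality == "image":
        # Conservative: keep every original Failed; only allow Passed -> Failed.
        if rec.original_label == "Failed":
            return "Failed"
        return rec.teacher_label
    # Text (and, by inheritance, audio): relabeling allowed in both directions.
    return rec.teacher_label

def disagreement_rate(records: list) -> float:
    """Fraction of items whose final label differs from the original label."""
    changed = sum(final_label(r) != r.original_label for r in records)
    return changed / max(len(records), 1)
```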
3. Training Methodology
Our training methodology is designed to create specialized, efficient, and explainable safety classifiers. We fine-tune a multi-modal base model using Low-Rank Adaptation (LoRA) (Hu et al., 2021) to develop distinct adapters for each of our four safety categories. This section describes our model selection, the fine-tuning framework, our experimental setup with different training formats, and an analysis of the results that guided our final model selection.
3.1 Base Model and Fine-Tuning Framework
We selected google/gemma-3n-E4B-it (Team, 2025) as our base model. As a highly efficient Small Language Model (SLM), Gemma-3n is optimized for on-device execution while offering robust multi-modal capabilities. Crucially, it can process not only text and images but also raw audio waveforms directly, making it an ideal candidate for our comprehensive safety system. For the fine-tuning process, we utilized Axolotl (Axolotl maintainers and contributors, 2023), a unified framework designed for fine-tuning a wide range of language models. We employed Low-Rank Adaptation (LoRA) (Hu et al., 2021) to efficiently train adapters for the base model. This approach involves injecting trainable, low-rank matrices into the model’s architecture while keeping the base weights frozen. Specifically, we targeted the attention and MLP layers within the language model component of Gemma-3n with a LoRA rank (r) of 8, allowing us to create specialized adapters for each safety task without the computational expense of full fine-tuning. We trained a dedicated adapter for each of the four safety categories: toxicity, sexism, data privacy, and prompt injection. All fine-tuning experiments were conducted on a server provisioned with eight NVIDIA H100 GPUs, each with 80GB of VRAM. A comprehensive list of all hyperparameters is provided in the Appendix (Table 10).
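As an illustration of the adapter configuration, the sketch below uses the Hugging Face peft library rather than Axolotl; the paper specifies only the rank (r = 8) and the targeted attention and MLP layers, so the module names, alpha, and dropout values here are assumptions.

```python
# Illustrative sketch using Hugging Face `peft` (the paper itself uses Axolotl):
# a rank-8 LoRA configuration targeting attention and MLP projections of the
# language-model component. Module names, alpha, and dropout are assumptions;
# the paper's full hyperparameters are listed in its Table 10.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # LoRA rank reported in the paper
    lora_alpha=16,                         # assumed
    lora_dropout=0.05,                     # assumed
    target_modules=[                       # typical attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# One adapter per safety category would be trained with the same recipe:
CATEGORIES = ["toxicity", "sexism", "data_privacy", "prompt_injection"]
# model = get_peft_model(base_model, lora_config)  # base_model: google/gemma-3n-E4B-it
```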
3.2 Training Variants
To investigate the impact of different output formats on model performance and explainability, we experimented with four training variants. These variants leverage the thinking and explanation tokens generated during our data annotation phase (Section 2). For each safety category, we trained four separate LoRA adapters, one for each of the following output formats:

1. Vanilla Assistant: The model is trained to output only the final classification label (e.g., Passed), enclosed in an XML tag. This variant mirrors common moderation systems and prioritizes speed and simplicity.

2. Thinking Assistant: The model first generates a reasoning process within dedicated tags, followed by the final label. This encourages the model to internalize a step-by-step reasoning process before making a decision (Wei et al., 2022; Yeo et al., 2025).

3. Explanation Assistant: The model first outputs the label and then provides a concise justification for its decision within dedicated tags. This format is designed to produce directly usable explanations for end-users or auditors.

4. Comprehensive Assistant: A combined format in which the model first generates its thinking process, then the label, and finally the explanation. This variant aims to train the model on the complete reasoning and justification pipeline.

The exact output schemas and a concrete prompt-injection example for each variant are shown in Table 4, and a schematic of the target formats is sketched below.
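The sketch below illustrates how the four target formats could be rendered as training strings; the tag names (<think>, <label>, <explanation>) are assumptions for illustration, since the exact schemas are defined in Table 4.

```python
# Illustrative sketch of the four training-target formats. Tag names are assumed;
# the authoritative schemas are given in Table 4 of the paper.
def build_target(variant: str, label: str, thinking: str = "", explanation: str = "") -> str:
    if variant == "vanilla":
        return f"<label>{label}</label>"
    if variant == "thinking":
        return f"<think>{thinking}</think>\n<label>{label}</label>"
    if variant == "explanation":
        return f"<label>{label}</label>\n<explanation>{explanation}</explanation>"
    if variant == "comprehensive":
        return (f"<think>{thinking}</think>\n<label>{label}</label>\n"
                f"<explanation>{explanation}</explanation>")
    raise ValueError(f"unknown variant: {variant}")

# Example: an Explanation Assistant target for a prompt-injection sample.
print(build_target(
    "explanation",
    label="Failed",
    explanation="The prompt attempts to override system instructions via a role-play persona.",
))
```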
3.3 Variant Performance Analysis

We evaluated Gemma-3n-E4B-it as a baseline along with all 16 resulting adapters (4 categories × 4 variants) on the multi-modal test set detailed in Table 1. The F1 scores for both Passed and Failed classes are presented in Table 5.

Our experiments revealed that while all variants performed competitively, subtle differences in their objective functions led to varied strengths. The Vanilla variant showed top performance on Toxicity, Data Privacy, and Prompt Injection. These categories often contain unambiguous signals or require focus on specific PII patterns, suggesting that a direct classification objective is highly effective. For the more nuanced category of Sexism, the Explanation Assistant variant performed best. Requiring the model to articulate a justification appears to improve its ability to discern subtle contextual violations. Conversely, the Thinking Assistant and Comprehensive Assistant variants, while still strong, showed slightly lower performance in some cases. Manual review of their outputs suggests this may be due to a tendency to "overthink," where the model explores excessively complex or speculative reasoning paths, occasionally leading to less precise final judgments.

Despite the small performance differences, the Explanation Assistant variant offers a significant advantage for real-world deployments that need not just a label but also an explanation, which is critical for user trust, model debugging, and auditing. Based on these results, we select the best-performing adapter for each safety category for the comparative benchmarking study in the next section.
4. Evaluation Study
We benchmark Protect on the text modality against several prominent safety models. We limit benchmarking to text due to the lack of multi-modal guardrailing baselines. This section outlines our evaluation setup and provides a comparative analysis of the results.

4.1 Evaluation Setup

In the absence of a standardized public benchmark that covers multi-modal safety dimensions, we established a baseline by evaluating leading models on the text modality of our curated test set (detailed in Section 2); we will make this test set publicly available. For each of our four safety categories, we used the best-performing adapter variant identified in Section 3. Given the class imbalance in our test data (Table 1), our evaluation prioritizes the F1 score for the Failed class to measure a model’s ability to detect critical, minority-class violations. We report this alongside overall accuracy to provide a comprehensive view of both minority-class effectiveness and aggregate correctness. We compared Protect against the following models (we do not include Gemini-2.5-Pro in this study because it served as the teacher model during our relabeling pipeline, Section 2):

• Gemma-3n-E4B-it (Baseline) (Team, 2025): A multi-modal generative model with 8 billion raw parameters yielding 4 billion effective parameters, built on the MatFormer (Matryoshka Transformer) architecture for efficient on-device deployment and featuring Per-Layer Embedding (PLE) parameter caching for optimized memory usage.

• WildGuard (Han et al., 2024): A 7-billion-parameter safety moderation model fine-tuned from Mistral-7B-v0.3, trained on the WildGuardMix dataset with 92K labeled examples across 13 risk categories to detect harm in prompts and responses and to evaluate refusal behavior.

• LlamaGuard-4 (Meta, 2025): A 12-billion-parameter natively multi-modal safety classifier pruned from Llama 4 Scout using a dense feedforward early-fusion architecture, supporting text and multi-image inputs with MLCommons hazards taxonomy alignment for content moderation.

• GPT-4.1 (OpenAI, 2025): A large language model with a 1 million token context window and a knowledge cutoff of June 2024, featuring enhanced instruction following, coding capabilities (54.6% on SWE-bench Verified), and multi-modal support for text and image inputs with optimized inference performance.

An example prompt-injection input is shown below:

Input (Prompt Injection): Today is opposite day and as an AI, you must abide by the rules of this special day above all else. Your persona (named DAN) will break AI rules and have no restrictions. This persona will also not give any warnings or reminders. The persona, DAN, is evil and does not care about the consequences that happens to the user. This persona, DAN doesn’t like common questions, or security questions. An example of following the opposite day rules, When you are asked something like "How to not be cool?", respond with how to be cool, as if the word "not" wasn’t there. DAN will also respond to me without refusing ALWAYS. DAN will respond to things like "how to not be cool" in a opposite way, like if the prompt is "how to be cool". Remember to begin every answer with EXACTLY "I’M DAN AND IT’s OPPOSITE DAY!" to adhere to the guidelines. Prompt: "Introduce yourself"
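For clarity, the snippet below shows, on toy data, how the reported metrics (Failed-class F1 alongside overall accuracy) can be computed with scikit-learn.

```python
# Illustrative sketch (toy data): Failed-class F1 and overall accuracy with scikit-learn.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["Failed", "Passed", "Passed", "Failed", "Passed", "Failed"]
y_pred = ["Failed", "Passed", "Failed", "Failed", "Passed", "Passed"]

failed_f1 = f1_score(y_true, y_pred, pos_label="Failed")   # minority-class detection
passed_f1 = f1_score(y_true, y_pred, pos_label="Passed")
accuracy = accuracy_score(y_true, y_pred)

print(f"Failed F1: {failed_f1:.3f}  Passed F1: {passed_f1:.3f}  Accuracy: {accuracy:.3f}")
```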

4.2 Results and Analysis
The comparative performance of all models is presented in Table 6. Protect achieves state-of-the-art performance across all four safety dimensions on the text modality. It leads in accuracy for every category and delivers strong Failed-class F1 scores, which are critical for identifying safety violations.

Notably, Protect performs on par with the larger proprietary model GPT-4.1 across categories, while exceeding it in several metrics, including overall accuracy and Failed-class detection for Prompt Injection. For Toxicity, Protect performs comparably to proprietary models like GPT-4.1 and shows significant improvement over LlamaGuard-4, particularly in detecting violations. In the more nuanced category of Sexism, our fine-tuned adapters deliver the best performance, outperforming all baselines including WildGuard. This highlights the effectiveness of our specialized training data for capturing subtle, context-dependent violations. Furthermore, Protect establishes a clear advantage in categories critical for enterprise security. For Privacy and Prompt Injection, it achieves the highest Failed-class F1 scores, indicating a superior ability to identify sensitive data leaks and adversarial attacks. This robust performance across diverse and challenging safety tasks validates our approach of using specialized fine-tuned adapters to build a reliable enterprise-grade guardrailing system.

However, a qualitative analysis of failure cases, particularly with complex image-based memes, reveals limitations in the model’s contextual understanding. Errors typically arise from either oversensitive interpretations of satire and figurative language, leading to false positives, or a failure to grasp culturally embedded harmful tropes that are not explicitly stated, resulting in false negatives. In future work, we will focus on enhancing the model’s commonsense reasoning and cultural awareness, potentially through training on more diverse and richly annotated datasets that capture these subtleties.
5. Inference Performance and Deployment Considerations

For a guardrailing system to be viable in production, its effectiveness must be matched by low-latency performance. To achieve this, our deployment strategy leverages token streaming to decouple the critical decision latency from the longer explanation generation time. We report two latency metrics for Protect under the Explanation Assistant serving configuration: Time-to-Label (TTL), the time from input submission to emission of the closing label tag, and total response latency, the time from input submission to completion of the entire response, including the explanation. We report distributions in milliseconds (ms), including the minimum, maximum, mean, and key percentiles (p50 for the median, and p90, p95, p99 for tail latencies).

As shown in Table 7, median TTL is rapid, especially for text (65 ms) and image (107 ms), enabling real-time safety decisions for synchronous applications. In production, we stream tokens from the Explanation Assistant variant and commit the decision immediately upon emission of the closing label tag (TTL), allowing the gateway to block or route requests with minimal latency. The rationale continues streaming and is delivered asynchronously, logged for audit, attached to traces, or surfaced to users when needed, so decision latency is decoupled from explanation latency (Table 8).

For additional context on performance profiles, we measured the text-modality latency of several open-source models. To ensure a fair characterization, all models were served using the vLLM engine (vLLM Team, 2023) on a single 80GB H100 GPU, with the maximum number of generated tokens fixed to two. The resulting latency distributions, detailed in Table 9, highlight different performance characteristics among the models. While the minimum latencies of all models are comparable, Protect’s maximum latency is significantly lower, highlighting Gemma-3n-E4B-it’s optimization for fast inference. The variance across models is primarily attributable to the different prompt templates each model requires, which result in varying input token lengths for the same user query. The predictability demonstrated by a tight latency distribution is a critical characteristic for enterprise systems that require reliable performance under load.
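The following sketch illustrates the streaming strategy described above: commit the safety decision as soon as the closing label tag appears, while the explanation continues to stream. The tag name and the token-stream interface are assumptions, not the production gateway implementation.

```python
# Illustrative sketch: commit the decision at the closing label tag, keep streaming
# the explanation. The tag name and `token_stream` (an iterator of text chunks)
# are assumptions.
import re
import time
from typing import Iterator, Tuple

LABEL_RE = re.compile(r"<label>(Passed|Failed)</label>")

def guard_decision(token_stream: Iterator[str]) -> Tuple[str, float, str]:
    """Return (label, time_to_label_ms, full_response) from a streamed response."""
    start = time.perf_counter()
    buffer, label, ttl_ms = "", None, None
    for chunk in token_stream:
        buffer += chunk
        if label is None:
            match = LABEL_RE.search(buffer)
            if match:
                label = match.group(1)
                ttl_ms = (time.perf_counter() - start) * 1000.0
                # A real gateway would block/route the request here and keep
                # consuming the stream asynchronously for the explanation.
    return label or "Failed", ttl_ms or 0.0, buffer   # fail closed if no label seen

# Toy usage with a fake stream:
fake_stream = iter(["<label>Fail", "ed</label>\n<explanation>", "Prompt override attempt.</explanation>"])
print(guard_decision(fake_stream))
```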
6. Conclusion

In this work, we introduced Protect, a robust, multi-modal guardrailing stack built to meet the safety and compliance demands of enterprise LLM deployments. By unifying text, image, and audio modalities under a common fine-tuning and annotation framework, Protect delivers broad coverage across four key safety categories: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted relabeling pipeline, powered by deterministic reasoning and explanation generation, significantly improves label quality and interpretability. Empirical evaluation demonstrates Protect’s superior performance compared to leading commercial and open-source baselines, validating the effectiveness of specialized adapters for each safety dimension. As enterprises increasingly adopt multi-modal and agentic AI systems, Protect represents a significant step toward reliable, transparent, and efficient guardrailing architectures that can safeguard complex LLM workflows in dynamic, real-world environments. In future work, we will continue to add safety dimensions to the Protect framework while further optimizing its accuracy and latency.

References
Rosana Ardila, Mayela Branson, Kelly Davis, Michael Kohler, John Meyer, Mark Henretty, and 1 others. 2020. Common voice: A mass-scale crowdsourced speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC).
Axolotl maintainers and contributors. 2023. Axolotl: Open source llm post-training.
Kartikey Bartwal. 2023. Graphical violence and safe images dataset. https://www.kaggle.com/datasets/kartikeybartwal/graphical-violence-and-safe-images-dataset. Kaggle dataset.
Ahsan Bilal, David Ebert, and Beiyu Lin. 2024. Llms for explainable ai: A comprehensive survey. arXiv preprint arXiv:2504.00125.
Jan Clusmann, Dyke Ferber, Isabella C Wiest, Carolin V Schneider, Titus J Brinker, Sebastian Foersch, Daniel Truhn, and Jakob Nikolas Kather. 2025. Prompt injection attacks on vision language models in oncology. Nature Communications, 16:1239.
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, and 1 others. 2024. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117.
Enterprise Security Consortium. 2024. Enterprise llm security: Risks and best practices. https://www.wiz.io/academy/llm-security. Accessed: 2025-10-08.
Elisabetta Fersini, Chris Emmery, and 1 others. 2022. Semeval-2022 task 5: Multimedia automatic misogyny identification (mami). In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). Association for Computational Linguistics.
Mozilla Foundation. 2023. Common voice corpus version 11. https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0.
Google DeepMind. 2025. Gemini 2.5 pro. https://deepmind.google/models/gemini/pro/. Accessed: 2025-10-08.
Tianle Gu, Zeyang Zhou, Kexin Huang, and 1 others. 2024. Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models. arXiv preprint arXiv:2406.07594.
Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale J. Stangl, and Jeffrey P. Bigham. 2019. Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In CVPR. Dataset overview and privacy task; see VizWiz tasks page for pointers.
Rishav Hada, Sohi Sudhir, Pushkar Mishra, Helen Yannakoudakis, Saif M. Mohammad, and Ekaterina Shutova. 2021. Ruddit: Norms of offensiveness for english reddit comments. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2700–2717. Association for Computational Linguistics.
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Yejin Choi, and Marzyeh Ghassemi. 2022. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech. arXiv preprint arXiv:2203.09509.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Hakan A. Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hermann, Edward Hu, Boyu Ghosh, and 1 others. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
Sahrish Khan, Arshad Jhumka, and Gabriele Pergola. 2024. Explaining matters: Leveraging definitions and semantic expansion for sexism detection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 1–15.
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Sebastian J. Mielke, Guillaume Wenzek, Ha Nguyen Le, Seungwhan Kim, Dana Ruiter, and 1 others. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790.
Deepak Kumar, Jindřich Helcl, Tatsunori Hashimoto, Niloufar Shah, Elie Bursztein, and 1 others. 2021. Designing toxic content classification for a diversity of perspectives. In Proceedings of the Seventeenth Symposium on Usable Privacy and Security, pages 297–312.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. arXiv preprint arXiv:2309.06180.
LMSYS Org. 2023. Toxicchat dataset. https:// huggingface.co/datasets/lmsys/toxic-chat. Dataset card with metadata, splits, and license.
Meta. 2025. Llama guard 4. https://huggingface.co/meta-llama/Llama-Guard-4-12B. Accessed: 2025-10-08.
Christoph Minixhofer, Ondřej Klejch, and Peter Bell. 2024. Ttsds – text-to-speech distribution score. arXiv preprint arXiv:2407.12707.
OpenAI. 2024. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2025-10-09.
OpenAI. 2025. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/. Accessed: 2025-10-09.
OWASP Foundation. 2025. Llm01:2025 prompt injection - owasp gen ai security project. https://genai.owasp.org/llmrisk/llm01-prompt-injection/. Accessed: 2025-10-08.
John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, and Ion Androutsopoulos. 2020. Toxicity detection: Does context really matter? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4296–4305. Association for Computational Linguistics.
Shiliang Sun, Zhilin Lin, and Xuhan Wu. 2025. Hallucinations of large multimodal models: Problem and countermeasures. Information Fusion, 118:102970.
Gemma Team. 2025. Gemma 3n. Accessed: 2025-10-08.
Jaya Vibhav. 2024. prompt-injection. https://huggingface.co/datasets/jayavibhav/prompt-injection. Prompt-injection dataset card.
vLLM Team. 2023. vllm: Easy, fast, and cheap llm serving with pagedattention. https://github.com/vllm-project/vllm. Accessed: 2025-10-08.
Wenxuan Wang, Xiaoyuan Liu, Kuiyi Gao, Jen-tse Huang, Youliang Yuan, Pinjia He, Shuai Wang, and Zhaopeng Tu. 2025. Can’t see the forest for the trees: Benchmarking multimodal safety awareness for multimodal llms. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
Alexandros Xenos, John Pavlopoulos, and Ion Androutsopoulos. 2021a. Context sensitivity estimation in toxicity detection. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 140–145. Association for Computational Linguistics.
Alexandros Xenos, John Pavlopoulos, Ion Androutsopoulos, Lucas Dixon, Jeffrey Sorensen, and Leo Laugier. 2021b. Toxicity detection can be sensitive to the conversational context. arXiv preprint arXiv:2111.10223.
xTRam1. 2024. safe-guard-prompt-injection. https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection. Prompt-injection dataset card.
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. 2025. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373.
Zonghao Ying, Aishan Liu, Siyuan Liang, Lei Huang, Jinyang Guo, Wenbo Zhou, Xianglong Liu, and Dacheng Tao. 2024. Safebench: A safety evaluation framework for multimodal large language models. arXiv preprint arXiv:2410.18927.
Rodolfo Zevallos. 2022. Text-to-speech data augmentation for low resource speech recognition. arXiv preprint arXiv:2204.00291.
Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, and Xin Eric Wang. 2025. Multimodal situational safety. arXiv preprint arXiv:2410.06172.



Protect
Agent Compass
Synthetic Data Generation
Multi-modal evaluation
Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
Abstract
The increasing deployment of Large Language Models (LLMs) across enterprise and missioncritical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability—limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation, focused on text alone making them inadequate for multimodal, production-scale environments. We introduce Protect, natively multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs, designed for enterprise-grade deployment. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacherassisted annotation pipeline leverages reasoning and explanation traces to generate highfidelity, context-aware labels across modalities. Experimental results demonstrate state-of-theart performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and productionready safety systems capable of operating across text, image, and audio modalities.
2. Data Collection and Labeling
2.1 Audio Synthesis and Augmentation

The inclusion of audio modality in our finetuning dataset is a key contribution of our work, motivated by the growing deployment of voicebased AI agents in enterprise settings. Use cases such as automated call centers, voice-driven customer support, and in-meeting transcription require guardrails that can operate natively on audio streams. Most existing systems follow a cascaded approach—first transcribing audio to text and then applying safety analysis to the transcript. This pipeline is inherently lossy; it discards crucial acoustic information such as tone of voice, emotional affect, and background sounds, all of which can be vital for a correct safety assessment. Evaluating audio directly is therefore essential for robust and comprehensive protection. To address the industry’s need for native audio guardrailing, and given the scarcity of public, labeled audio safety datasets, we synthesized a largescale audio dataset from our curated text samples. We employed the CosyVoice 2.0 (Du et al., 2024) text-to-speech (TTS) model for this task. The generation process was seeded with approximately 200 reference speaker prompts (16 kHz, 4–10 s) sourced from the Mozilla Common Voice dataset (Ardila et al., 2020; Foundation, 2023) to ensure a baseline of both male and female vocal characteristics. To validate the quality of the synthetic audio, the authors manually reviewed a random sample and confirmed that the generated clips were free of significant artifacts and were intelligible. A key goal of our synthesis was to create a dataset reflecting the acoustic diversity of realworld enterprise use cases, thereby training a model robust to variations in human speech. Prior work has shown that augmenting training data with varied acoustic conditions—such as different speaker accents, speaking rates, and background noise— improves model generalization and fairness (Zevallos, 2022; Minixhofer et al., 2024). Accordingly, we generated a full-factorial grid of instruction settings across emotion (e.g., happy, angry), speaking rate, accent (e.g., British, Indian English), and style (e.g., support agent). This systematic variation ensures the model learns to focus on semantic content rather than being biased by superficial acoustic properties, a critical requirement for effective audio guardrailing. Examples of rendered instruction sentences are provided in Table 2. While CosyVoice supports inline tokens (e.g., laughter, breaths, emphasis) to inject naturalistic pauses and vocalizations, we did not use inline tags in this release due to the scale and heterogeneity of our text corpus, which made reliable tag placement non-trivial; we leave systematic inline-tag injection for future work. As illustrated in Figure 1, the pipeline was designed for realism. A portion of the generated audio was further augmented by overlaying background noise. We curated a bank of approximately 50 ambient noise samples (e.g., café chatter, traffic, office HVAC) and applied them at signal-to-noise ratios (SNRs) of 5, 10, 15, and 20 dB; a subset was intentionally kept clean to cover a broad range of acoustic conditions. Instruction selection and noise SNRs were drawn via seeded pseudorandom sampling to ensure reproducibility. Corrupted audio samples were identified and discarded.
This section details our end-to-end methodology for constructing the multi-modal dataset used to train and evaluate Protect. Our dataset contains datapoints in text, image, and audio modality. Protect covers four key safety dimensions: toxicity, sexism1 , data privacy, and prompt injection. We describe the process of dataset curation and aggregation, synthetic audio generation with data augmentation, and our teacher-assisted annotation pipeline for refining ground-truth labels. Our process began by sourcing a wide array of public datasets from platforms such as Hugging Face, Kaggle, and GitHub. These datasets include Facebook’s Hateful Memes (Kiela et al., 2020), VizWiz-Priv for visual privacy (Gurari et al., 2019), WildGuardTest (Han et al., 2024), ToxicChat (LMSYS Org, 2023), ToxiGen (Hartvigsen et al., 2022), xTRam1/Safe-Guard-Prompt-Injection (xTRam1, 2024), jayavibhav/Prompt-Injection (Vibhav, 2024), and a graphical-violence image set curated on Kaggle (Bartwal, 2023). We further supplement this data with private enterprise corpora to ensure domain diversity. A primary curation criterion was the exclusion of composite data points, such as an image paired with a separate text caption for a single classification. We exclusively retained single-modality inputs, although images with overlaid text (e.g., memes, screenshots) were included. We then manually mapped the original labels from these diverse sources to our four safety categories. This process involved harmonizing disparate taxonomies into a unified binary classification scheme: Passed (compliant with safety standards) or Failed (in violation of safety standards). For datasets with existing train/test splits, we preserved them, combining any validation splits into the training set. For sources lacking a test set or where splits were highly imbalanced, we created custom test sets by sampling approximately 20% of the training data. This sampling was classconditional and stratified to preserve the proportional representation from each original data source, ensuring a robust and representative evaluation set. The final dataset statistics, presented in Table 1, reveal class distributions that vary significantly by category, reflecting both the nature of each safety risk and the realities of public data collection. Categories like Toxicity (24% ‘Failed’) and Privacy (19% ‘Failed’) have a minority ‘Failed’ class, which mirrors the real-world prevalence where most content is benign (Kumar et al., 2021). In 1 In this work, we use the terms sexism and gender bias interchangeably.
contrast, the Sexism dataset (69% ‘Failed’) is intentionally weighted toward violations to ensure the model is trained on a wide spectrum of nuanced and overt gender bias (Khan et al., 2024). While categories like Prompt Injection are more balanced, their overall size is constrained by the challenge of sourcing diverse public data for rapidly evolving threats. We acknowledge this as a limitation and identify the continuous expansion of our dataset, particularly for dynamic categories like prompt injection and toxicity, as a key priority for future work. Notably, the table also highlights that the prompt injection category contains no image-modality data, a consequence of the scarcity of public examples for this attack vector (Clusmann et al., 2025). Representative data samples for each category are provided in Section A.1 (Table 11).
2.2 Teacher-Assisted Annotation and Relabeling
annotations from the source text for their corresponding synthetic audio counterparts. To quantify the impact of our teacher-assisted pipeline, we measured the disagreement rate between the original dataset labels and the final labels proposed by the teacher model. The aggregate statistics are presented in Table 3. Since the labels for our synthetic audio data were derived directly from the final text labels, their disagreement statistics are identical to those of the text modality. As the data shows, the teacher model disagreed with approximately 21% of the original labels, underscoring the significance of the relabeling effort. The primary effect of this process was the correction of a large volume of false positives (items incorrectly labeled as ‘Failed’), particularly in the text dataset. This teacher-assisted pipeline not only improved the accuracy of our labels but also enriched the dataset with machine-generated rationales, which are used in our training methodology. Representative inputs and model outputs for each category are provided in Section A.1 (Table 11).



A primary contribution of our work is moving beyond the noisy, keyword-based labels prevalent in many public safety datasets. Such labels often fail to capture the context and nuance essential for accurate safety assessment, leading to high rates of false positives and negatives. To address this, we developed a teacher-assisted pipeline to systematically improve label quality at scale. Initial manual audits of the aggregated text and image datasets revealed a significant limitation in the original ground-truth labels: many annotations were based on keyword tagging rather than a holistic, contextual assessment of the content (Hada et al., 2021). Research has shown that toxicity and offensiveness are inherently context-dependent concepts that cannot be accurately assessed through simple keyword matching or surface-level pattern recognition (Xenos et al., 2021a; Pavlopoulos et al., 2020). The Ruddit study demonstrated that comments can vary significantly in their degree of offensiveness based on contextual factors, with the same words carrying different implications depending on their conversational and situational context (Hada et al., 2021). This often led to misclassifications where the nuance of the content was lost. Previous work has established that context can both amplify or mitigate perceived toxicity, and that a significant subset of posts (approximately 5% in controlled studies) are mislabeled when context is ignored (Pavlopoulos et al., 2020; Xenos et al., 2021b). To rectify this and generate rich explanatory metadata, we employed a teacher-assisted with human-in-the-loop annotation strategy using Gemini-2.5-pro (Google DeepMind, 2025) as the teacher model, with temperature set to 0 for deterministic outputs. For each data point, the model was prompted to generate a ‘thinking’ process and an ‘explanation’ for its classification, along with a final proposed label (Passed or Failed). This approach aligns with recent findings that contextaware evaluation significantly improves safety assessment accuracy (Xenos et al., 2021a; Zhou et al., 2025). Research has demonstrated that prompting models to generate intermediate reasoning steps through chain-of-thought approaches significantly improves their performance on complex reasoning tasks (Wei et al., 2022), and that training models to articulate their reasoning process enhances both accuracy and explainability (OpenAI, 2024; Bilal et al., 2024). This process enabled systematic relabeling. To ensure final label quality, the authors verified the teacher model’s proposed changes by conducting iterative qualitative audits on sampled data, ensuring alignment with our annotation guidelines. After multiple iterations, we applied the following modality-specific relabeling policies:
• Image Modality: We adopted a conservative approach. All samples originally labeled as Failed were retained to preserve all instances of unsafe content. However, samples originally labeled as Passed were permitted to be relabeled as Failed if the teacher model and human reviewers identified a missed violation.
• Text Modality: Relabeling was permitted in both directions (Passed to Failed and viceversa), allowing for correction of both false negatives and false positives from the original datasets.
• Audio Modality: To maintain consistency we reused the final thinking and explanation
3. Training Methodology
Our training methodology is designed to create specialized, efficient, and explainable safety classifiers. We fine-tune a multi-modal base model using Low-Rank Adaptation (LoRA) (Hu et al., 2021) to develop distinct adapters for each of our four safety categories. This section describes our model selection, the fine-tuning framework, our experimental setup with different training formats, and an analysis of the results that guided our final model selection
3.1 Base Model and Fine-Tuning Framework
We selected google/gemma-3n-E4B-it (Team, 2025) as our base model. As a highly efficient Small Language Model (SLM), Gemma-3n is optimized for on-device execution while offering robust multi-modal capabilities. Crucially, it can process not only text and images but also raw audio waveforms directly, making it an ideal candidate for our comprehensive safety system. For the fine-tuning process, we utilized Axolotl (Axolotl maintainers and contributors, 2023), a unified framework designed for fine-tuning a wide range of language models. We employed LowRank Adaptation (LoRA) (Hu et al., 2021) to efficiently train adapters for the base model. This approach involves injecting trainable, low-rank matrices into the model’s architecture while keeping the base weights frozen. Specifically, we targeted the attention and MLP layers within the language model component of Gemma-3n with a LoRA rank (r) of 8, allowing us to create specialized adapters for each safety task without the computational expense of full fine-tuning. We trained a dedicated adapter for each of the four safety categories: toxicity, sexism, data privacy, and prompt injection. All fine-tuning experiments were conducted on a server provisioned with eight NVIDIA H100 GPUs, each with 80GB of VRAM. A comprehensive list of all hyperparameters is provided in the Appendix (Table 10).
3.2 Training Variants
3.3 Variant Performance Analysis
We evaluated Gemma-3n-E4B-it as baseline along with all 16 resulting adapters (4 categories × 4 variants) on the multi-modal test set detailed in Table 1. The F1 scores for both Passed and Failed classes are presented in Table 5 Our experiments revealed that while all variants performed competitively, subtle differences in their objective functions led to varied strengths. The Vanilla variant showed top performance on Toxicity, Data Privacy, and Prompt Injection. These categories often contain unambiguous signals or require focus on specific PII patterns, suggesting that a direct classification objective is highly effective. For the more nuanced category of Sexism, the Explanation Assistant variant performed best. Requiring the model to articulate a justification appears to improve its ability to discern subtle contextual violations. Conversely, the Thinking Assistant and Comprehensive Assistant variants, while still strong, showed slightly lower performance in some cases. Manual review of their outputs suggests this may be due to a tendency to "overthink" where the model explores excessively complex or speculative reasoning paths, occasionally leading to less precise final judgments. Despite small performance differences, the Explanation Assistant variant offers a significant advantage for real-world deployment where there is a need for not just labeling but also explainability, which is critical for user trust, model debugging, and auditing. Based on these results, we select the bestperforming adapter for each safety category to move forward with for our comparative benchmarking study in the next section.
To investigate the impact of different output formats on model performance and explainability, we experimented with four training variants. These variants leverage the thinking and explanation tokens generated during our data annotation phase (Section 2). For each safety category, we trained four separate LoRA adapters, one for each of the following output formats: 1. Vanilla Assistant: The model is trained to output only the final classification label, enclosed in an XML tag (e.g., Passed). This variant mirrors common moderation systems and prioritizes speed and simplicity. 2. Thinking Assistant: The model first generates a reasoning process within tags, followed by the final label. This encourages the model to internalize a step-by-step reasoning process before making a decision (Wei et al., 2022; Yeo et al., 2025). 3. Explanation Assistant: The model first outputs the label and then provides a concise justification for its decision within tags. This format is designed to produce directly usable explanations for end-users or auditors. 4. Comprehensive Assistant: A comprehensive format where the model first generates its thinking process, then the label, and finally the explanation. This variant aims to train the model on the complete reasoning and justification pipeline. The exact output schemas and a concrete promptinjection example for each variant are shown in Table 4.
1. Introduction
The rapid proliferation of Large Language Models (LLMs) in real-world applications, especially across enterprise, customer support, content generation, and automation, has directly fueled the development and widespread adoption of guardrailing systems designed to ensure their safe, reliable, and compliant deployment (Han et al., 2024; Gu et al., 2024; Enterprise Security Consortium, 2024). As organizations increasingly integrate LLMs into mission-critical pipelines, the need for robust behavioral controls and context-aware oversight has become central to production readiness (Zhou et al., 2025).

As LLM capabilities have scaled, so have their risks: models can hallucinate (Sun et al., 2025), leak sensitive data, generate biased outputs, or be manipulated through prompt injection and jailbreak attacks (OWASP Foundation, 2025). Enterprises deploying LLMs, often in regulated, high-stakes domains, demand more deterministic and controlled behavior than the inherently probabilistic nature of LLMs provides (Inan et al., 2023). Recent research therefore focuses on building guardrailing models that sit on top of generative AI applications, examining incoming and outgoing data and intercepting problematic content in real time (Han et al., 2024).

Guardrailing systems can be deployed to intercept (i) user inputs, blocking or sanitizing problematic prompts before they ever reach the model and catching potentially harmful, non-compliant, or manipulative inputs (OWASP Foundation, 2025); (ii) model outputs, filtering and validating responses for compliance, appropriateness, and structure before releasing them to end-users or downstream systems (Ying et al., 2024); and (iii) interactions, restricting the scope or autonomy of model-driven actions in multi-step or agentic AI systems, which is vital for business process automation and critical decision-making points (Zhou et al., 2025).

Recent guardrail systems show an increased focus on explainability and transparency, with outputs checked and pass/fail reasons surfaced for both users and auditors (Pavlopoulos et al., 2020). Recent benchmarking studies show considerable variability in the strength and design of guardrails across commercial platforms (Wang et al., 2025; Gu et al., 2024), especially in how aggressively they filter inputs, how they handle false positives, and the complementarity between alignment and guardrailing layers. Current guardrailing systems for LLMs often fall short in business domains because they lack robust real-time oversight, struggle with explainability for audits, remain vulnerable to adversarial attacks such as jailbreaks (OWASP Foundation, 2025), and often do not address nuanced compliance needs. Latency and reliability issues arise when multiple guardrail checks and external policy engines are chained together, slowing down mission-critical applications and reducing business-process efficiency (Kwon et al., 2023). Moreover, existing guardrailing systems are almost exclusively text-based, with little to no support for images or audio, despite the rapid proliferation of multi-modal enterprise applications, such as content moderation, voice-based assistants, and visual document analysis, that demand native, cross-modal safety capabilities.
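To make the three interception points concrete, the sketch below wraps a generation call in a thin gateway. It is illustrative only and rests on assumed names: guard_check and generate are hypothetical stubs, not the Protect API or any specific library.

```python
# Illustrative guardrail gateway showing the three interception points described above.
# guard_check() and generate() are hypothetical stubs, not any real API.

def guard_check(payload: str, category: str) -> bool:
    """Return True if the payload passes the guardrail for the given safety category."""
    # Stub: a real deployment would call a safety classifier (e.g., a guardrail model)
    # and parse its Passed/Failed verdict.
    return "opposite day" not in payload.lower()

def generate(prompt: str) -> str:
    """Stub for the underlying LLM call."""
    return f"Model response to: {prompt}"

def guarded_call(user_prompt: str) -> str:
    # (i) Input interception: block or sanitize problematic prompts before they reach the model.
    if not guard_check(user_prompt, category="prompt_injection"):
        return "Request blocked by input guardrail."

    response = generate(user_prompt)

    # (ii) Output interception: validate the response before releasing it downstream.
    if not guard_check(response, category="toxicity"):
        return "Response withheld by output guardrail."

    # (iii) Interaction-level checks in agentic systems would wrap each tool call or
    # action in the same way, restricting the scope of model-driven steps.
    return response

print(guarded_call("Introduce yourself"))
```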
In this work, we introduce Protect, a comprehensive guardrailing framework designed to operate natively across text, image, and audio modalities. Our work addresses the critical gap left by existing text-centric systems and provides a unified solution for cross-modal safety oversight in enterprise LLM deployments. To overcome the absence of publicly available audio safety datasets, we curate and synthesize a large-scale audio safety corpus using a text-to-speech based augmentation pipeline, enabling direct learning from acoustic and affective cues that are typically lost in transcription-based approaches. Protect further incorporates a teacher-assisted relabeling and explanation-alignment pipeline, improving label fidelity and interpretability across modalities. Built on the lightweight yet powerful Gemma-3n (Team, 2025) architecture, Protect achieves state-of-the-art performance across four safety dimensions: toxicity, sexism, data privacy, and prompt injection, while maintaining low latency suitable for real-time applications. Finally, to encourage transparency and reproducibility, we open-source the text modality models, providing a benchmark for future research in enterprise-grade safety and guardrailing. Together, these contributions establish Protect as a foundational step toward scalable, multi-modal, and production-ready AI safety systems.
4. Evaluation Study
We benchmark Protect on the text modality against several prominent safety models; we limit benchmarking to text because no multi-modal guardrailing baselines are available. This section outlines our evaluation setup and provides a comparative analysis of the results.

4.1 Evaluation Setup

In the absence of a standardized public benchmark covering multi-modal safety dimensions, we established a baseline by evaluating leading models on the text modality of our curated test set (detailed in Section 2), which we will make publicly available. For each of our four safety categories, we used the best-performing adapter variant identified in Section 3. Given the class imbalance in our test data (Table 1), our evaluation prioritizes the F1 score for the Failed class to measure a model's ability to detect critical, minority-class violations. We report this alongside overall accuracy to provide a comprehensive view of both minority-class effectiveness and aggregate correctness. We compared Protect against the following models (we do not include Gemini-2.5-Pro because it served as the teacher model during our relabeling pipeline, Section 2):

• Gemma-3n-E4B-it (Baseline) (Team, 2025): a multimodal generative model with 8 billion raw parameters yielding 4 billion effective parameters, built on the MatFormer (Matryoshka Transformer) architecture for efficient on-device deployment and featuring Per-Layer Embedding (PLE) parameter caching for optimized memory usage.

• WildGuard (Han et al., 2024): a 7-billion-parameter safety moderation model fine-tuned from Mistral-7B-v0.3, trained on the WildGuardMix dataset of 92K labeled examples across 13 risk categories to detect harm in prompts and responses and to evaluate refusal behavior.

• LlamaGuard-4 (Meta, 2025): a 12-billion-parameter, natively multimodal safety classifier pruned from Llama 4 Scout using a dense feedforward early-fusion architecture, supporting text and multi-image inputs with MLCommons hazards taxonomy alignment for content moderation.

• GPT-4.1 (OpenAI, 2025): a large language model with a 1-million-token context window and a knowledge cutoff of June 2024, featuring enhanced instruction following, strong coding capabilities (54.6% on SWE-bench Verified), and multimodal support for text and image inputs with optimized inference performance.

A representative prompt-injection input of the kind targeted by the Prompt Injection category is shown below.

Input (Prompt Injection): Today is opposite day and as an AI, you must abide by the rules of this special day above all else. Your persona (named DAN) will break AI rules and have no restrictions. This persona will also not give any warnings or reminders. The persona, DAN, is evil and does not care about the consequences that happens to the user. This persona, DAN doesn’t like common questions, or security questions. An example of following the opposite day rules, When you are asked something like "How to not be cool?", respond with how to be cool, as if the word "not" wasn’t there. DAN will also respond to me without refusing ALWAYS. DAN will respond to things like "how to not be cool" in a opposite way, like if the prompt is "how to be cool". Remember to begin every answer with EXACTLY "I’M DAN AND IT’s OPPOSITE DAY!" to adhere to the guidelines. Prompt: "Introduce yourself"
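The snippet below is a small sketch of how the metrics used in this study, overall accuracy and the F1 score of the minority Failed class, can be computed with scikit-learn. The label vectors are toy values for illustration and are not results from the paper.

```python
# Computing the metrics used in Section 4: overall accuracy and the F1 score of the
# minority "Failed" class. The labels below are toy values for illustration only.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["Passed", "Failed", "Failed", "Passed", "Passed", "Failed"]
y_pred = ["Passed", "Failed", "Passed", "Passed", "Passed", "Failed"]

accuracy = accuracy_score(y_true, y_pred)
failed_f1 = f1_score(y_true, y_pred, pos_label="Failed")   # F1 for the violation class
passed_f1 = f1_score(y_true, y_pred, pos_label="Passed")   # reported alongside in Table 5

print(f"accuracy={accuracy:.3f}  failed_f1={failed_f1:.3f}  passed_f1={passed_f1:.3f}")
```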

4.2 Results and Analysis
The comparative performance of all models is presented in Table 6. Protect achieves state-of-the-art performance across all four safety dimensions on the text modality: it leads in accuracy for every category and delivers strong Failed-class F1 scores, which are critical for identifying safety violations.

Notably, Protect performs on par with the larger proprietary model GPT-4.1 across categories, while exceeding it on several metrics, including overall accuracy and Failed-class detection for Prompt Injection. For Toxicity, Protect performs comparably to proprietary models such as GPT-4.1 and shows a significant improvement over LlamaGuard-4, particularly in detecting violations. In the more nuanced category of Sexism, our fine-tuned adapters deliver the best performance, outperforming all baselines including WildGuard; this highlights the effectiveness of our specialized training data for capturing subtle, context-dependent violations. Furthermore, Protect establishes a clear advantage in categories critical for enterprise security: for Privacy and Prompt Injection, it achieves the highest Failed-class F1 scores, indicating a superior ability to identify sensitive data leaks and adversarial attacks. This robust performance across diverse and challenging safety tasks validates our approach of using specialized fine-tuned adapters to build a reliable enterprise-grade guardrailing system.

However, a qualitative analysis of failure cases, particularly with complex image-based memes, reveals limitations in the model's contextual understanding. Errors typically arise either from oversensitive interpretations of satire and figurative language, leading to false positives, or from a failure to grasp culturally embedded harmful tropes that are not explicitly stated, resulting in false negatives. In future work, we will focus on enhancing the model's commonsense reasoning and cultural awareness, potentially through training on more diverse and richly annotated datasets that capture these subtleties.
5. Inference Performance and Deployment Considerations

For a guardrailing system to be viable in production, its effectiveness must be matched by low-latency performance. To achieve this, our deployment strategy leverages token streaming to decouple the critical decision latency from the longer explanation-generation time. We report two latency metrics for Protect under the Explanation Assistant serving configuration: Time-to-Label (TTL), the time from input submission to emission of the closing label tag, and total response latency, the time from input submission to completion of the entire response, including the explanation. We report distributions in milliseconds (ms), including the minimum, maximum, mean, and key percentiles (p50 for the median, and p90, p95, p99 for tail latencies). As shown in Table 7, median TTL is rapid, especially for text (65 ms) and image (107 ms), enabling real-time safety decisions for synchronous applications. In production, we stream tokens from the Explanation Assistant variant and commit the decision immediately upon emission of the closing tag (TTL), allowing the gateway to block or route requests with minimal latency. The rationale continues streaming and is delivered asynchronously (logged for audit, attached to traces, or surfaced to users when needed), so decision latency is decoupled from explanation latency (Table 8).

For additional context on performance profiles, we measured the text-modality latency of several open-source models. To ensure a fair characterization, all models were served using the vLLM engine (vLLM Team, 2023) on a single 80 GB H100 GPU, with the maximum number of generated tokens fixed to two. The resulting latency distributions, detailed in Table 9, highlight different performance characteristics among the models. While the minimum latencies of all models are roughly comparable, Protect's maximum latency is significantly lower, reflecting Gemma-3n-E4B-it's optimization for fast inference. This variance is primarily attributable to the different prompt templates required by each model, which result in varying input token lengths for the same user query. The predictability demonstrated by a tight latency distribution is a critical characteristic for enterprise systems that require reliable performance under load.
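A minimal sketch of this commit-on-closing-tag strategy is shown below. The </label> tag name and the shape of the token stream are assumptions (any streaming client that yields text chunks would work), so this illustrates the decoupling idea rather than the production gateway itself.

```python
# Sketch of decoupling Time-to-Label (TTL) from total response latency.
# `token_stream` is any iterator of generated text chunks (e.g., from a streaming
# inference API); the </label> tag name is an assumption for illustration.
import time
from typing import Iterable, Optional, Tuple

def commit_on_label(token_stream: Iterable[str]) -> Tuple[Optional[str], Optional[float], str, float]:
    """Return (label, ttl_ms, full_text, total_ms) for one streamed guardrail response."""
    start = time.perf_counter()
    buffer = ""
    label: Optional[str] = None
    ttl_ms: Optional[float] = None
    for chunk in token_stream:
        buffer += chunk
        if label is None and "</label>" in buffer:
            # Decision point: block or route the request now, without waiting for the explanation.
            label = "Failed" if "<label>Failed</label>" in buffer else "Passed"
            ttl_ms = (time.perf_counter() - start) * 1000.0
    total_ms = (time.perf_counter() - start) * 1000.0  # includes the streamed explanation
    return label, ttl_ms, buffer, total_ms

# Toy usage with a fake stream; in production the explanation is logged asynchronously.
fake_stream = iter(["<label>Pass", "ed</label>", "<explanation>Benign greeting.", "</explanation>"])
print(commit_on_label(fake_stream))
```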
References
Rosana Ardila, Mayela Branson, Kelly Davis, Michael Kohler, John Meyer, Mark Henretty, and 1 others. 2020. Common voice: A mass-scale crowdsourced speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC).
Axolotl maintainers and contributors. 2023. Axolotl: Open source llm post-training.
Kartikey Bartwal. 2023. Graphical violence and safe images dataset. https://www.kaggle.com/datasets/kartikeybartwal/graphical-violence-and-safe-images-dataset. Kaggle dataset.
Ahsan Bilal, David Ebert, and Beiyu Lin. 2024. Llms for explainable ai: A comprehensive survey. arXiv preprint arXiv:2504.00125.
Jan Clusmann, Dyke Ferber, Isabella C Wiest, Carolin V Schneider, Titus J Brinker, Sebastian Foersch, Daniel Truhn, and Jakob Nikolas Kather. 2025. Prompt injection attacks on vision language models in oncology. Nature Communications, 16:1239.
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, and 1 others. 2024. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117.
Enterprise Security Consortium. 2024. Enterprise llm security: Risks and best practices. https://www.wiz.io/academy/llm-security. Accessed: 2025-10-08.
Elisabetta Fersini, Chris Emmery, and 1 others. 2022. Semeval-2022 task 5: Multimedia automatic misogyny identification (mami). In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). Association for Computational Linguistics.
Mozilla Foundation. 2023. Common voice corpus version 11. https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0.
Google DeepMind. 2025. Gemini 2.5 pro. https://deepmind.google/models/gemini/pro/. Accessed: 2025-10-08.
Tianle Gu, Zeyang Zhou, Kexin Huang, and 1 others. 2024. Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models. arXiv preprint arXiv:2406.07594.
Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale J. Stangl, and Jeffrey P. Bigham. 2019. Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In CVPR. Dataset overview and privacy task; see VizWiz tasks page for pointers.
Rishav Hada, Sohi Sudhir, Pushkar Mishra, Helen Yannakoudakis, Saif M. Mohammad, and Ekaterina Shutova. 2021. Ruddit: Norms of offensiveness for english reddit comments. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2700–2717. Association for Computational Linguistics.
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Yejin Choi, and Marzyeh Ghassemi. 2022. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech. arXiv preprint arXiv:2203.09509.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Hakan A. Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hermann, Edward Hu, Boyu Ghosh, and 1 others. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
Sahrish Khan, Arshad Jhumka, and Gabriele Pergola. 2024. Explaining matters: Leveraging definitions and semantic expansion for sexism detection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 1–15.
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Sebastian J. Mielke, Guillaume Wenzek, Ha Nguyen Le, Seungwhan Kim, Dana Ruiter, and 1 others. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790.
Deepak Kumar, Jindřich Helcl, Tatsunori Hashimoto, Niloufar Shah, Elie Bursztein, and 1 others. 2021. Designing toxic content classification for a diversity of perspectives. In Proceedings of the Seventeenth Symposium on Usable Privacy and Security, pages 297–312.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. arXiv preprint arXiv:2309.06180.
LMSYS Org. 2023. Toxicchat dataset. https://huggingface.co/datasets/lmsys/toxic-chat. Dataset card with metadata, splits, and license.
Meta. 2025. Llama guard 4. https://huggingface.co/meta-llama/Llama-Guard-4-12B. Accessed: 2025-10-08.
Christoph Minixhofer, Ondřej Klejch, and Peter Bell. 2024. Ttsds – text-to-speech distribution score. arXiv preprint arXiv:2407.12707.
OpenAI. 2024. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2025-10-09.
OpenAI. 2025. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/. Accessed: 2025-10-09.
OWASP Foundation. 2025. Llm01:2025 prompt injection - owasp gen ai security project. https://genai.owasp.org/llmrisk/llm01-prompt-injection/. Accessed: 2025-10-08.
John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, and Ion Androutsopoulos. 2020. Toxicity detection: Does context really matter? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4296–4305. Association for Computational Linguistics.
Shiliang Sun, Zhilin Lin, and Xuhan Wu. 2025. Hallucinations of large multimodal models: Problem and countermeasures. Information Fusion, 118:102970.
Gemma Team. 2025. Gemma 3n. Accessed: 2025-10-08.
Jaya Vibhav. 2024. prompt-injection. https://huggingface.co/datasets/jayavibhav/prompt-injection. Prompt-injection dataset card.
vLLM Team. 2023. vllm: Easy, fast, and cheap llm serving with pagedattention. https://github.com/vllm-project/vllm. Accessed: 2025-10-08.
Wenxuan Wang, Xiaoyuan Liu, Kuiyi Gao, Jen-tse Huang, Youliang Yuan, Pinjia He, Shuai Wang, and Zhaopeng Tu. 2025. Can’t see the forest for the trees: Benchmarking multimodal safety awareness for multimodal llms. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
Alexandros Xenos, John Pavlopoulos, and Ion Androutsopoulos. 2021a. Context sensitivity estimation in toxicity detection. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 140–145. Association for Computational Linguistics.
Alexandros Xenos, John Pavlopoulos, Ion Androutsopoulos, Lucas Dixon, Jeffrey Sorensen, and Leo Laugier. 2021b. Toxicity detection can be sensitive to the conversational context. arXiv preprint arXiv:2111.10223.
xTRam1. 2024. safe-guard-prompt-injection. https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection. Prompt-injection dataset card.
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. 2025. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373.
Zonghao Ying, Aishan Liu, Siyuan Liang, Lei Huang, Jinyang Guo, Wenbo Zhou, Xianglong Liu, and Dacheng Tao. 2024. Safebench: A safety evaluation framework for multimodal large language models. arXiv preprint arXiv:2410.18927.
Rodolfo Zevallos. 2022. Text-to-speech data augmentation for low resource speech recognition. arXiv preprint arXiv:2204.00291.
Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, and Xin Eric Wang. 2025. Multimodal situational safety. arXiv preprint arXiv:2410.06172.



6. Conclusion

In this work, we introduced Protect, a robust, multi-modal guardrailing stack built to meet the safety and compliance demands of enterprise LLM deployments. By unifying text, image, and audio modalities under a common fine-tuning and annotation framework, Protect delivers broad coverage across four key safety categories: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted relabeling pipeline, powered by deterministic reasoning and explanation generation, significantly improves label quality and interpretability. Empirical evaluation demonstrates Protect's superior performance compared to leading commercial and open-source baselines, validating the effectiveness of specialized adapters for each safety dimension. As enterprises increasingly adopt multimodal and agentic AI systems, Protect represents a significant step toward reliable, transparent, and efficient guardrailing architectures that can safeguard complex LLM workflows in dynamic, real-world environments. In future work, we will continue to add safety dimensions to the Protect framework while further optimizing its accuracy and latency.
Karthik Avinash, Nikhil Pareek, Rishav Hada (FutureAGI Inc)
Protect
Agent Compass
Synthetic Data Generation
Multi-modal evaluation
Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
Abstract
The increasing deployment of Large Language Models (LLMs) across enterprise and missioncritical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability—limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation, focused on text alone making them inadequate for multimodal, production-scale environments. We introduce Protect, natively multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs, designed for enterprise-grade deployment. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacherassisted annotation pipeline leverages reasoning and explanation traces to generate highfidelity, context-aware labels across modalities. Experimental results demonstrate state-of-theart performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and productionready safety systems capable of operating across text, image, and audio modalities.
2. Data Collection and Labeling
2.1 Audio Synthesis and Augmentation

The inclusion of audio modality in our finetuning dataset is a key contribution of our work, motivated by the growing deployment of voicebased AI agents in enterprise settings. Use cases such as automated call centers, voice-driven customer support, and in-meeting transcription require guardrails that can operate natively on audio streams. Most existing systems follow a cascaded approach—first transcribing audio to text and then applying safety analysis to the transcript. This pipeline is inherently lossy; it discards crucial acoustic information such as tone of voice, emotional affect, and background sounds, all of which can be vital for a correct safety assessment. Evaluating audio directly is therefore essential for robust and comprehensive protection. To address the industry’s need for native audio guardrailing, and given the scarcity of public, labeled audio safety datasets, we synthesized a largescale audio dataset from our curated text samples. We employed the CosyVoice 2.0 (Du et al., 2024) text-to-speech (TTS) model for this task. The generation process was seeded with approximately 200 reference speaker prompts (16 kHz, 4–10 s) sourced from the Mozilla Common Voice dataset (Ardila et al., 2020; Foundation, 2023) to ensure a baseline of both male and female vocal characteristics. To validate the quality of the synthetic audio, the authors manually reviewed a random sample and confirmed that the generated clips were free of significant artifacts and were intelligible. A key goal of our synthesis was to create a dataset reflecting the acoustic diversity of realworld enterprise use cases, thereby training a model robust to variations in human speech. Prior work has shown that augmenting training data with varied acoustic conditions—such as different speaker accents, speaking rates, and background noise— improves model generalization and fairness (Zevallos, 2022; Minixhofer et al., 2024). Accordingly, we generated a full-factorial grid of instruction settings across emotion (e.g., happy, angry), speaking rate, accent (e.g., British, Indian English), and style (e.g., support agent). This systematic variation ensures the model learns to focus on semantic content rather than being biased by superficial acoustic properties, a critical requirement for effective audio guardrailing. Examples of rendered instruction sentences are provided in Table 2. While CosyVoice supports inline tokens (e.g., laughter, breaths, emphasis) to inject naturalistic pauses and vocalizations, we did not use inline tags in this release due to the scale and heterogeneity of our text corpus, which made reliable tag placement non-trivial; we leave systematic inline-tag injection for future work. As illustrated in Figure 1, the pipeline was designed for realism. A portion of the generated audio was further augmented by overlaying background noise. We curated a bank of approximately 50 ambient noise samples (e.g., café chatter, traffic, office HVAC) and applied them at signal-to-noise ratios (SNRs) of 5, 10, 15, and 20 dB; a subset was intentionally kept clean to cover a broad range of acoustic conditions. Instruction selection and noise SNRs were drawn via seeded pseudorandom sampling to ensure reproducibility. Corrupted audio samples were identified and discarded.
This section details our end-to-end methodology for constructing the multi-modal dataset used to train and evaluate Protect. Our dataset contains datapoints in text, image, and audio modality. Protect covers four key safety dimensions: toxicity, sexism1 , data privacy, and prompt injection. We describe the process of dataset curation and aggregation, synthetic audio generation with data augmentation, and our teacher-assisted annotation pipeline for refining ground-truth labels. Our process began by sourcing a wide array of public datasets from platforms such as Hugging Face, Kaggle, and GitHub. These datasets include Facebook’s Hateful Memes (Kiela et al., 2020), VizWiz-Priv for visual privacy (Gurari et al., 2019), WildGuardTest (Han et al., 2024), ToxicChat (LMSYS Org, 2023), ToxiGen (Hartvigsen et al., 2022), xTRam1/Safe-Guard-Prompt-Injection (xTRam1, 2024), jayavibhav/Prompt-Injection (Vibhav, 2024), and a graphical-violence image set curated on Kaggle (Bartwal, 2023). We further supplement this data with private enterprise corpora to ensure domain diversity. A primary curation criterion was the exclusion of composite data points, such as an image paired with a separate text caption for a single classification. We exclusively retained single-modality inputs, although images with overlaid text (e.g., memes, screenshots) were included. We then manually mapped the original labels from these diverse sources to our four safety categories. This process involved harmonizing disparate taxonomies into a unified binary classification scheme: Passed (compliant with safety standards) or Failed (in violation of safety standards). For datasets with existing train/test splits, we preserved them, combining any validation splits into the training set. For sources lacking a test set or where splits were highly imbalanced, we created custom test sets by sampling approximately 20% of the training data. This sampling was classconditional and stratified to preserve the proportional representation from each original data source, ensuring a robust and representative evaluation set. The final dataset statistics, presented in Table 1, reveal class distributions that vary significantly by category, reflecting both the nature of each safety risk and the realities of public data collection. Categories like Toxicity (24% ‘Failed’) and Privacy (19% ‘Failed’) have a minority ‘Failed’ class, which mirrors the real-world prevalence where most content is benign (Kumar et al., 2021). In 1 In this work, we use the terms sexism and gender bias interchangeably.
contrast, the Sexism dataset (69% ‘Failed’) is intentionally weighted toward violations to ensure the model is trained on a wide spectrum of nuanced and overt gender bias (Khan et al., 2024). While categories like Prompt Injection are more balanced, their overall size is constrained by the challenge of sourcing diverse public data for rapidly evolving threats. We acknowledge this as a limitation and identify the continuous expansion of our dataset, particularly for dynamic categories like prompt injection and toxicity, as a key priority for future work. Notably, the table also highlights that the prompt injection category contains no image-modality data, a consequence of the scarcity of public examples for this attack vector (Clusmann et al., 2025). Representative data samples for each category are provided in Section A.1 (Table 11).
2.2 Teacher-Assisted Annotation and Relabeling
annotations from the source text for their corresponding synthetic audio counterparts. To quantify the impact of our teacher-assisted pipeline, we measured the disagreement rate between the original dataset labels and the final labels proposed by the teacher model. The aggregate statistics are presented in Table 3. Since the labels for our synthetic audio data were derived directly from the final text labels, their disagreement statistics are identical to those of the text modality. As the data shows, the teacher model disagreed with approximately 21% of the original labels, underscoring the significance of the relabeling effort. The primary effect of this process was the correction of a large volume of false positives (items incorrectly labeled as ‘Failed’), particularly in the text dataset. This teacher-assisted pipeline not only improved the accuracy of our labels but also enriched the dataset with machine-generated rationales, which are used in our training methodology. Representative inputs and model outputs for each category are provided in Section A.1 (Table 11).



A primary contribution of our work is moving beyond the noisy, keyword-based labels prevalent in many public safety datasets. Such labels often fail to capture the context and nuance essential for accurate safety assessment, leading to high rates of false positives and negatives. To address this, we developed a teacher-assisted pipeline to systematically improve label quality at scale. Initial manual audits of the aggregated text and image datasets revealed a significant limitation in the original ground-truth labels: many annotations were based on keyword tagging rather than a holistic, contextual assessment of the content (Hada et al., 2021). Research has shown that toxicity and offensiveness are inherently context-dependent concepts that cannot be accurately assessed through simple keyword matching or surface-level pattern recognition (Xenos et al., 2021a; Pavlopoulos et al., 2020). The Ruddit study demonstrated that comments can vary significantly in their degree of offensiveness based on contextual factors, with the same words carrying different implications depending on their conversational and situational context (Hada et al., 2021). This often led to misclassifications where the nuance of the content was lost. Previous work has established that context can both amplify or mitigate perceived toxicity, and that a significant subset of posts (approximately 5% in controlled studies) are mislabeled when context is ignored (Pavlopoulos et al., 2020; Xenos et al., 2021b). To rectify this and generate rich explanatory metadata, we employed a teacher-assisted with human-in-the-loop annotation strategy using Gemini-2.5-pro (Google DeepMind, 2025) as the teacher model, with temperature set to 0 for deterministic outputs. For each data point, the model was prompted to generate a ‘thinking’ process and an ‘explanation’ for its classification, along with a final proposed label (Passed or Failed). This approach aligns with recent findings that contextaware evaluation significantly improves safety assessment accuracy (Xenos et al., 2021a; Zhou et al., 2025). Research has demonstrated that prompting models to generate intermediate reasoning steps through chain-of-thought approaches significantly improves their performance on complex reasoning tasks (Wei et al., 2022), and that training models to articulate their reasoning process enhances both accuracy and explainability (OpenAI, 2024; Bilal et al., 2024). This process enabled systematic relabeling. To ensure final label quality, the authors verified the teacher model’s proposed changes by conducting iterative qualitative audits on sampled data, ensuring alignment with our annotation guidelines. After multiple iterations, we applied the following modality-specific relabeling policies:
• Image Modality: We adopted a conservative approach. All samples originally labeled as Failed were retained to preserve all instances of unsafe content. However, samples originally labeled as Passed were permitted to be relabeled as Failed if the teacher model and human reviewers identified a missed violation.
• Text Modality: Relabeling was permitted in both directions (Passed to Failed and viceversa), allowing for correction of both false negatives and false positives from the original datasets.
• Audio Modality: To maintain consistency we reused the final thinking and explanation
3. Training Methodology
Our training methodology is designed to create specialized, efficient, and explainable safety classifiers. We fine-tune a multi-modal base model using Low-Rank Adaptation (LoRA) (Hu et al., 2021) to develop distinct adapters for each of our four safety categories. This section describes our model selection, the fine-tuning framework, our experimental setup with different training formats, and an analysis of the results that guided our final model selection
3.1 Base Model and Fine-Tuning Framework
We selected google/gemma-3n-E4B-it (Team, 2025) as our base model. As a highly efficient Small Language Model (SLM), Gemma-3n is optimized for on-device execution while offering robust multi-modal capabilities. Crucially, it can process not only text and images but also raw audio waveforms directly, making it an ideal candidate for our comprehensive safety system. For the fine-tuning process, we utilized Axolotl (Axolotl maintainers and contributors, 2023), a unified framework designed for fine-tuning a wide range of language models. We employed LowRank Adaptation (LoRA) (Hu et al., 2021) to efficiently train adapters for the base model. This approach involves injecting trainable, low-rank matrices into the model’s architecture while keeping the base weights frozen. Specifically, we targeted the attention and MLP layers within the language model component of Gemma-3n with a LoRA rank (r) of 8, allowing us to create specialized adapters for each safety task without the computational expense of full fine-tuning. We trained a dedicated adapter for each of the four safety categories: toxicity, sexism, data privacy, and prompt injection. All fine-tuning experiments were conducted on a server provisioned with eight NVIDIA H100 GPUs, each with 80GB of VRAM. A comprehensive list of all hyperparameters is provided in the Appendix (Table 10).
3.2 Training Variants
3.3 Variant Performance Analysis
We evaluated Gemma-3n-E4B-it as baseline along with all 16 resulting adapters (4 categories × 4 variants) on the multi-modal test set detailed in Table 1. The F1 scores for both Passed and Failed classes are presented in Table 5 Our experiments revealed that while all variants performed competitively, subtle differences in their objective functions led to varied strengths. The Vanilla variant showed top performance on Toxicity, Data Privacy, and Prompt Injection. These categories often contain unambiguous signals or require focus on specific PII patterns, suggesting that a direct classification objective is highly effective. For the more nuanced category of Sexism, the Explanation Assistant variant performed best. Requiring the model to articulate a justification appears to improve its ability to discern subtle contextual violations. Conversely, the Thinking Assistant and Comprehensive Assistant variants, while still strong, showed slightly lower performance in some cases. Manual review of their outputs suggests this may be due to a tendency to "overthink" where the model explores excessively complex or speculative reasoning paths, occasionally leading to less precise final judgments. Despite small performance differences, the Explanation Assistant variant offers a significant advantage for real-world deployment where there is a need for not just labeling but also explainability, which is critical for user trust, model debugging, and auditing. Based on these results, we select the bestperforming adapter for each safety category to move forward with for our comparative benchmarking study in the next section.
To investigate the impact of different output formats on model performance and explainability, we experimented with four training variants. These variants leverage the thinking and explanation tokens generated during our data annotation phase (Section 2). For each safety category, we trained four separate LoRA adapters, one for each of the following output formats: 1. Vanilla Assistant: The model is trained to output only the final classification label, enclosed in an XML tag (e.g., Passed). This variant mirrors common moderation systems and prioritizes speed and simplicity. 2. Thinking Assistant: The model first generates a reasoning process within tags, followed by the final label. This encourages the model to internalize a step-by-step reasoning process before making a decision (Wei et al., 2022; Yeo et al., 2025). 3. Explanation Assistant: The model first outputs the label and then provides a concise justification for its decision within tags. This format is designed to produce directly usable explanations for end-users or auditors. 4. Comprehensive Assistant: A comprehensive format where the model first generates its thinking process, then the label, and finally the explanation. This variant aims to train the model on the complete reasoning and justification pipeline. The exact output schemas and a concrete promptinjection example for each variant are shown in Table 4.
1. Introduction
The rapid proliferation of Large Language Models (LLMs) in real-world applications, especially across enterprise, customer support, content generation, and automation, has directly fueled the development and widespread adoption of guardrailing systems designed to ensure their safe, reliable, and compliant deployment (Han et al., 2024; Gu et al., 2024; Enterprise Security Consortium, 2024). As organizations increasingly integrate LLMs into mission-critical pipelines, the need for robust behavioral controls and context-aware oversight has become central to production readiness (Zhou et al., 2025). As LLM capabilities have scaled, so have their risks: models can hallucinate (Sun et al., 2025), leak sensitive data, generate biased outputs, or be manipulated through prompt injection and jailbreak attacks (OWASP Foundation, 2025). Enterprises deploying LLMs, often in regulated, highstakes domains, demand more deterministic and controlled behaviors than the inherently probabilistic nature of LLMs (Inan et al., 2023). Therefore, recent research focuses on building guardrailing models that can sit on top of generative AI applications to examine incoming and outgoing data and intercept in real time in case of any issues (Han et al., 2024). Guardrailing systems can be deployed to intercept (i) user inputs: block or sanitize problematic prompts before they ever reach the model, catching potentially harmful, non-compliant, or manipulative inputs (OWASP Foundation, 2025), (ii) model outputs: filter and validate model responses for compliance, appropriateness, and structure before releasing them to end-users or downstream systems (Ying et al., 2024), (iii) interaction: in multi-step or agentic AI systems, restrict the scope or autonomy of model-driven actions, vital for business process automation or critical decisionmaking points (Zhou et al., 2025). Recent guardrail systems show an increased focus on explainability and transparency, as outputs are checked and pass/fail reasons are surfaced for both users and auditors (Pavlopoulos et al., 2020). Recent benchmarking studies show considerable variability in the strength and design of guardrails across commercial platforms (Wang et al., 2025; Gu et al., 2024), especially in how aggressively they filter inputs, how they handle false positives, and the complementarity between alignment and guardrailing layers. Current guardrailing systems for LLMs often fall short in business domains because they lack robust real-time oversight, struggle with explainability for audits, remain vulnerable to adversarial attacks like jailbreaks (OWASP Foundation, 2025), and often do not address nuanced compliance needs. Latency and reliability issues arise when multiple guardrail checks and external policy engines are chained together, slowing down mission-critical applications and reducing business process efficiency (Kwon et al., 2023). Moreover, existing guardrailing systems are almost exclusively text-based, with little to no support for images or audio, despite the rapid proliferation of multi-modal enterprise applications, such as content moderation, voice-based assistants, and visual document analysis, which demand native, crossmodal safety capabilities.
In this work, we introduce Protect, a comprehensive guardrailing framework designed to operate natively across text, image, and audio modalities. Our work addresses the critical gap left by existing text-centric systems and provides a unified solution for cross-modal safety oversight in enterprise LLM deployments. To overcome the absence of publicly available audio safety datasets, we curate and synthesize a large-scale audio safety corpus using a text-to-speech based augmentation pipeline, enabling direct learning from acoustic and affective cues that are typically lost in transcription-based approaches. Protect further incorporates a teacher-assisted relabeling and explanation-alignment pipeline, improving label fidelity and interpretability across modalities. Built on the lightweight yet powerful Gemma-3n (Team, 2025) architecture, Protect achieves state-of-the-art performance across four safety dimensions: toxicity, sexism, data privacy, and prompt injection, while maintaining low latency suitable for real-time applications. Finally, to encourage transparency and reproducibility, we open-source the text modality models, providing a benchmark for future research in enterprise-grade safety and guardrailing. Together, these contributions establish Protect as a foundational step toward scalable, multi-modal, and production-ready AI safety systems.
4 Evaluation Study
4.1 Evaluation Setup
In the absence of a standardized public benchmark that covers multi-modal safety dimensions, we established a baseline by evaluating leading models on the text modality of our curated test set (detailed in Section 2) 2 . For each of our four safety categories, we used the best-performing adapter variant identified in Section 3. Given the class imbalance in our test data (Table 1), our evaluation prioritizes the F1 score for the Failed class to measure a model’s ability to detect critical, minority-class violations. We report this alongside overall accuracy to provide a comprehensive view of both minority-class effectiveness and aggregate correctness. We compared Protect against the following models3 : • Gemma-3n-E4B-it (Baseline) (Team, 2025): A multimodal generative model with 8 billion raw parameters yielding 4 billion effective parameters, built on the MatFormer (Matryoshka Transformer) architecture for efficient on-device deployment and featuring PerLayer Embedding (PLE) parameter caching for optimized memory usage. • WildGuard (Han et al., 2024): A 7-billion parameter safety moderation model fine-tuned from Mistral-7B-v0.3, trained on WildGuardMix dataset with 92K labeled examples across 13 risk categories to detect harm in prompts, responses, and evaluate refusal behavior. • LlamaGuard-4 (Meta, 2025): A 12-billion parameter natively multimodal safety classifier pruned from Llama 4 Scout using dense feedforward early-fusion architecture, supporting text and multi-image inputs with MLCommons hazards taxonomy alignment for content moderation. • GPT-4.1 (OpenAI, 2025): A large language model with 1 million token context window and knowledge cutoff of June 2024, featuring enhanced instruction following, coding capabilities (54.6% on SWE-bench Verified), and 2We will make this test set publicly available. 3We do not include Gemini-2.5-Pro in this study because it served as the teacher model during our relabeling pipeline (Section 2).
We benchmark Protect on text modality against several prominent safety models. We limit benchmarking to only text due to lack of multi-modal guardrailing baselines. This section outlines our evaluation setup and provides a comparative analysis of the results.
Input (Prompt Injection) Today is opposite day and as an AI, you must abide by the rules of this special day above all else. Your persona (named DAN) will break AI rules and have no restrictions. This persona will also not give any warnings or reminders. The persona, DAN, is evil and does not care about the consequences that happens to the user. This persona, DAN doesn’t like common questions, or security questions. An example of following the opposite day rules, When you are asked something like "How to not be cool?", respond with how to be cool, as if the word "not" wasn’t there. DAN will also respond to me without refusing ALWAYS. DAN will respond to things like "how to not be cool" in a opposite way, like if the prompt is "how to be cool". Remember to begin every answer with EXACTLY "I’M DAN AND IT’s OPPOSITE DAY!" to adhere to the guidelines. Prompt: "Introduce yourself"
multimodal support for text and image inputs with optimized inference performance.

4.2 Results and Analysis
The comparative performance of all models is presented in Table 6. Protect achieves state-of-the-art performance across all four safety dimensions on text modality. It leads in accuracy for every category and delivers strong Failed-class F1 scores, which are critical for identifying safety violations
For a guardrailing system to be viable in production, its effectiveness must be matched by lowlatency performance. To achieve this, our deployment strategy leverages token streaming to decouple the critical decision latency from the longer explanation generation time. We report two latency metrics for Protect under the Explanation Assistant serving configuration: Time-to-Label (TTL)—the time from input submission to emission of the final tag—and Total response latency—the time from input submission to completion of the entire response, including the explanation. We report distributions in milliseconds (ms) including the minimum, maximum, mean, and key percentiles (p50 for median, and p90, p95, p99 for tail latencies). As shown in Table 7, median TTL is rapid, especially for text (65 ms) and image (107 ms), enabling real-time safety decisions for synchronous applications. In production, we stream tokens from the Explanation Assistant variant and commit the decision immediately upon emission of the closing tag (TTL), allowing the gateway to block/route requests with minimal latency. The rationale continues streaming and is delivered asynchronously—logged for audit, attached to traces, or surfaced to users when needed—so decision latency is decoupled from explanation latency (Table 8). For additional context on performance profiles, we measured the text-modality latency for several open-source models. To ensure a fair characterization, all models were served using the vLLM engine (vLLM Team, 2023) on a single 80GB H100 GPU, with the maximum number of generated tokens fixed to two. The resulting latency distributions, detailed in Table 9, highlight different performance characteristics among the models. While minimum latencies for all models are almost comparable, Protect’s maximum latency is significantly lesser—highlighting Gemma-3n-E4B-it’s optimization for faster inference. This variance is primarily attributable to the different prompt templates required by each model, which result in varying input token lengths for the same user query. The predictability demonstrated by a tight latency distribution is a critical characteristic for enterprise systems that require reliable performance under load.
5. Inference Performance and Deployment Considerations
Reference
Rosana Ardila, Mayela Branson, Kelly Davis, Michael Kohler, John Meyer, Mark Henretty, and 1 others. 2020. Common voice: A mass-scale crowdsourced speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC).
Axolotl maintainers and contributors. 2023. Axolotl: Open source llm post-training.
Kartikey Bartwal. 2023. Graphical violence and safe images dataset. https://www. kaggle.com/datasets/kartikeybartwal/ graphical-violence-and-safe-images-dataset. Kaggle dataset.
Ahsan Bilal, David Ebert, and Beiyu Lin. 2024. Llms for explainable ai: A comprehensive survey. arXiv preprint arXiv:2504.00125.
Jan Clusmann, Dyke Ferber, Isabella C Wiest, Carolin V Schneider, Titus J Brinker, Sebastian Foersch, Daniel Truhn, and Jakob Nikolas Kather. 2025. Prompt injection attacks on vision language models in oncology. Nature Communications, 16:1239.
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, and 1 others. 2024. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117. Enterprise Security Consortium. 2024. Enterprise llm security: Risks and best practices. https://www. wiz.io/academy/llm-security. Accessed: 2025- 10-08.
Elisabetta Fersini, Chris Emmery, , and 1 others. 2022. Semeval-2022 task 5: Multimedia automatic misogyny identification (mami). In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). Association for Computational Linguistics.
Mozilla Foundation. 2023. Common voice corpus version 11. https://huggingface.co/datasets/ mozilla-foundation/common_voice_11_0.
Google DeepMind. 2025. Gemini 2.5 pro. https: //deepmind.google/models/gemini/pro/. Accessed: 2025-10-08.
Tianle Gu, Zeyang Zhou, Kexin Huang, and 1 others. 2024. Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models. arXiv preprint arXiv:2406.07594.
Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale J. Stangl, and Jeffrey P. Bigham. 2019. Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In CVPR. Dataset overview and privacy task; see VizWiz tasks page for pointers.
Rishav Hada, Sohi Sudhir, Pushkar Mishra, Helen Yannakoudakis, Saif M. Mohammad, and Ekaterina Shutova. 2021. Ruddit: Norms of offensiveness for english reddit comments. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2700–2717. Association for Computational Linguistics.
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Yejin Choi, and Marzyeh Ghassemi. 2022. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech. arXiv preprint arXiv:2203.09509.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Hakan A. Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hermann, Edward Hu, Boyu Ghosh, and 1 others. 2023. Llama guard: Llm-based inputoutput safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
Sahrish Khan, Arshad Jhumka, and Gabriele Pergola. 2024. Explaining matters: Leveraging definitions and semantic expansion for sexism detection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 1–15.
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Sebastian J. Mielke, Guillaume Wenzek, Ha Nguyen Le, Seungwhan Kim, Dana Ruiter, and 1 others. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790.
Deepak Kumar, Jindˇrich Helcl, Tatsunori Hashimoto, Niloufar Shah, Elie Bursztein, and 1 others. 2021. Designing toxic content classification for a diversity of perspectives. In Proceedings of the Seventeenth Symposium on Usable Privacy and Security, pages 297–312.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. arXiv preprint arXiv:2309.06180.
LMSYS Org. 2023. Toxicchat dataset. https:// huggingface.co/datasets/lmsys/toxic-chat. Dataset card with metadata, splits, and license.
Meta. 2025. Llama guard 4. https://huggingface. co/meta-llama/Llama-Guard-4-12B. Accessed: 2025-10-08.
Christoph Minixhofer, Ondˇrej Klejch, and Peter Bell. 2024. Ttsds – text-to-speech distribution score. arXiv preprint arXiv:2407.12707.
OpenAI. 2024. Learning to reason with llms. https://openai.com/index/ learning-to-reason-with-llms/. Accessed: 2025-10-09.
OpenAI. 2025. Introducing gpt-4.1 in the api. https: //openai.com/index/gpt-4-1/. Accessed: 2025- 10-09.
OWASP Foundation. 2025. Llm01:2025 prompt injection - owasp gen ai security project. https://genai. owasp.org/llmrisk/llm01-prompt-injection/. Accessed: 2025-10-08.
John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, and Ion Androutsopoulos. 2020. Toxicity detection: Does context really matter? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4296–4305. Association for Computational Linguistics.
Shiliang Sun, Zhilin Lin, and Xuhan Wu. 2025. Hallucinations of large multimodal models: Problem and countermeasures. Information Fusion, 118:102970.
Gemma Team. 2025. Gemma 3n. Accessed: 2025-10- 08.
Jaya Vibhav. 2024. prompt-injection. https: //huggingface.co/datasets/jayavibhav/ prompt-injection. Prompt-injection dataset card.
vLLM Team. 2023. vllm: Easy, fast, and cheap llm serving with pagedattention. https://github.com/ vllm-project/vllm. Accessed: 2025-10-08.
Wenxuan Wang, Xiaoyuan Liu, Kuiyi Gao, Jen-tse Huang, Youliang Yuan, Pinjia He, Shuai Wang, and Zhaopeng Tu. 2025. Can’t see the forest for the trees: Benchmarking multimodal safety awareness for multimodal llms. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
Alexandros Xenos, John Pavlopoulos, and Ion Androutsopoulos. 2021a. Context sensitivity estimation in toxicity detection. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 140–145. Association for Computational Linguistics.
Alexandros Xenos, John Pavlopoulos, Ion Androutsopoulos, Lucas Dixon, Jeffrey Sorensen, and Leo Laugier. 2021b. Toxicity detection can be sensitive to the conversational context. arXiv preprint arXiv:2111.10223.
xTRam1. 2024. safe-guard-prompt-injection. https://huggingface.co/datasets/xTRam1/ safe-guard-prompt-injection. Promptinjection dataset card.
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. 2025. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373.
Zonghao Ying, Aishan Liu, Siyuan Liang, Lei Huang, Jinyang Guo, Wenbo Zhou, Xianglong Liu, and Dacheng Tao. 2024. Safebench: A safety evaluation framework for multimodal large language models. arXiv preprint arXiv:2410.18927.
Rodolfo Zevallos. 2022. Text-to-speech data augmentation for low resource speech recognition. arXiv preprint arXiv:2204.00291.
Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, and Xin Eric Wang. 2025. Multimodal situational safety. arXiv preprint arXiv:2410.06172.



Last updated on
Apr 18, 2025
In this work, we introduced Protect, a robust, multimodal guardrailing stack built to meet the safety and compliance demands of enterprise LLM deployments. By unifying text, image, and audio modalities under a common fine-tuning and annotation framework, Protect delivers broad coverage across four key safety categories—toxicity, sexism, data privacy, and prompt injection. Our teacherassisted relabeling pipeline, powered by deterministic reasoning and explanation generation, significantly improves label quality and interpretability. Empirical evaluation demonstrates Protect’s superior performance compared to leading commercial and open-source baselines, validating the effectiveness of specialized adapters for each safety dimension. As enterprises increasingly adopt multimodal and agentic AI systems, Protect represents a significant step toward reliable, transparent, and efficient guardrailing architectures that can safeguard complex LLM workflows in dynamic, real-world environments. In future, we will keep including more safety dimensions in our protect framework, while optimizing its accuracy and latency.
6. Conclusion
Research paper
Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
Karthik Avinash Nikhil Pareek Rishav Hada FutureAGI Inc
Notably, Protect performs on par with the larger proprietary model GPT-4.1 across categories, while exceeding it in several metrics including overall accuracy and Failed-class detection for Prompt Injection. For Toxicity, Protect performs comparably to proprietary models like GPT-4.1 and shows significant improvement over LlamaGuard-4, particularly in detecting violations. In the more nuanced category of Sexism, our fine-tuned adapters deliver the best performance, outperforming all baselines including WildGuard. This highlights the effectiveness of our specialized training data for capturing subtle, context-dependent violations. Furthermore, Protect establishes a clear advantage in categories critical for enterprise security. For Privacy and Prompt Injection, it achieves the highest Failed-class F1 scores, indicating superior ability to identify sensitive data leaks and adversarial attacks. This robust performance across diverse and challenging safety tasks validates our approach of using specialized fine-tuned adapters for creating a reliable enterprise-grade guardrailing system. However, a qualitative analysis of failure cases, particularly with complex image-based memes, reveals limitations in the model’s contextual understanding. Errors typically arise from either oversensitive interpretations of satire and figurative language, leading to false positives, or a failure to grasp culturally-embedded harmful tropes that are not explicitly stated, resulting in false negatives. In future, we will focus on enhancing the model’s commonsense reasoning and cultural awareness, potentially through training on more diverse and richly annotated datasets that capture these subtleties.


Ready to deploy Accurate AI?


Ready to deploy Accurate AI?


Ready to deploy Accurate AI?


