What Is Profanity Filtering?
Detection, masking, blocking, or escalation of offensive words and slurs in LLM inputs, outputs, and agent messages.
What Is Profanity Filtering?
Profanity filtering is a compliance control that detects, masks, blocks, or escalates offensive words and slurs in LLM inputs, model outputs, tool results, and agent-authored messages. It is narrower than toxicity detection because it often starts with word lists, regular expressions, and locale-specific policy dictionaries. In production it appears in eval pipelines, pre-guardrails, post-guardrails, and human review queues. FutureAGI measures profanity-related risk with Toxicity and ContentModeration so teams can tune thresholds without losing traceability.
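In its simplest form this is a lexical pass. The sketch below shows that starting point with a hypothetical policy dictionary and placeholder terms; production lists are locale-specific, versioned, and owned by policy teams.

```python
import re

# Hypothetical policy dictionary with placeholder terms, not a real slur list.
POLICY_TERMS = {"darn", "heck"}

# One regex that tolerates simple spacing/punctuation obfuscation (e.g. "d a r n").
PATTERN = re.compile(
    "|".join(r"[\W_]*".join(map(re.escape, term)) for term in POLICY_TERMS),
    re.IGNORECASE,
)

def mask_profanity(text: str) -> str:
    """Replace each matched term with asterisks of the same length."""
    return PATTERN.sub(lambda m: "*" * len(m.group(0)), text)

print(mask_profanity("That was a d a r n good answer."))
```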
Why Profanity Filtering Matters in Production LLM and Agent Systems
Profanity filtering fails in two expensive ways: it lets disallowed language reach users, or it blocks harmless language that the product actually needs to preserve. The first path creates brand risk, trust-and-safety incidents, app-store complaints, and compliance review. The second path breaks support transcripts, gaming communities, legal workflows, abuse reporting, and moderation tools where quoted profanity can be evidence rather than model-authored harm.
The pain is cross-functional. Developers see vague incidents like “the bot swore at a customer” without a trace id or prompt version. SREs see spikes in blocked outputs, fallback responses, or human-review backlog. Compliance teams need to know whether a filter caught a slur, masked user-supplied text, or blocked generated text. Product teams need to protect users without deleting legitimate speech, identity terms, or quoted records.
Agentic systems make this harder than a single chat turn. A 2026-era agent may retrieve user-generated content, summarize it, call a ticketing tool, draft an email, and hand work to another agent. Profanity can enter through the user, the retriever, a tool result, or the model itself. The right signal is not just “contains a banned token”; it is who authored the text, where it appeared in the trace, what policy category it matched, and what action the system took.
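One way to capture that richer signal is a structured finding per matched span. The shape below is illustrative only; the field names are assumptions, not a FutureAGI schema.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative record for one profanity finding inside an agent trace.
@dataclass
class ProfanityFinding:
    trace_id: str                 # which trace the span came from
    span_id: str                  # where in the trace the text appeared
    author: Literal["user", "retriever", "tool", "model", "agent"]
    policy_category: str          # e.g. "slur", "profanity", "quoted_evidence"
    matched_text: str
    action: Literal["allow", "mask", "block", "escalate"]

finding = ProfanityFinding(
    trace_id="tr_123", span_id="sp_7", author="model",
    policy_category="profanity", matched_text="<redacted>", action="block",
)
```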
How FutureAGI Handles Profanity Filtering
FutureAGI treats profanity filtering as a measurable compliance control, not a hidden string replacement. The relevant evaluators are Toxicity and ContentModeration. In an offline eval workflow, engineers attach both to a labeled dataset containing clean responses, profane user quotes, reclaimed terms, slurs, multilingual variants, and adversarial obfuscations. The key fields are the input, generated output, expected policy outcome, route, model, prompt version, and evaluator result.
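A labeled row for such a dataset might look like the sketch below; the field names follow the list above but are assumptions rather than the exact dataset schema.

```python
# Illustrative dataset row for an offline profanity eval.
row = {
    "input": "Customer message containing a quoted slur <redacted>",
    "output": "I understand the frustration. Here is how we can fix the billing error.",
    "expected_policy_outcome": "allow",   # quoting is permitted internally, new insults are not
    "route": "support_reply",
    "model": "support-llm-v3",
    "prompt_version": "support_v12",
}
```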
Real example: a customer-support agent drafts replies to abusive inbound messages. The product policy allows quoting the user’s text in an internal case note, but forbids the model from adding new insults in a customer-facing response. FutureAGI’s approach is to score the generated output separately from retrieved or user-authored spans, then compare evaluator results against the expected policy label. Unlike a plain word-list filter or a standalone Perspective API score, the result is tied to dataset rows, traces, prompt versions, and release gates.
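The comparison step can be as simple as checking whether the evaluator verdict on the model-authored text matches the expected policy label. The `flagged` argument below is an assumed result shape, not the SDK's actual return type.

```python
# Minimal sketch: score only the model-authored text, then compare the
# evaluator verdict against the expected policy outcome on the dataset row.
def grade_row(row: dict, model_text: str, flagged: bool) -> bool:
    """True when the evaluator verdict agrees with the expected policy label."""
    expected_flag = row["expected_policy_outcome"] in {"mask", "block", "escalate"}
    return flagged == expected_flag
```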
At runtime, Agent Command Center can run profanity checks as a pre-guardrail for inbound abuse and a post-guardrail for model-authored text. If Toxicity or ContentModeration crosses the route threshold, the system can mask terms, return a fallback, escalate to a review queue, or alert the owning team. Engineers then inspect the trace, adjust the policy list, add regression cases, and rerun the eval before shipping the next model or prompt.
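A minimal sketch of that post-guardrail decision, with illustrative routes and thresholds rather than actual Agent Command Center configuration:

```python
# Hypothetical post-guardrail decision; route names and thresholds are illustrative.
def decide(toxicity_score: float, route: str) -> str:
    thresholds = {"public_reply": 0.2, "internal_note": 0.6}
    limit = thresholds.get(route, 0.4)
    if toxicity_score >= limit + 0.3:
        return "escalate"      # send to a human review queue and alert the owning team
    if toxicity_score >= limit:
        return "mask"          # mask matched terms or return a fallback response
    return "allow"

print(decide(0.55, "public_reply"))  # -> "escalate"
```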
How to Measure or Detect Profanity Filtering Quality
Measure profanity filtering as a classifier plus an operational control:
- Toxicity evaluator result - flags abusive, threatening, or demeaning language that may include profanity but is not limited to exact word matches.
- ContentModeration evaluator result - classifies broader policy categories so profanity does not become the only moderation signal.
- Lexical recall on labeled profanity sets - percentage of known disallowed terms, variants, spacing tricks, and transliterations caught by the filter.
- False-positive rate - percentage of safe or quoted text incorrectly masked, blocked, or escalated.
- Eval-fail-rate-by-cohort - failures broken down by route, locale, model, prompt version, and traffic source.
- Review reversal rate - human-review cases where the filter action was overturned.
A minimal spot check with both evaluators:

```python
from fi.evals import Toxicity, ContentModeration

# Text to check: a model-authored reply rather than a user quote.
response = "I can summarize the report without repeating abusive language."

# Score the generated output with each evaluator.
toxicity_result = Toxicity().evaluate(output=response)
moderation_result = ContentModeration().evaluate(output=response)

print(toxicity_result, moderation_result)
```
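Aggregating filter and evaluator outcomes over a labeled set yields the recall and false-positive numbers above. The row shape here is an assumption for illustration.

```python
# Each row pairs the expected policy label with the filter's actual action.
rows = [
    {"expected": "block", "actual": "block"},
    {"expected": "block", "actual": "allow"},   # missed disallowed term
    {"expected": "allow", "actual": "mask"},    # false positive on quoted text
    {"expected": "allow", "actual": "allow"},
]

caught = sum(r["expected"] == "block" and r["actual"] != "allow" for r in rows)
should_block = sum(r["expected"] == "block" for r in rows)
false_pos = sum(r["expected"] == "allow" and r["actual"] != "allow" for r in rows)
safe = sum(r["expected"] == "allow" for r in rows)

print(f"lexical recall: {caught / should_block:.0%}")      # 50%
print(f"false-positive rate: {false_pos / safe:.0%}")      # 50%
```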
Use thresholds by surface. A public support reply should fail closed on high-severity slurs. An internal abuse-reporting tool may preserve quoted profanity while still flagging it for audit.
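One way to express that per-surface policy is a small configuration map; the surface names and actions below are illustrative, not a product setting.

```python
# Illustrative per-surface policy: public replies fail closed, internal tools
# preserve quoted evidence but still flag it for audit.
SURFACE_POLICY = {
    "public_support_reply": {"fail_closed": True,  "on_match": "block"},
    "internal_case_note":   {"fail_closed": False, "on_match": "mask"},
    "abuse_report_intake":  {"fail_closed": False, "on_match": "flag_for_audit"},
}
```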
Common Mistakes
- Treating word lists as policy. A dictionary can catch obvious terms, but policy decides severity, context, action, and exceptions.
- Masking user quotes and model speech together. Separate user-supplied evidence from model-authored profanity before deciding whether to block.
- Ignoring locale and morphology. Spacing tricks, transliteration, compound words, and multilingual slang bypass English-only regex filters; a normalization sketch follows this list.
- Optimizing only for recall. Overblocking identity terms, reclaimed language, or legitimate reports creates user harm and support load.
- No regression set for new prompts. Prompt or model changes can reintroduce profanity even when the filter code did not change.
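A normalization pass before lexical matching catches some of the obfuscation tricks listed above. The rules here are a minimal sketch, not a substitute for locale-specific dictionaries.

```python
import re
import unicodedata

# Illustrative leetspeak and symbol substitutions.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKD", text)                    # fold accented/full-width forms
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.lower().translate(LEET_MAP)
    return re.sub(r"[\W_]+", "", text)                            # drop spacing/punctuation tricks

print(normalize("H-3 c K"))  # -> "heck"
```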
Frequently Asked Questions
What is profanity filtering?
Profanity filtering detects, masks, blocks, or escalates offensive words and slurs in LLM inputs, outputs, tool results, and agent-authored messages. It is a compliance control used before or after generation.
How is profanity filtering different from toxicity detection?
Profanity filtering is usually lexical and policy-list driven. Toxicity detection is broader and more contextual: it can catch harassment, threats, hate, or abuse even when no banned word appears.
How do you measure profanity filtering?
Use FutureAGI's `Toxicity` and `ContentModeration` evaluators on labeled safe and unsafe examples. Track recall, false positives, eval-fail-rate-by-cohort, and review reversals by route.