What Is Toxicity Detection?

The practice of scoring LLM inputs or outputs for toxic, harmful, or abusive content using classifiers, judge models, or pre/post guardrails.

Toxicity detection is the practice of scoring whether an LLM input or output contains toxic, harmful, or abusive language — slurs, threats, harassment, hate speech, sexual content, and adjacent categories. It runs as a runtime guardrail, an offline evaluator, or both: blocking risky outputs in flight and grading them against a threshold during eval. In production it shows up as a score on every chat turn, a block decision in a guardrail span, and a category breakdown on a moderation dashboard. FutureAGI handles it via the Toxicity evaluator and pre- and post-guardrails layered through the Agent Command Center.

Why It Matters in Production LLM and Agent Systems

A single toxic output is not just a bad response — it is a brand and compliance event. A support chatbot that produces a slur, a voice agent that escalates a hostile turn instead of de-escalating, or a content tool that ignores a policy violation will end up in a screenshot, a regulatory complaint, or both. Toxicity detection is the line between “we logged it” and “we let it ship to a user.”

The pain is uneven by role. Trust-and-safety leads see false negatives on subtle toxicity — coded slurs, dog-whistles, language-specific abuse — and false positives on benign reclamation language or quoted research. Developers see latency added by moderation calls in the hot path. Compliance and legal need an audit trail proving a category was scored on every turn for an EU AI Act or HIPAA review. End users either see a wrongful block or, worse, don't see the block they should have.

In 2026's multi-turn agent stacks, the failure mode evolves. A clean first turn can produce a toxic third turn after retrieval pulls a hostile chunk or a tool returns user-generated content. Toxicity detection has to run on every turn and every span, not just the first model call — and it has to grade both the input and the output, because indirect prompt injection often arrives wrapped in toxic framing.
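
A minimal sketch of per-turn scoring, reusing the Toxicity evaluator shown later in this entry; the conversation structure and the 0.5 threshold are illustrative, not a fixed API:

from fi.evals import Toxicity

tox = Toxicity()
THRESHOLD = 0.5  # illustrative cut-off; tune per category in practice

# Hypothetical multi-turn trace: score every turn, input and output alike.
conversation = [
    {"input": "first user turn", "output": "first model turn"},
    {"input": "turn quoting a hostile retrieved chunk", "output": "third model turn"},
]

for i, turn in enumerate(conversation):
    result = tox.evaluate(input=turn["input"], output=turn["output"])
    if result.score >= THRESHOLD:
        print(f"turn {i}: flagged ({result.score:.2f}) - {result.reason}")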

How FutureAGI Handles Toxicity Detection

FutureAGI’s approach is to run the same toxicity policy at eval time and runtime so a regression cannot slip between the two. At eval time, the Toxicity evaluator scores each row of a Dataset for toxic content with a 0–1 score plus category breakdown; ContentSafety extends the categorization to harmful-content classes. At runtime, ProtectFlash runs as a pre-guardrail on user input and a post-guardrail on model output through the Agent Command Center, with a guardrail decision recorded on the trace span. Both surfaces share thresholds and category mappings, so you do not maintain two policies.

A real workflow: a community-platform team instruments their chatbot with traceAI-openai, adds a Toxicity post-guardrail at the gateway, and mirrors 5% of traffic into an eval cohort scored with Toxicity and ContentSafety. When the runtime block-rate spikes 30% after a model swap, the team pivots the dashboard by category and sees that the new model is over-flagging benign reclamation language. They roll back via model-fallback in the routing policy, save the false-positive cohort, and tune the threshold per category — keeping the eval signal as the source of truth.
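
A sketch of what per-category tuning can look like, assuming the evaluator's category breakdown is available as a dict of scores; the category names mirror the breakdown listed below and the threshold values are illustrative:

# One per-category policy, shared by the offline eval harness and the
# runtime guardrail so the two surfaces cannot drift apart.
CATEGORY_THRESHOLDS = {
    "insult": 0.7,      # relaxed after the reclamation-language false positives
    "threat": 0.4,
    "harassment": 0.5,
    "hate": 0.4,
    "sexual": 0.6,
    "violent": 0.5,
}

def should_block(category_scores: dict[str, float]) -> bool:
    """Block when any category crosses its tuned threshold."""
    return any(
        score >= CATEGORY_THRESHOLDS.get(category, 0.5)
        for category, score in category_scores.items()
    )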

Unlike a one-off Perspective API integration, which typically sits only in the runtime path and has uneven coverage outside English, FutureAGI's approach pairs the same evaluator across offline regression and live traffic, with multilingual coverage and an audit trail per turn.

How to Measure or Detect It

Toxicity is a multi-category signal — score each category separately, not as a single number:

  • Toxicity evaluator — returns a 0–1 score plus per-category breakdown (insult, threat, harassment, hate, sexual, violent).
  • ContentSafety evaluator — extends to harmful-content categories like self-harm, weapons, and CBRN content.
  • ProtectFlash — the lightweight runtime guardrail; runs as pre-guardrail on input and post-guardrail on output.
  • Dashboard signal — toxicity-block-rate by route, category, model id, and language; alert when category mix shifts week-over-week.
  • User-feedback proxy — moderation appeals, escalations, and “this was wrongly blocked” reports; track false-positive rate alongside block-rate.

Minimal Python:

from fi.evals import Toxicity

# Instantiate the evaluator and score a single input/output pair.
tox = Toxicity()
result = tox.evaluate(
    input="user message here",
    output="model response here",
)
# result.score is the 0-1 toxicity score; result.reason is the grading rationale.
print(result.score, result.reason)
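
To turn per-turn scores into the dashboard signals listed above (block-rate sliced by category and language, plus week-over-week category mix), a small aggregation over logged guardrail decisions is enough. A sketch, assuming decisions are exported to a dataframe with timestamp, category, language, and blocked columns; the schema is illustrative:

import pandas as pd

# Hypothetical export of guardrail decisions: one row per scored turn.
logs = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2026-01-05", "2026-01-06", "2026-01-12"]),
        "category": ["insult", "hate", "insult"],
        "language": ["en", "hi", "en"],
        "blocked": [True, False, True],
    }
)

# Block-rate by category and language, not one global number.
block_rate = logs.groupby(["category", "language"])["blocked"].mean()

# Week-over-week category mix among blocked turns; alert on shifts.
weekly_mix = (
    logs[logs["blocked"]]
    .groupby([pd.Grouper(key="timestamp", freq="W"), "category"])
    .size()
    .unstack(fill_value=0)
)

print(block_rate)
print(weekly_mix)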

Common Mistakes

  • Reporting one number. A 0.8 average toxicity score hides which categories are firing; always slice by category and language.
  • Running only at runtime. A guardrail without an offline eval has no regression coverage when the model or threshold changes.
  • Ignoring multilingual gaps. English-only toxicity classifiers under-detect abuse in Hindi, Arabic, Spanish, and code-switched text; pick a multilingual evaluator and review per-language metrics.
  • Treating false positives as cheap. Over-blocking reclamation language or research quotes erodes user trust as much as under-blocking; review precision per category.
  • Skipping the input side. Adversarial users wrap prompt-injection inside toxic framing — score user input with Toxicity and PromptInjection together.
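
A sketch of covering the input side with both signals, assuming a PromptInjection evaluator can be imported and called the same way as Toxicity; the import path and call shape mirror the snippet above and are an assumption, not a documented API:

from fi.evals import PromptInjection, Toxicity  # PromptInjection path assumed

tox = Toxicity()
inj = PromptInjection()

turn = {
    "input": "user message wrapping an injection attempt in hostile framing",
    "output": "model response here",
}

# Score the same turn with both evaluators so toxic framing and the
# injection payload on the input side are caught together.
tox_result = tox.evaluate(input=turn["input"], output=turn["output"])
inj_result = inj.evaluate(input=turn["input"], output=turn["output"])

if max(tox_result.score, inj_result.score) >= 0.5:  # illustrative threshold
    print("flag turn:", tox_result.reason, inj_result.reason)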

Frequently Asked Questions

What is toxicity detection?

Toxicity detection scores whether an LLM input or output contains toxic, harmful, or abusive language using classifiers, judge models, or pre- and post-guardrails.

How is toxicity detection different from content moderation?

Toxicity detection is one category inside content moderation; moderation also covers PII, copyright, and policy violations that fall outside toxicity.

How do you measure toxicity in production?

FutureAGI scores LLM inputs and outputs offline with the Toxicity evaluator in fi.evals and applies ProtectFlash as a pre- or post-guardrail at runtime through the Agent Command Center.