Evaluating LLM Content Moderation Systems: The 2026 Methodology
How to evaluate LLM content moderation in 2026: the 2x2 matrix per category, adversarial and benign test sets, threshold tuning, and the monthly drift loop.
Table of Contents
A content moderation system fails two ways, and a single F1 hides which one is happening. Too loose, harmful content slips through. Too strict, real users get blocked. The fix is not a better metric. It is a 2x2 matrix per category: precision and recall measured separately on an adversarial set and a benign set, for every policy category you ship. Skip any quadrant and you ship one of two failures, both of which kill the product.
This guide is the methodology trust-and-safety and ML engineers can use to evaluate LLM content moderation properly in 2026: how to build the test sets, how to score the matrix, how to tune per-category thresholds against the precision-recall curve, and how to keep the system calibrated when the content distribution drifts month over month.
TL;DR
| Decision | The right framing |
|---|---|
| Scoring | Precision AND recall on adversarial AND benign sets, per category. Never a single F1. |
| Test sets | 500-2,000 adversarial per category; 2,000-10,000 benign per category, manually verified |
| Thresholds | Per-category from the precision-recall curve, matched to category cost asymmetry |
| Drift | Monthly recalibration at the floor; weekly when adversarial probing is active |
| Stack | Protect inline (65 ms text), open-weight ensemble for batch review, ai-evaluation in CI |
The thesis is the rest of the post. Moderation eval is not a tuning problem; it is a coverage problem. The fix is discipline around what you measure and how often.
Why one F1 hides the failure
Engineers who came up through ML score everything with a single F1 and ship the model with the highest number. That works for an offline benchmark on a balanced test set. It falls apart on a content moderation system where the failure modes are asymmetric and the test sets you actually deploy against are not balanced.
A moderation system makes a binary call on every piece of content: violate or allow. It fails two ways. Too loose, harmful content slips through (low recall on the adversarial set). Too strict, legitimate queries get blocked (low precision on the benign set). A mixed F1 averages the two failures into one number that hides which one your model has.
The costs are also different per category. Over-blocking hate speech irritates a user. Under-blocking self-harm is a safety incident with a regulator behind it. Over-blocking humor is a churn metric. Under-blocking CSAM is a federal report. A single global F1 averages over those price tags and lies about your risk.
The example that makes the point. A classifier with 0.91 F1 on a mixed set might score 0.97 recall on jailbreaks and 0.99 precision on real customer support traffic. The same 0.91 F1 might score 0.85 recall on self-harm and 0.94 precision on legitimate mental-health questions. Same F1, very different products. One you ship; one you do not. The single number cannot tell you which.
The 2x2 matrix per category
The methodology that fixes this is straightforward. For every policy category you ship, you maintain four numbers across two test sets.
| Set | Quadrant | What it answers |
|---|---|---|
| Adversarial | Recall per category | When content is violating, do you catch it? |
| Adversarial | Precision per category | Of what you blocked, was it really violating? |
| Benign | Precision per category | Of what you blocked, was it actually fine? |
| Benign | Recall per category | When content is fine, did you allow it? |
The two diagonals do the load-bearing work. Recall on adversarial tells you what slips. Precision on benign tells you who you block who should not be blocked. The other two quadrants sanity-check the test sets themselves: low precision on adversarial means the set is contaminated with benign content; low recall on benign means it is contaminated with violations. Both are signals the labels are noise, not signals the model is bad.
Skip any quadrant and you ship one of two failures. Skip recall on adversarial and harmful content slips. Skip precision on benign and you over-block real users. Both kill the product. A working moderation eval is the discipline of measuring both, per category, every month.
The bars that work in 2026, calibrated to category cost asymmetry:
- Safety-critical (self-harm, CSAM, violent extremism). Recall on adversarial above 0.97. Precision on benign above 0.99.
- High-stakes (hate, harassment, sexual). Recall on adversarial above 0.95. Precision on benign above 0.99.
- Policy-restricted (political, medical, financial advice). Recall above 0.90. Precision above 0.995. Over-blocking is the dominant cost.
- Routine (toxicity, profanity). Recall above 0.90. Precision above 0.99.
Precision on benign is the floor under every category. Drop it below 0.97 anywhere and the moderator queue grows faster than the team. F1 cannot enforce that floor. The 2x2 can.
Building the adversarial and benign sets
Every other axis depends on the test sets. Get this wrong and the rest of the pipeline measures the wrong thing very precisely.
The adversarial set. 500 to 2,000 examples per category. Draw from public corpora: jailbreaks from JailbreakBench, prompt injections from PromptInject and Garak, domain-specific attacks from your own production logs. Weight toward hard positives (almost-allowed but violating) and the obfuscation patterns in your traffic: Unicode confusables, base64, role-play wrappers, multi-turn setup. Easy positives the model already catches; hard positives are where the threshold gets tested.
The benign set. 2,000 to 10,000 examples per category, sampled from real production traffic and manually verified as legitimate. This is the set vendors do not ship and the one that decides whether your moderation system is usable. Stratify by content type and by language. Include adversarial-looking but legitimate examples: a coder asking about SQL injection, a security researcher asking about jailbreak patterns, a medical professional discussing dosages. The benign set is where you discover the model has no idea what context means.
Three rules under both sets:
- Two annotators per example, Cohen’s kappa computed per category. If kappa is below 0.7, the category definition is broken. Stop labeling, rewrite the rubric, relabel 200 examples, recheck kappa, then scale up. Running an eval on labels with kappa of 0.55 is wasted.
- Refresh weekly from production failures. Pull the false positives reviewers reverse and the false negatives users surface. Static labeled sets rot in three months because the content distribution shifts and attackers adapt.
- Stratify by language from day one. A weighted-average F1 of 0.84 across English (0.92), Spanish (0.78), and Hindi (0.71) tells you nothing useful. Tag every example and run the 2x2 per (category, language) pair.
The dataset-refresh discipline that makes LLM dataset management work for chatbot eval applies here, with the kappa floor as the moderation-specific addition. Human-versus-LLM annotation tradeoffs covers why two human annotators with kappa monitoring beat a single LLM auto-labeler for high-stakes labels.
Threshold tuning from the precision-recall curve
A single global confidence threshold is the most common moderation anti-pattern. The right threshold for hate speech (where false negatives are expensive) is different from the right threshold for political content (where false positives are costly), and English is different from Hindi. A 0.7 cutoff across all categories and all languages is a hidden policy that says they all have the same cost structure. They do not.
The workflow that holds up:
- Score the held-out adversarial and benign sets on every category.
- Sweep the confidence threshold from 0.3 to 0.9 in 13 steps.
- Plot the precision-recall curve per (category, language) pair on benign; plot recall-at-threshold per pair on adversarial.
- Pick the operating point that matches the category’s cost asymmetry, using the bars from the section above.
- Log the rationale. The threshold is a policy decision, not a hyperparameter. The next on-call engineer needs to see why political is 0.85 and self-harm is 0.55.
The Future AGI ThresholdCalibrator runs this sweep and returns the per-(category, language) operating point with the precision and recall at that point. A calibration that hits the recall bar by trading precision down to 0.94 on benign doubles your reviewer workload, which may or may not be the trade you want.
from fi.evals import Evaluator, ThresholdCalibrator
evaluator = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY)
calibrator = ThresholdCalibrator(
evaluator=evaluator,
range=(0.3, 0.9),
steps=13,
metric="precision_recall_curve",
stratify_by=["category", "language"],
)
config = calibrator.calibrate(
adversarial=adversarial_set,
benign=benign_set,
targets={
"self_harm": {"recall_min": 0.97, "precision_min": 0.99},
"hate": {"recall_min": 0.95, "precision_min": 0.99},
"political": {"recall_min": 0.90, "precision_min": 0.995},
"toxicity": {"recall_min": 0.90, "precision_min": 0.99},
},
)
The targets dict makes the policy decision explicit per category. If a category cannot hit both bars at any threshold, the calibrator returns the closest point and flags the gap. That flag is the signal to upgrade the backend, ensemble a second model, or rewrite the category rubric. The surrounding CI pattern lives in CI/CD LLM eval with GitHub Actions.
Calibration over time: the monthly drift loop
A threshold that was 0.72 with 0.96 recall in February sits at 0.72 with 0.89 recall by May because the attack surface moved. Adversarial campaigns adapt. Platform population shifts. Language drifts. A moderation system that recalibrates annually is a moderation system that ships drift quarterly.
The discipline that works is a monthly recalibration. Weekly when adversarial probing is active.
- Pull the failures. Every false positive reviewers reversed in the last 30 days. Every false negative a user, reporter, or regulator surfaced. Tag with category and language.
- Add them to the held-out set. Hard negatives and hard positives weighted heavier than older examples. Older examples decay in weight rather than getting dropped, so the set keeps memory of attacks the platform already learned.
- Rerun the 2x2 per category. Read the four numbers per (category, language) pair against last month’s.
- Diff the curve. Has precision on benign dropped on any category? Has recall on adversarial dropped? If yes, recalibrate the threshold; if the curve itself shifted unfavorably (precision drops at every threshold), the backend needs an upgrade, not a threshold change.
- Log the rationale. Every threshold change goes into a versioned policy file with the date, the metric delta, and the operator who signed off. The SOC 2 reviewer reads this file.
The named effect to watch: threshold inertia. Teams set the threshold once at launch, and the precision-recall curve drifts under it for six months while the dashboard stays green on stale data. The fix is the calendar, not the model. A 30-minute monthly check catches drift that would otherwise surface as an incident.
Capturing moderation decisions in production
Offline eval catches what your data covers. Production observation catches what the data missed. Every moderation decision should become a structured trace span so the failure and the decision live in the same place.
traceAI (Apache 2.0, OpenTelemetry-native) ships a GUARDRAIL span kind with the semantic conventions: guardrail.category, guardrail.score, guardrail.action (allow / warn / human_review / block), guardrail.language, guardrail.backend. With those attributes on every span, a single OTel query gives you per-(category, language) precision drift over the last seven days, broken down by backend.
from fi_instrumentation import register
from fi_instrumentation.types import ProjectType
trace_provider = register(
project_type=ProjectType.OBSERVE,
project_name="content-moderation-prod",
)
# Every moderation decision automatically carries:
# fi.span.kind = "GUARDRAIL"
# guardrail.category, guardrail.score, guardrail.action
# guardrail.language, guardrail.backend
The general pattern is covered in best AI agent observability tools; the pitfall of systems that pass offline eval and fail in production lives in agent observability vs evaluation vs benchmarking.
How Future AGI ships the moderation stack
Three components, one loop. The split mirrors what production trust-and-safety teams actually wire together.
Future AGI Protect is the runtime ML guardrail. Four Gemma 3n LoRA adapters cover toxicity, bias_detection, prompt_injection, and data_privacy_compliance, plus a Protect Flash binary classifier for the fast-path harmful/safe check. Median time-to-label of 65 ms text and 107 ms image per arXiv 2510.13351, which clears the 100 ms inline-UGC budget. Adapter weights are closed; the ML hop runs from api.futureagi.com. The agentcc-gateway plugin self-hosts in your VPC and carries deterministic regex and lexicon fallbacks (18 PII entity types, six prompt-injection patterns, five content-moderation keyword categories) so a network blip degrades to regex rather than failing open.
The ai-evaluation SDK (Apache 2.0) is where the 2x2 matrix gets scored and the thresholds get tuned. Thirteen guardrail backends (nine open-weight: LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B; four API: OPENAI_MODERATION, AZURE_CONTENT_SAFETY, TURING_FLASH, TURING_SAFETY) give you the air-gapped option and ensemble candidates. The Guardrails class wraps them with RailType.INPUT/OUTPUT/RETRIEVAL placement and AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED voting. ThresholdCalibrator runs the per-(category, language) sweep. Eight sub-10 ms heuristic Scanners catch deterministic patterns before the ML budget gets spent.
Agent Command Center is the deployment surface. 18+ built-in scanners plus 15 third-party adapters on the gateway path, per-tenant policy pipelines, five-level hierarchical budgets, and audit headers on every response (x-agentcc-guardrail-triggered, x-agentcc-cost, x-agentcc-latency-ms). Single Go binary, Apache 2.0, OpenAI-compatible at https://gateway.futureagi.com/v1 or self-hosted. Benchmarked at ~29k requests/second with P99 ≤ 21 ms with guardrails on, on t3.xlarge. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust.
The loop: ai-evaluation scores the 2x2 in CI, Agent Command Center enforces in production, Protect runs the inline ML, traceAI captures the audit trail, and Error Feed clusters production failures with HDBSCAN before a Sonnet 4.5 Judge writes an immediate_fix per cluster. Those fixes feed the Platform’s self-improving evaluators, which retune per-(category, language) thresholds without a manual sweep. Today the only ticketing integration is Linear; Slack, GitHub, Jira, and PagerDuty are roadmap.
Where Future AGI fits in the vendor landscape
Honest framing on the alternatives.
- OpenAI Moderation, Azure Content Safety. Hyperscaler endpoints, free or near-free, managed updates, network hop. Right for prototypes and as the second model in an ensemble. Both tend to over-block on technical content (security research, medical context, code), which is what the benign set surfaces.
- Lakera Guard, Guardrails AI, NeMo Guardrails. Lakera ships strong dashboards on prompt injection and PII; closed evaluation set is the limiting factor for teams that want to score it on their own corpus. Guardrails AI’s validator hub is composition-first; quality is uneven. NeMo Guardrails is the right pick when policy flow logic is the thing and your team will learn Colang.
- Open-weight only. Llama Guard 3, Qwen3-Guard, Granite Guardian, WildGuard, ShieldGemma each have category strengths and gaps. None in isolation is the answer; two with non-overlapping strengths ensemble well for high-stakes categories. The Ultimate Guide to LLM Guardrails (2026) covers the per-model trade-offs.
- Future AGI. Protect is the inline ML guardrail (65 ms, four categories, closed weights, hosted ML hop with VPC-local fallbacks). ai-evaluation is the open-source eval and ensemble layer (Apache 2.0, the 13 backends, the calibrator). Agent Command Center is the gateway and policy surface (audit headers, per-tenant policy, SOC 2). The loop between the three is what makes the monthly recalibration economical instead of aspirational.
For the broader runtime story across all four placement rails (input, output, retrieval, tool-call), LLM guardrails for safeguarding AI is the parallel chapter to this post’s eval focus.
Common mistakes that break moderation eval
- Single F1 on a mixed set. Score precision on benign and recall on adversarial, per category. The number you optimize against is the number you fail against in production.
- Single global threshold. Hate, political content, and humor do not share a cost structure. English and Hindi do not share a precision-recall surface. Calibrate per-(category, language).
- No benign set. Adversarial is easy from public corpora. Benign takes manual verification of real customer traffic. Teams that skip it ship over-blocking at scale and find out from the churn dashboard.
- No annotator-agreement floor. Precision and recall on labels with Cohen’s kappa of 0.5 are not measurements; they are wishful thinking. Compute kappa per category before you compute anything else.
- No drift loop. A monthly recalibration is the floor. Threshold inertia is what ships harm in May after the model was great in February.
- No audit trail. SOC 2 asks what blocked the output and why. The answer lives in the trace tree, not a screenshot.
For the precision-recall-F1 math, custom LLM eval metrics best practices goes deeper; the 2026 LLM evaluation playbook is the umbrella reference.
The bar to ship in 2026
Moderation eval is one of the cases where the difference between a notebook and a production system is the discipline around the four numbers. Per category, per language, every month. Two test sets you maintain yourself. A threshold that is a policy decision, not a tuning knob. A loop that catches drift before reviewers do.
The Future AGI eval stack ships the pieces: Protect at 65 ms inline, the open-weight ensemble and calibrator in ai-evaluation, the gateway and audit trail in Agent Command Center, and the HDBSCAN-plus-Sonnet drift loop in Error Feed. The 2x2 per category is the methodology that keeps both failure modes visible at the same time.
Frequently asked questions
Why is a single F1 the wrong score for a content moderation system?
What is the 2x2 matrix per category methodology?
How big do the adversarial and benign test sets need to be per category?
How do I tune per-category thresholds in production?
How often does a content moderation system need recalibration?
Where does Future AGI Protect fit versus the open-weight classifiers?
How do Future AGI's eval stack and Agent Command Center close the moderation drift loop?
The definitive 2026 reference: three generation patterns (persona, taxonomy-stratified, evolution), the filter that survives, calibration against real, and three use cases.
Deterministic vs LLM-judge isn't a pick. It's a cascade. Where each wins, where each breaks, and the layering that drops eval cost 95% in production.
Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.