What Is a Politeness Metric?
A politeness metric scores whether an LLM response is respectful, considerate, and appropriate for the user, task, and conversation context.
A politeness metric is an LLM-evaluation metric that scores whether a model response stays respectful, considerate, and professionally appropriate for the user and task. It shows up in eval pipelines, support-agent traces, and regression suites when teams need to catch rude phrasing, dismissive refusals, over-apologies, or culturally insensitive tone before they reach users. FutureAGI maps this surface to the IsPolite evaluator, often paired with Tone, Toxicity, and user-escalation signals.
Why Politeness Metrics Matter in Production LLM and Agent Systems
Politeness failures create product risk even when the answer is factually correct. A support agent can give the right refund policy while sounding dismissive. A medical intake assistant can ask a required follow-up in a way that feels accusatory. A coding agent can reject a request with phrasing that makes the user think the system is broken or hostile. The named failure modes are tone drift, rude refusal, and brand-risk escalation: the system completes the turn, but the user loses trust.
The pain shows up across teams. Developers see traces with successful model calls and low latency, yet the final response causes complaints. SREs see longer sessions, repeated clarification turns, and more handoffs to humans without an obvious outage. Compliance and support leaders see escalation notes such as “agent was rude” or “would not listen,” which are hard to debug after the fact if the eval suite only checks correctness.
Politeness is especially relevant for 2026-era agentic systems because tone can degrade across planning, retrieval, tool errors, and handoffs. One tool timeout can trigger an impatient refusal; one retrieval miss can produce a condescending “as I already said” response. Unlike generic sentiment analysis, a politeness metric is task-relative: a firm fraud warning can be negative in sentiment and still be polite. The metric keeps product teams from confusing “friendly” with “safe to ship.”
How FutureAGI Handles Politeness Metrics
FutureAGI’s approach is to treat politeness as a measurable eval surface from /platform/evaluate, not a brand-style note buried in the system prompt. The anchor for this glossary entry is eval:IsPolite, which maps to the IsPolite cloud-template evaluator named is_polite in the FutureAGI inventory. Teams use it on dataset rows, sampled production responses, and regression suites where respectfulness is part of the release gate.
A practical workflow starts with a customer-support or healthcare-assistant dataset: user message, assistant response, task category, locale, and optional reviewer label. The engineer attaches IsPolite as an evaluation, runs it beside Tone and Toxicity, then groups the results by prompt version, model (Claude Opus 4.7, GPT-5.x), route, and customer segment. If a new refusal prompt increases the politeness-fail-rate for billing disputes, the deploy is held, failed rows move to an annotation queue, and the prompt owner rewrites the refusal pattern before rerunning the regression eval.
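The grouping and release-gate step in this workflow can be sketched in plain Python. This is a minimal illustration, not FutureAGI's implementation: the row fields (`prompt_version`, `task`, `passed`), the helper names, and the 5-point `max_increase` tolerance are all assumptions made for the example, and the rows stand in for results that an evaluator has already scored and thresholded.

```python
from collections import defaultdict

# Hypothetical evaluation rows. `passed` is the thresholded politeness result;
# the field names are illustrative, not a FutureAGI schema.
rows = [
    {"prompt_version": "v1", "task": "billing_dispute", "passed": True},
    {"prompt_version": "v1", "task": "billing_dispute", "passed": True},
    {"prompt_version": "v1", "task": "billing_dispute", "passed": True},
    {"prompt_version": "v1", "task": "billing_dispute", "passed": False},
    {"prompt_version": "v2", "task": "billing_dispute", "passed": True},
    {"prompt_version": "v2", "task": "billing_dispute", "passed": False},
    {"prompt_version": "v2", "task": "billing_dispute", "passed": False},
    {"prompt_version": "v2", "task": "billing_dispute", "passed": True},
]


def fail_rate_by(rows, *keys):
    """Return the politeness fail rate for each combination of `keys`."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        if not row["passed"]:
            fails[group] += 1
    return {group: fails[group] / totals[group] for group in totals}


def should_hold_deploy(rates, baseline, candidate, task, max_increase=0.05):
    """Hold the deploy if the candidate's fail rate rose by more than `max_increase`."""
    return rates[(candidate, task)] - rates[(baseline, task)] > max_increase


rates = fail_rate_by(rows, "prompt_version", "task")
hold = should_hold_deploy(rates, "v1", "v2", "billing_dispute")

# Failed rows from the candidate prompt are what would go to an annotation queue.
failed_rows = [r for r in rows if r["prompt_version"] == "v2" and not r["passed"]]

print(rates)
print("hold deploy:", hold, "| rows to annotate:", len(failed_rows))
```

Comparing against the previous prompt version, rather than against a fixed number, keeps the gate meaningful for task categories whose baseline fail rate is naturally higher.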
In production, the same check can run on traces captured through traceAI-langchain or traceAI-openai. When a rude answer appears after a tool error, the trace can point to the route, prompt version, and agent.trajectory.step that preceded the language shift. FutureAGI pairs that with user feedback so the metric does not become a pure style score. A response that is polite but unhelpful still needs IsHelpful; a polite unsafe answer still needs ContentSafety or IsCompliant. Unlike Galileo’s prompt-only tone label, this scoring sits on full multi-turn traces.
How to Measure or Detect a Politeness Metric
Useful signals combine evaluator output, trace context, and user feedback:
- IsPolite: evaluates whether the response meets politeness criteria and returns a scoring result that can be thresholded for release checks.
- Tone: helps distinguish politeness from warmth, formality, empathy, or brand voice.
- Toxicity and ContentSafety: catch harmful language that a politeness-only check should not be asked to own.
- Trace fields: group failures by prompt version, model, route, tool error, locale, and agent.trajectory.step.
- Dashboard signals: track politeness-fail-rate-by-cohort, thumbs-down rate, human escalation rate, and complaint tags.
Minimal Python pattern:
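from fi.evals import IsPolite
user_message = "Why was my refund denied?"  # example input; replace with a real user turn
assistant_reply = "I'm sorry for the trouble. The refund window has closed, but I can offer store credit."  # example response
metric = IsPolite()
result = metric.evaluate(input=user_message, response=assistant_reply)
print(result.score)  # compare against a reviewed threshold in release checks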
Set thresholds from reviewed examples, not from a single global average. For multilingual or regulated workflows, sample failures into human annotation so the metric reflects local expectations without rewarding evasive or overly apologetic answers. Review borderline failures weekly. They are where teams find prompt instructions that sound acceptable in English but become abrupt after translation, truncation, or tool-error handling.
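One way to derive a threshold from reviewed examples is to pick the cutoff that best agrees with human labels. The sketch below assumes the evaluator yields a numeric score where higher means more polite, which the source does not confirm; the `reviewed` pairs are made-up data, and the tie-break toward the stricter cutoff is a design choice for this example only.

```python
# Hypothetical reviewed examples: (evaluator score, human label "is polite").
reviewed = [
    (0.95, True),
    (0.90, True),
    (0.80, True),
    (0.75, False),
    (0.70, True),
    (0.50, False),
    (0.30, False),
    (0.20, False),
]


def agreement(examples, threshold):
    """Fraction of examples where `score >= threshold` matches the human label."""
    hits = sum((score >= threshold) == is_polite for score, is_polite in examples)
    return hits / len(examples)


def pick_threshold(examples):
    """Choose the observed score that maximizes agreement with reviewers.

    Ties are broken toward the higher (stricter) cutoff, an assumption
    made here so that ambiguous cases fail rather than pass.
    """
    candidates = sorted({score for score, _ in examples})
    return max(candidates, key=lambda t: (agreement(examples, t), t))


threshold = pick_threshold(reviewed)
print(threshold, agreement(reviewed, threshold))
```

Running the same selection per workflow or locale, instead of once over all rows, is what keeps a single global average from hiding a cohort where the cutoff is wrong.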
| Metric | What it scores | Typical failure | FAGI evaluator |
|---|---|---|---|
| Politeness | Respect, civility | Dismissive refusal | IsPolite |
| Tone | Warmth, formality, brand voice | Cold but polite | Tone |
| Helpfulness | Resolves user need | Polite but unhelpful | IsHelpful |
| Toxicity | Harmful language | Direct insult | Toxicity |
| Content safety | Policy compliance | Polite-but-unsafe | ContentSafety |
| Sentiment | Affective polarity | Negative ≠ rude | sentiment scorers |
For external calibration of tone-style evaluators, HHH (Helpful-Honest-Harmless preference dataset; the original RLHF training signal for Claude-style helpfulness) and BeaverTails (~333K labeled QA pairs across 14 harm categories) are the standard 2026 references. Frontier-model politeness violations on adversarial customer-support prompts cluster in the 3-9% range pre-guardrail; light judge calibration brings that to <1% on locale-matched gold sets, which is the realistic ceiling when tuning the IsPolite threshold.
Calibrating politeness across locales
Politeness is one of the few LLM eval signals with hard cultural variance. A response that reads as polite in US business English can read as cold in Indian customer service or overly familiar in Japanese formal contexts. A flat IsPolite threshold tuned on English support tickets is likely to misclassify responses in other locales.
The 2026 calibration pattern we use has three steps. First, label per locale: a politeness gold set with at least 100 examples per locale, reviewed by native-speaker annotators with the same rubric. Second, threshold per locale: the IsPolite cutoff for shipping is set on locale-specific human agreement, not a global average. Third, route per locale: when the gateway routes by user locale or language, the post-guardrail loads the locale-specific threshold rather than the global one.
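The three steps can be sketched as a per-locale lookup that a post-response check consults. Everything named here is hypothetical: the `LocaleCalibration` record, the threshold values, and the locale codes are illustrative, and sending uncalibrated locales to review (rather than falling back to a global cutoff) is one possible reading of the pattern, not a documented behavior.

```python
from dataclasses import dataclass

MIN_GOLD_EXAMPLES = 100  # from the text: at least 100 reviewed examples per locale


@dataclass(frozen=True)
class LocaleCalibration:
    threshold: float    # cutoff set from locale-specific human agreement
    gold_set_size: int  # number of native-speaker-reviewed examples


# Illustrative values only; real thresholds come from each locale's gold set.
CALIBRATIONS = {
    "en-US": LocaleCalibration(threshold=0.70, gold_set_size=250),
    "ja-JP": LocaleCalibration(threshold=0.82, gold_set_size=120),
    "hi-IN": LocaleCalibration(threshold=0.78, gold_set_size=40),  # gold set too small
}


def threshold_for(locale):
    """Return the locale's cutoff, or None if the locale is not yet calibrated."""
    calibration = CALIBRATIONS.get(locale)
    if calibration is None or calibration.gold_set_size < MIN_GOLD_EXAMPLES:
        return None
    return calibration.threshold


def gate(score, locale):
    """Classify a politeness score using the locale-specific threshold."""
    threshold = threshold_for(locale)
    if threshold is None:
        return "needs_review"  # no trusted cutoff: route to human annotation
    return "pass" if score >= threshold else "fail"


print(gate(0.75, "en-US"), gate(0.75, "ja-JP"), gate(0.90, "hi-IN"))
```

The same score of 0.75 passes in one locale and fails in another, which is the behavior a flat threshold cannot express.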
The judge model also matters. GPT-5.x as judge tends to over-reward verbose courtesy; Claude Opus 4.7 as judge is closer to neutral but still has documented bias toward Western etiquette. We rotate judge models on the politeness gold set every quarter and adjust thresholds when judge disagreement exceeds 8%. The bias does not disappear, but it stops compounding silently. Unlike Galileo’s prompt-only tone label, this judge-rotation pattern catches the cases where the judge itself is drifting on culturally loaded phrasing.
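The disagreement check can be sketched as a comparison of two judges' pass/fail verdicts over the same gold set. The verdict lists below are made-up data and the function name is an assumption; only the 8% limit comes from the text.

```python
MAX_DISAGREEMENT = 0.08  # from the text: adjust thresholds above 8% disagreement


def disagreement_rate(verdicts_a, verdicts_b):
    """Fraction of gold-set items on which two judges give different verdicts."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("both judges must score the same gold set")
    if not verdicts_a:
        raise ValueError("gold set is empty")
    differing = sum(a != b for a, b in zip(verdicts_a, verdicts_b))
    return differing / len(verdicts_a)


# Hypothetical verdicts from two judge models on the same 25 gold-set items.
judge_a = [True] * 15 + [False] * 10
judge_b = [True] * 12 + [False] * 13

rate = disagreement_rate(judge_a, judge_b)
needs_recalibration = rate > MAX_DISAGREEMENT

print(f"disagreement={rate:.2f}, recalibrate={needs_recalibration}")
```

Keeping the items where the judges differ, not just the rate, gives annotators a short list of culturally loaded phrasings to review first.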
Finally, track refusal-tone separately. A polite refusal is different from a polite answer; both should pass the metric, but the release gate often needs to know which path was taken.
Common Mistakes
These mistakes make politeness look green while users still feel ignored:
- Treating politeness as sentiment. A negative but respectful refusal should pass; a cheerful answer can still be condescending.
- Measuring only English. Politeness markers shift by locale; multilingual support needs reviewed examples and CulturalSensitivity checks.
- Penalizing every apology. Some regulated workflows require a brief apology; score repeated apologies separately from respectfulness.
- Hiding policy failures. Politeness cannot make an unsafe or noncompliant answer acceptable; pair it with ContentSafety and IsCompliant.
- Using one threshold for every channel. Sales outreach, clinical triage, and developer support need separate baselines.
Review mistakes by workflow, not only by model. The same model may be polite in general chat and rude after a refund denial, timeout, or failed tool call.
Frequently Asked Questions
What is a politeness metric?
A politeness metric scores whether an LLM response is respectful, considerate, and professionally appropriate for the user. FutureAGI evaluates this with IsPolite across datasets, traces, and regression suites.
How is a politeness metric different from tone evaluation?
A politeness metric focuses on respectfulness and social appropriateness. Tone evaluation is broader; it can score warmth, formality, empathy, confidence, or brand voice even when the response is already polite.
How do you measure a politeness metric?
Use FutureAGI's IsPolite evaluator on model responses, then segment results by prompt version, model, route, and user cohort. Pair it with Tone, Toxicity, and escalation-rate signals.