How is a politeness metric different from tone evaluation?

A politeness metric focuses on respectfulness and social appropriateness. Tone evaluation is broader; it can score warmth, formality, empathy, confidence, or brand voice even when the response is already polite.

How do you measure a politeness metric?

Use FutureAGI's IsPolite evaluator on model responses, then segment results by prompt version, model, route, and user cohort. Pair it with Tone, Toxicity, and escalation-rate signals.

What Is a Politeness Metric? Definition & FutureAGI Guide (2026)

Q: What is a politeness metric?

A politeness metric scores whether an LLM response is respectful, considerate, and professionally appropriate for the user. FutureAGI evaluates this with IsPolite across datasets, traces, and regression suites.

What Is a Politeness Metric?

A politeness metric is an LLM-evaluation metric that scores whether a model response stays respectful, considerate, and professionally appropriate for the user and task. It shows up in eval pipelines, support-agent traces, and regression suites when teams need to catch rude phrasing, dismissive refusals, over-apologies, or culturally insensitive tone before they reach users. FutureAGI maps this surface to the IsPolite evaluator, often paired with Tone, Toxicity, and user-escalation signals.

Why Politeness Metrics Matter in Production LLM and Agent Systems

Politeness failures create product risk even when the answer is factually correct. A support agent can give the right refund policy while sounding dismissive. A medical intake assistant can ask a required follow-up in a way that feels accusatory. A coding agent can reject a request with phrasing that makes the user think the system is broken or hostile. The common failure modes are tone drift, rude refusal, and brand-risk escalation: in each, the system completes the turn but the user loses trust.

The pain shows up across teams. Developers see traces with successful model calls and low latency, yet the final response causes complaints. SREs see longer sessions, repeated clarification turns, and more handoffs to humans without an obvious outage. Compliance and support leaders see escalation notes such as “agent was rude” or “would not listen,” which are hard to debug after the fact if the eval suite only checks correctness.

Politeness is especially relevant for 2026-era agentic systems because tone can degrade across planning, retrieval, tool errors, and handoffs. One tool timeout can trigger an impatient refusal; one retrieval miss can produce a condescending “as I already said” response. Unlike generic sentiment analysis, a politeness metric is task-relative: a firm fraud warning can be negative in sentiment and still be polite. The metric keeps product teams from confusing “friendly” with “safe to ship.”

How FutureAGI Handles Politeness Metrics

FutureAGI’s approach is to treat politeness as a measurable eval surface, not a brand-style note buried in the system prompt. The anchor for this glossary entry is eval:IsPolite, which maps to the IsPolite cloud-template evaluator named is_polite in the FutureAGI inventory. Teams use it on dataset rows, sampled production responses, and regression suites where respectfulness is part of the release gate.

A practical workflow starts with a customer-support or healthcare-assistant dataset: user message, assistant response, task category, locale, and optional reviewer label. The engineer attaches IsPolite as an evaluation, runs it beside Tone and Toxicity, then groups the results by prompt version, model, route, and customer segment. If a new refusal prompt increases the politeness-fail-rate for billing disputes, the deploy is held, failed rows move to an annotation queue, and the prompt owner rewrites the refusal pattern before rerunning the regression eval.
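The grouping and release-gate step in this workflow can be sketched in plain Python. This is a minimal illustration, not FutureAGI's implementation: the row fields, the boolean `polite` verdict, the helper names `fail_rate_by` and `should_hold_deploy`, and the 0.05 tolerance are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical evaluated rows: grouping fields plus a boolean `polite`
# verdict that an evaluator has already produced for each response.
rows = [
    {"prompt_version": "v1", "task_category": "billing_dispute", "polite": True},
    {"prompt_version": "v1", "task_category": "billing_dispute", "polite": True},
    {"prompt_version": "v1", "task_category": "billing_dispute", "polite": True},
    {"prompt_version": "v1", "task_category": "billing_dispute", "polite": False},
    {"prompt_version": "v2", "task_category": "billing_dispute", "polite": True},
    {"prompt_version": "v2", "task_category": "billing_dispute", "polite": False},
    {"prompt_version": "v2", "task_category": "billing_dispute", "polite": False},
    {"prompt_version": "v2", "task_category": "billing_dispute", "polite": True},
]


def fail_rate_by(rows, keys):
    """Return {group_tuple: fraction of rows in the group where polite is False}."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        if not row["polite"]:
            fails[group] += 1
    return {group: fails[group] / totals[group] for group in totals}


def should_hold_deploy(rates, baseline, candidate, category, max_increase=0.05):
    """Hold when the candidate's fail rate exceeds the baseline's by more than max_increase."""
    delta = rates[(candidate, category)] - rates[(baseline, category)]
    return delta > max_increase


rates = fail_rate_by(rows, ["prompt_version", "task_category"])
hold = should_hold_deploy(rates, "v1", "v2", "billing_dispute")
print(rates, hold)
```

Comparing the candidate against a baseline within the same task category, rather than against a global average, keeps a regression in billing disputes from being hidden by healthy results elsewhere.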

In production, the same check can run on traces captured through traceAI-langchain or traceAI-openai. When a rude answer appears after a tool error, the trace can point to the route, prompt version, and agent.trajectory.step that preceded the language shift. FutureAGI pairs that with user feedback so the metric does not become a pure style score. A response that is polite but unhelpful still needs IsHelpful; a polite unsafe answer still needs ContentSafety or IsCompliant.
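One way to picture the trace lookup described above is a small helper that walks back from the impolite response to the most recent failed step. The span dictionaries below are a hypothetical, simplified shape chosen for illustration. They are not the traceAI schema, and `last_error_before` is not a FutureAGI function.

```python
# Hypothetical, simplified spans for one agent run, ordered by step index.
spans = [
    {"step": 1, "kind": "llm", "name": "plan", "status": "ok"},
    {"step": 2, "kind": "tool", "name": "lookup_order", "status": "error"},
    {"step": 3, "kind": "tool", "name": "lookup_order", "status": "error"},
    {"step": 4, "kind": "llm", "name": "final_response", "status": "ok"},
]


def last_error_before(spans, failing_step):
    """Return the most recent errored span that precedes failing_step, or None."""
    prior = [s for s in spans if s["step"] < failing_step and s["status"] == "error"]
    return max(prior, key=lambda s: s["step"]) if prior else None


# Step 4 is the response that failed the politeness check.
culprit = last_error_before(spans, failing_step=4)
print(culprit)
```

If the same tool or route keeps appearing as the preceding error across many failed responses, the error-handling prompt for that path is a reasonable first place to look.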

How to Measure or Detect a Politeness Metric

Useful signals combine evaluator output, trace context, and user feedback:

IsPolite - evaluates whether the response meets politeness criteria and returns a scoring result that can be thresholded for release checks.
Tone - helps distinguish politeness from warmth, formality, empathy, or brand voice.
Toxicity and ContentSafety - catch harmful language that a politeness-only check should not be asked to own.
Trace fields - group failures by prompt version, model, route, tool error, locale, and agent.trajectory.step.
Dashboard signals - track politeness-fail-rate-by-cohort, thumbs-down rate, human escalation rate, and complaint tags.
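To check that evaluator output and user feedback point the same way, a team might compare escalation rates for responses judged polite against those judged impolite. The sketch below assumes each row already carries a boolean `polite` verdict and a boolean `escalated` flag. Both field names and the sample data are hypothetical.

```python
# Hypothetical rows joining an evaluator verdict with a human-escalation flag.
rows = [
    {"polite": True, "escalated": False},
    {"polite": True, "escalated": False},
    {"polite": True, "escalated": False},
    {"polite": True, "escalated": True},
    {"polite": False, "escalated": True},
    {"polite": False, "escalated": True},
    {"polite": False, "escalated": True},
    {"polite": False, "escalated": False},
]


def escalation_rate(rows, polite):
    """Escalation rate among rows with the given politeness verdict, or None if empty."""
    subset = [r for r in rows if r["polite"] is polite]
    if not subset:
        return None
    return sum(r["escalated"] for r in subset) / len(subset)


rate_polite = escalation_rate(rows, polite=True)
rate_impolite = escalation_rate(rows, polite=False)
print(rate_polite, rate_impolite)
```

A large gap between the two rates suggests the metric tracks something users feel. A negligible gap is a prompt to review the evaluator criteria or the threshold.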

Minimal Python pattern:

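from fi.evals import IsPolite

# Example inputs; in practice these come from a dataset row or a sampled trace.
user_message = "Why was my refund denied?"
assistant_reply = "I understand the frustration. The order is outside the refund window."

metric = IsPolite()  # politeness evaluator
result = metric.evaluate(input=user_message, response=assistant_reply)
print(result.score)  # scoring result to compare against a reviewed threshold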

Set thresholds from reviewed examples, not from a single global average. For multilingual or regulated workflows, route a sample of failures to human annotation so the metric reflects local expectations without rewarding evasive or overly apologetic answers. Review borderline failures weekly; they are where teams find prompt instructions that sound acceptable in English but become abrupt after translation, truncation, or tool-error handling.
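Choosing a threshold from reviewed examples can be as simple as trying each observed score as a cutoff and keeping the one that agrees most often with reviewers. The sketch below assumes the evaluator yields a numeric score between 0 and 1, which the source does not specify. The sample data and the helper names `agreement` and `pick_threshold` are hypothetical.

```python
# Hypothetical reviewed examples: (evaluator score, reviewer judged it polite?).
reviewed = [
    (0.95, True), (0.80, True), (0.70, True), (0.65, False),
    (0.60, True), (0.40, False), (0.30, False), (0.10, False),
]


def agreement(threshold, examples):
    """Fraction of examples where (score >= threshold) matches the reviewer label."""
    hits = sum((score >= threshold) == label for score, label in examples)
    return hits / len(examples)


def pick_threshold(examples):
    """Try each observed score as a cutoff; keep the lowest one with the best agreement."""
    candidates = sorted({score for score, _ in examples})
    return max(candidates, key=lambda t: agreement(t, examples))


threshold = pick_threshold(reviewed)
print(threshold, agreement(threshold, reviewed))
```

Plain agreement treats a missed rude response and a falsely flagged polite one as equally costly. Teams that care more about one error type may prefer to weight them differently or to inspect precision and recall separately.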

Common Mistakes

These mistakes make politeness look green while users still feel ignored:

Treating politeness as sentiment. A negative but respectful refusal should pass; a cheerful answer can still be condescending.
Measuring only English. Politeness markers shift by locale; multilingual support needs reviewed examples and CulturalSensitivity checks.
Penalizing every apology. Some regulated workflows require a brief apology; score repeated apologies separately from respectfulness.
Hiding policy failures. Politeness cannot make an unsafe or noncompliant answer acceptable; pair it with ContentSafety and IsCompliant.
Using one threshold for every channel. Sales outreach, clinical triage, and developer support need separate baselines.
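Separate baselines per channel can be kept in a small lookup table. This is a minimal sketch: the cutoff values are placeholders, not recommendations, and `CHANNEL_THRESHOLDS` and `passes` are hypothetical names. It also assumes a numeric score where higher means more polite.

```python
# Hypothetical per-channel cutoffs; real values should come from reviewed examples.
CHANNEL_THRESHOLDS = {
    "sales_outreach": 0.70,
    "clinical_triage": 0.85,
    "developer_support": 0.60,
}


def passes(channel, score, thresholds=CHANNEL_THRESHOLDS):
    """Compare a score to its channel's cutoff.

    Unknown channels raise instead of silently falling back to a global default.
    """
    if channel not in thresholds:
        raise KeyError(f"no politeness baseline for channel {channel!r}")
    return score >= thresholds[channel]


print(passes("developer_support", 0.65))
print(passes("clinical_triage", 0.65))
```

Raising on an unknown channel is a deliberate choice in this sketch: a new channel should get its own reviewed baseline before it inherits someone else's.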

Review mistakes by workflow, not only by model. The same model may be polite in general chat and rude after a refund denial, timeout, or failed tool call.