What Is a Tone Metric?
An LLM-evaluation metric that classifies or scores the emotional and stylistic register of a model output — polite, empathetic, professional, off-brand.
A tone metric is an LLM-evaluation signal that classifies or scores the emotional and stylistic register of a model output: polite, professional, empathetic, casual, dismissive, or off-brand. It is typically implemented as a judge-model evaluator: an LLM reads the output, grades it against a rubric, and returns a category or a 0–1 score. Tone metrics live alongside content metrics like Faithfulness and TaskCompletion because correct content delivered in the wrong register is still a brand problem. FutureAGI exposes tone via fi.evals’ Tone, IsPolite, ClinicallyInappropriateTone, and IsInformalTone evaluators.
Why It Matters in Production LLM and Agent Systems
Brand voice is the slowest-failing layer of any LLM deployment — and the loudest when it does fail. A correct refund policy delivered in a snarky register turns a satisfied customer into a churn case. A medically accurate response delivered with casual slang invites compliance escalation. An empathetic apology delivered to a user who wanted a fast factual answer reads as condescending. Tone matters even when content is right.
Application engineers feel this when prompt edits to improve task accuracy silently drift the agent’s register. CX leads feel it when CSAT drops without any change in resolution rate — the resolution itself sounded different. Compliance leads feel it during regulated-industry audits where “professional and clinically appropriate tone” is a standing requirement. Brand teams feel it when the agent starts using emojis on a B2B product or apologizing reflexively in cases where the policy is the policy.
For 2026 voice and agent stacks the tone surface multiplies. Voice TTS adds prosody, pacing, and emotional shading that pure text tone metrics miss — IsPolite on a transcript can pass while AudioQualityEvaluator flags a flat, robotic delivery. Multi-step agents let tone drift across the trajectory: the planner sounds polite, the tool-result formatter sounds curt, the final-answer wrapper sounds rushed. Tone has to be measured at every step, not just at the final response.
How FutureAGI Handles Tone Metrics
FutureAGI’s approach is to ship tone as multiple targeted evaluators rather than a single “is the tone good” judgement, because what counts as on-tone differs by industry. The fi.evals package exposes Tone for general register classification (polite, neutral, harsh, frustrated, etc.), IsPolite for politeness scoring, IsInformalTone for register policing, ClinicallyInappropriateTone for healthcare contexts, and CulturalSensitivity for cross-cultural brand voice. Each evaluator returns a score plus a reason string explaining the judgement.
A real workflow: a fintech support team ships an agent with a “professional, empathetic, plain-language” tone rubric. They build a Dataset of 800 synthetic and production-sampled support cases, attach Tone, IsPolite, and IsInformalTone evaluators, and run a regression eval on every prompt change. When a prompt edit improves TaskCompletion by 4 points but drops IsPolite from 0.92 to 0.81, the engineer sees the trade-off and either reverts the edit or adds a polite-register reinforcement to the system prompt. In production, the same evaluators run on a 5% trace sample via traceAI, and eval-fail-rate-by-cohort surfaces tone drift by user segment.
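The regression check in that workflow can be sketched in plain Python. This is a minimal illustration, not FutureAGI's implementation: the function name `find_regressions`, the `max_drop` tolerance, and the sample scores are all hypothetical, and the per-case scores are assumed to have been collected from the evaluators beforehand.

```python
from statistics import mean

def find_regressions(baseline, candidate, max_drop=0.05):
    """Return metrics whose mean score fell by more than max_drop.

    baseline / candidate: dict mapping metric name -> list of 0-1 scores.
    max_drop is a hypothetical tolerance; tune it per domain.
    """
    regressions = {}
    for metric, base_scores in baseline.items():
        if metric not in candidate:
            continue  # metric was not run on the candidate prompt
        base_avg = mean(base_scores)
        cand_avg = mean(candidate[metric])
        if base_avg - cand_avg > max_drop:
            regressions[metric] = (round(base_avg, 3), round(cand_avg, 3))
    return regressions

# Made-up scores mirroring the example: TaskCompletion improves, IsPolite drops.
baseline = {"IsPolite": [0.92, 0.94, 0.90], "TaskCompletion": [0.70, 0.72, 0.74]}
candidate = {"IsPolite": [0.81, 0.83, 0.79], "TaskCompletion": [0.74, 0.76, 0.78]}
print(find_regressions(baseline, candidate))
```

Only the metric that dropped past the tolerance is reported, so an improvement in one metric cannot hide a tone regression in another.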
For voice agents, tone evaluators run on the transcript while AudioQualityEvaluator runs on the audio — both are needed because polite words delivered with bad prosody still feel rude. Unlike a generic sentiment classifier that scores polarity, FutureAGI’s tone family is rubric-graded per domain.
How to Measure or Detect It
Tone is multi-dimensional; pick the evaluators that match your contract:
- Tone: classifies the register of an output across categories (polite, neutral, frustrated, etc.) with a score and reason.
- IsPolite: scalar 0–1 politeness judgement; useful as a regression anchor across releases.
- IsInformalTone: catches casual register on outputs that should be formal (legal, medical, finance).
- ClinicallyInappropriateTone: domain-specific evaluator for healthcare and clinical contexts.
- CulturalSensitivity: catches register issues that are tone-neutral in one locale and rude in another.
- Eval-fail-rate-by-cohort: percentage of outputs failing tone evaluators per user segment, channel, or persona; the canonical regression alarm.
Minimal Python:
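from fi.evals import Tone, IsPolite

# Illustrative inputs; replace with a real user message and agent reply.
user_msg = "My refund still hasn't arrived."
agent_reply = "I'm sorry for the delay. I've escalated your refund and will update you today."
tone = Tone()
polite = IsPolite()

# Score the same reply with both evaluators and print the two scores.
tone_result = tone.evaluate(input=user_msg, output=agent_reply)
polite_result = polite.evaluate(input=user_msg, output=agent_reply)
print(tone_result.score, polite_result.score)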
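The eval-fail-rate-by-cohort signal listed above can be computed from stored scores with a few lines of standard-library Python. This is a sketch under assumptions: the `(cohort, score)` record shape, the function name `fail_rate_by_cohort`, and the 0.8 pass threshold are hypothetical and not part of fi.evals.

```python
from collections import defaultdict

def fail_rate_by_cohort(records, threshold=0.8):
    """Return cohort -> percentage of outputs scoring below threshold.

    records: iterable of (cohort, score) pairs, where score is a 0-1
    tone score. The threshold is a hypothetical pass mark.
    """
    totals = defaultdict(int)
    fails = defaultdict(int)
    for cohort, score in records:
        totals[cohort] += 1
        if score < threshold:
            fails[cohort] += 1
    return {cohort: 100.0 * fails[cohort] / totals[cohort] for cohort in totals}

# Made-up records: one enterprise reply falls below the pass mark.
records = [
    ("enterprise", 0.93),
    ("enterprise", 0.71),
    ("free_tier", 0.88),
    ("free_tier", 0.90),
]
rates = fail_rate_by_cohort(records)
print(rates)
```

Grouping by segment before averaging matters because a regression confined to one cohort can disappear in a global mean.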
Common Mistakes
- Treating sentiment analysis as a tone metric. Sentiment scores the user; tone scores the model. Conflating them produces nonsense alerts when a user is frustrated and the agent is appropriately apologetic.
- One judge for all tones. A single LLM judge with a vague “is the tone good” prompt drifts; use targeted evaluators with explicit rubrics.
- Ignoring multi-step tone drift. Tone can be polite at step 1 and curt at step 5; evaluate per step, not just per final answer.
- Skipping voice prosody. Text tone metrics pass while users hear flat or rude delivery; pair tone evaluators with audio quality scores for voice agents.
- No baseline cohort. Tone scores need a per-domain baseline; “0.85 IsPolite” means nothing without a healthy-state reference.
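Two of the mistakes above, multi-step drift and a missing baseline, can be guarded against together by scoring every step of a trajectory against a healthy-state reference. The sketch below is illustrative only: `first_tone_drop`, the `tolerance` value, and the keyword-based `stub_politeness` scorer are hypothetical stand-ins, and in practice the scorer would wrap a real tone evaluator.

```python
def first_tone_drop(step_outputs, score_fn, baseline, tolerance=0.1):
    """Return (index, score) of the first step scoring below
    baseline - tolerance, or None if every step stays in range.

    score_fn: callable mapping text -> 0-1 tone score (caller supplied).
    baseline: healthy-state reference score for this domain.
    """
    for index, text in enumerate(step_outputs):
        score = score_fn(text)
        if score < baseline - tolerance:
            return index, score
    return None

def stub_politeness(text):
    # Placeholder scorer for the demo only; NOT a real tone model.
    lowered = text.lower()
    return 0.9 if ("please" in lowered or "thanks" in lowered) else 0.5

steps = [
    "Thanks for reaching out, let me check.",  # planner
    "Order not found.",                        # tool-result formatter
    "Please try again later.",                 # final answer
]
drop = first_tone_drop(steps, stub_politeness, baseline=0.85)
print(drop)
```

Returning the first offending step, rather than a trajectory average, points directly at the component whose register needs fixing.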
Frequently Asked Questions
What is a tone metric?
A tone metric is an LLM-evaluation signal that classifies or scores the emotional and stylistic register of a model output — polite, professional, empathetic, casual, or off-brand. It is typically graded by a judge model against a rubric.
How is a tone metric different from sentiment analysis?
Sentiment analysis classifies the emotion of user input (positive, neutral, negative). A tone metric scores the register of the model's output, that is, whether the agent's reply is polite, professional, or on-brand, independent of user sentiment.
How do you measure tone in an LLM application?
FutureAGI exposes the Tone evaluator for general register classification, IsPolite for politeness scoring, and domain-specific evaluators like ClinicallyInappropriateTone or IsInformalTone for regulated use cases.