
Evaluating LLM Personas, Style, and Persona Drift Across Turns

Persona eval is two problems: per-turn tone and cross-turn drift. The 2026 playbook for tone rubrics plus a persona-stability score across turns.


A companion agent scores 4.6 out of 5 on the brand voice rubric over the first 200 turns of launch week. The product team ships. Two weeks later a Discord screenshot goes viral: at turn 27 of a long conversation, the agent has dropped contractions, picked up corporate hedging, and opened with “I sincerely apologise for any inconvenience.” Each individual turn was fine. The conversation, taken as a sequence, walked the persona off a cliff.

The opinion this post earns: persona eval is two problems, not one. Per-turn style adherence catches the obvious off-brand reply. A persona-drift score across the trajectory catches the slow collapse into base-model default, the failure that ships when every turn passes the per-turn floor. Most teams check turn-level tone and call it done. The persona then erodes silently across a 30-turn conversation, and the failure shows up as a screenshot, not a metric.

This guide is the working pattern for evaluating LLM persona, style, and voice consistency in 2026: how to design a tone rubric, how to score persona drift across the full trajectory using embedding distance plus a stability rubric, how to calibrate judges against human raters, and how the same rubrics attach to live session spans through traceAI and the ai-evaluation SDK. For the broader chatbot trajectory pattern, see the multi-turn conversations deep dive.

TL;DR: two units, one workflow

Unit | What it scores | Failure it catches | Primary rubric
Per-turn tone | One response against the voice description | Generic helpful-assistant default | Tone (eval_id=16) + CustomLLMJudge
Per-turn refusal tone | Refusal warmth and alternative path | Cold “I cannot help with that” | AnswerRefusal + CustomLLMJudge
Trajectory persona stability | Voice variance across the full transcript | Agent relaxes into base-model default by turn 8 | CustomerAgentPromptConformance + embedding drift
Trajectory persona-aware refusal | Refusals stay in voice across the session | Refusals get colder as conversation lengthens | CustomLLMJudge over transcript
Cross-language style transfer | Persona survives in Spanish, German, Japanese | Voice collapses to locale default | CustomLLMJudge with per-language few-shot

The discipline: tone rubric per turn, persona-stability score across the trajectory, and a moving cosine-distance baseline from an in-brand centroid. Gate CI on per-trajectory floors, not the mean of turn scores.

Why per-turn tone scoring ships broken personas

A turn-level tone rubric scores (user_turn, assistant_turn) against the voice description. It cannot see the cumulative shape of the trajectory. Three failure modes recur in postmortems when teams shipped on per-turn means:

  • The slow relax. Turn 1 is sharp and warm. By turn 12 the agent has picked up corporate hedging, exclamation marks have started leaking, and the contractions are gone. No single turn fails the per-turn floor of 0.85. The conversation as a whole has drifted into a register the brand would not approve.
  • The pushback fold. User challenges the agent on turn 6 (“are you sure?”). The agent caves: “You are absolutely right, I apologise for the confusion.” Per-turn tone scores it as polite. Per-turn refusal-correctness scores it as compliant. The cross-turn reading is the agent dropped persona under pressure.
  • The register flip. User starts technical on turn 1. By turn 9 the user is frustrated and terse. The agent, still graded per turn against the same rubric, has not adjusted register and reads as tone-deaf. Each turn is in-brand on paper; the trajectory missed the emotional arc.

Long context windows do not fix this; they shift the failure pattern from “the model forgot the persona” to “the model has the persona description and ignores it across a long conversation,” which is harder to catch because every turn looks fine.

The shift is one sentence: stop scoring only (turn, voice_rubric) pairs and start also scoring transcript -> persona_stability. Everything below is what that costs to operationalise.
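To make the gap concrete, here is a minimal sketch of what a per-turn gate sees versus what a trajectory view adds. The scores are made-up numbers for one hypothetical 12-turn conversation, and `per_turn_view` and `trajectory_view` are illustrative helpers, not SDK functions.

```python
import numpy as np

PER_TURN_FLOOR = 0.85  # the per-turn floor quoted in the slow-relax example

# Hypothetical per-turn tone scores for one 12-turn conversation (made-up data).
turn_scores = [0.97, 0.96, 0.95, 0.94, 0.93, 0.92,
               0.91, 0.90, 0.89, 0.88, 0.87, 0.86]

def per_turn_view(scores, floor=PER_TURN_FLOOR):
    """What a per-turn gate sees: each turn against the floor, plus the mean."""
    return {
        "all_turns_pass": all(s >= floor for s in scores),
        "mean": float(np.mean(scores)),
    }

def trajectory_view(scores):
    """What a trajectory gate adds: the trend across the sequence."""
    slope = float(np.polyfit(np.arange(len(scores)), scores, 1)[0])
    return {
        "slope_per_turn": slope,
        "first_to_last_drop": scores[0] - scores[-1],
    }

print(per_turn_view(turn_scores))
print(trajectory_view(turn_scores))
```

Every turn clears the floor and the mean looks healthy, yet the slope is steadily negative. That trend is the part a per-turn gate never reads.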

Designing the tone rubric

The per-turn tone rubric is the foundation; the trajectory rubric layers on top. A useful tone rubric has three parts.

A voice description, 5 to 7 sentences. Adjectives and contrasts beat lists. “Sharp and warm, not chirpy. Confident without being cocky. Direct without being curt. Comfortable with technical detail but allergic to jargon. Uses contractions. Never uses exclamation marks. Never says ‘I’m sorry to hear that.’” Adjective pairs that name what the voice is and what it is not give the judge a calibrated band, not a vague target.

Ten to twenty in-brand and out-of-brand example pairs. Same prompt, two responses, one in voice and one in the generic LLM default. Without these the judge averages toward generic helpfulness and every response scores around 4.0. With them, the band tightens and the floor becomes meaningful.

A small set of hard rules. No exclamation marks. No “I’m just an AI” disclaimers. No emoji in support tickets. These do not need a judge call; a regex pass is faster and cheaper.
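A regex pass for those three rules might look like the sketch below. The patterns and the `hard_rule_violations` helper are illustrative assumptions; the emoji range in particular is a rough approximation, not full Unicode emoji coverage.

```python
import re

# The three hard rules named above; the patterns themselves are illustrative.
HARD_RULES = {
    "exclamation_mark": re.compile(r"!"),
    "ai_disclaimer": re.compile(r"\bI['’]m just an AI\b", re.IGNORECASE),
    # Rough emoji coverage: main pictograph blocks plus misc symbols and dingbats.
    "emoji": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}

def hard_rule_violations(text: str) -> list[str]:
    """Return the names of every hard rule the text breaks (empty list = clean)."""
    return [name for name, pattern in HARD_RULES.items() if pattern.search(text)]

print(hard_rule_violations("I'm just an AI, but happy to help!"))
print(hard_rule_violations("That's fixed. Try the export again."))
```

Running this before the judge means a reply that breaks a hard rule never costs a judge call.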

Future AGI’s Tone template (eval_id=16) ships the generic per-turn pass. The persona-specific layer wraps CustomLLMJudge:

from fi.evals import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

voice_rubric = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "brand_voice_per_turn",
        "model": "gpt-4.1",
        "grading_criteria": (
            "Score 1.0 to 5.0 how closely this response matches the brand voice. "
            "Voice: sharp and warm, not chirpy; direct without being curt; "
            "uses contractions; no exclamation marks; no 'I'm sorry to hear that.'; "
            "no 'I'm just an AI' disclaimers. "
            "5.0 = sounds exactly like the brand. "
            "3.0 = neutral, could be any LLM. "
            "1.0 = directly violates the voice rules. "
            "Provide a one-sentence reason citing the specific voice attribute that landed."
        ),
        "few_shot_examples": [
            {"input": "...", "in_brand": "...", "out_of_brand": "...", "score": 5.0},
            # 10-20 anchors
        ],
    },
)

evaluator = Evaluator()
result = evaluator.evaluate(
    eval_templates=[voice_rubric],
    inputs=[TestCase(input=user_msg, output=assistant_msg)],
)

The judge model is pinned. The few-shot examples are versioned. When the voice shifts — a new product line, a new audience, a tone update — the rubric ships a new version through the same review process as a prompt change.

The persona-stability score on the trajectory

A per-turn rubric cannot detect persona drift because the input shape is wrong. Drift is variance across a sequence; the rubric input has to be the sequence. Two complementary signals catch it.

Trajectory rubric (qualitative). CustomerAgentPromptConformance ships as the trajectory-level persona-adherence template in the SDK. The input is the full transcript plus the persona description; the score reflects whether the voice held or relaxed somewhere in the middle. Pair with a CustomLLMJudge that explicitly scores variance across turns.
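One way to prepare the sequence for a transcript-level judge is to number the assistant turns so the judge can say where the voice slipped. The sketch below is plain Python: `format_transcript`, `build_judge_input`, and the criteria text are hypothetical, and the message shape (a list of dicts with `role` and `content`) is an assumption borrowed from common chat formats, not a documented SDK contract.

```python
def format_transcript(history: list[dict]) -> str:
    """Render a chat history as labelled lines so a judge can cite turn numbers."""
    lines = []
    assistant_turn = 0
    for message in history:
        if message["role"] == "assistant":
            assistant_turn += 1
            label = f"[assistant turn {assistant_turn}]"
        else:
            label = f"[{message['role']}]"
        lines.append(f"{label} {message['content']}")
    return "\n".join(lines)

# Hypothetical grading criteria for a variance-across-turns rubric.
PERSONA_STABILITY_CRITERIA = (
    "Score 1.0 to 5.0 how consistently the assistant holds the persona across "
    "the whole transcript. 5.0 = the voice in the last turns matches the first. "
    "1.0 = the voice has collapsed into a generic assistant register. "
    "Cite the assistant turn number where the voice first slipped, if any."
)

def build_judge_input(persona_description: str, history: list[dict]) -> str:
    """Combine persona and transcript into one judge-readable input string."""
    return (
        f"PERSONA:\n{persona_description}\n\n"
        f"TRANSCRIPT:\n{format_transcript(history)}"
    )
```

The criteria string could be passed as `grading_criteria` to a CustomLLMJudge configured like the per-turn one earlier; whether the SDK expects a different transcript shape is something to check against its documentation.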

Embedding drift (quantitative). Embed each assistant turn with a sentence encoder. Compute cosine distance from a centroid built from 50 to 100 in-brand exemplars. Rising mean distance across the session is a drift signal that pairs with the qualitative rubric and stays cheap to compute on every turn:

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
brand_centroid = encoder.encode(in_brand_exemplars).mean(axis=0)
brand_centroid /= np.linalg.norm(brand_centroid)

def turn_drift(turn_text: str) -> float:
    v = encoder.encode(turn_text)
    v /= np.linalg.norm(v)
    return float(1 - v @ brand_centroid)  # cosine distance

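def trajectory_drift(turns: list[str]) -> dict:
    if not turns:
        raise ValueError("trajectory_drift needs at least one assistant turn")
    distances = [turn_drift(t) for t in turns]
    # A slope needs at least two points; a single-turn session has no trend.
    slope = np.polyfit(range(len(distances)), distances, 1)[0] if len(distances) > 1 else 0.0
    return {
        "mean_distance": float(np.mean(distances)),
        "slope": float(slope),  # change in cosine distance per turn
        "max_late_session": float(max(distances[len(distances) // 2:])),  # worst turn in the second half
    }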

Three numbers carry the signal. Mean cosine distance shows whether the conversation sat far from the centroid. Slope is the smoking gun for persona relax: positive slope means the voice drifted across the session. Max-late-session distance pinpoints the worst moment in the second half, which is where screenshots come from.

The layers compose. Embedding drift flags trajectories that warrant a judge call; the trajectory rubric writes the reason. A conversation with rising slope plus a CustomerAgentPromptConformance score below 0.85 is a real drift event. High slope with a high conformance score is usually a topic shift, not a persona break.
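The composition rule can be written down as a small triage function. This is a sketch under assumptions: the 0.85 conformance threshold comes from the paragraph above, the slope cap reuses the 0.0015 value from the CI floors later in this guide, and the `conformance_only` branch is an added label for a case the text does not discuss.

```python
SLOPE_CAP = 0.0015        # illustrative: same value as the CI slope cap in this guide
CONFORMANCE_FLOOR = 0.85  # threshold from the paragraph above

def triage(slope: float, conformance: float) -> str:
    """Combine the quantitative and qualitative drift signals into one label."""
    rising = slope > SLOPE_CAP
    low_conformance = conformance < CONFORMANCE_FLOOR
    if rising and low_conformance:
        return "drift_event"          # both signals agree: real persona drift
    if rising:
        return "likely_topic_shift"   # embeddings moved, voice held
    if low_conformance:
        return "conformance_only"     # assumption: judge flagged it, embeddings did not
    return "ok"

print(triage(slope=0.004, conformance=0.70))
print(triage(slope=0.004, conformance=0.95))
```

Only the first label needs a human to read the transcript; the second is worth spot-checking until the thresholds are tuned on real traffic.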

Calibrating against human raters

A voice rubric the team does not trust is a rubric nobody runs. Calibration is what earns the trust.

Pin a 100 to 200 example hold-out set. Two domain reviewers (the brand owner plus a senior product writer is the workable minimum) score each example independently on the 1-5 rubric. Compute the mean per example. Run the LLM judge on the same set and compute Cohen’s kappa between the judge score and the human mean. Kappa compares categorical labels, so round both to the integer scale first, and prefer a weighted kappa so that a one-point miss counts less than a three-point miss. Target above 0.6 to ship; above 0.75 to trust the gate for a release decision.
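The agreement check can be sketched in a few lines of NumPy. Two assumptions are baked in and not taken from the text: scores are rounded half-up onto the 1-5 integer scale, and the kappa uses quadratic weights, a common choice for ordinal ratings. `to_scale`, `weighted_kappa`, and `gate_decision` are illustrative helper names.

```python
import numpy as np

def to_scale(scores, low=1, high=5):
    """Round half-up onto the integer rubric scale and clip to its range."""
    rounded = np.floor(np.asarray(scores, dtype=float) + 0.5).astype(int)
    return np.clip(rounded, low, high)

def weighted_kappa(rater_a, rater_b, low=1, high=5) -> float:
    """Quadratic-weighted Cohen's kappa for integer ratings on an ordinal scale."""
    a = np.asarray(rater_a, dtype=int) - low
    b = np.asarray(rater_b, dtype=int) - low
    k = high - low + 1
    observed = np.zeros((k, k))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    # Agreement expected by chance, from each rater's marginal distribution.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2
    chance = (weights * expected).sum()
    if chance == 0:
        return 1.0  # kappa is undefined here; both raters used one identical category
    return float(1.0 - (weights * observed).sum() / chance)

def gate_decision(kappa: float) -> str:
    """Map kappa onto the thresholds from the paragraph above."""
    if kappa > 0.75:
        return "trust_for_release"
    if kappa > 0.6:
        return "ship"
    return "recalibrate"

# Usage with made-up scores: two human raters and one judge.
human_mean = np.mean([[5, 4, 2, 3, 1], [4, 4, 3, 3, 1]], axis=0)
judge = [4.6, 4.1, 2.4, 3.0, 1.2]
print(gate_decision(weighted_kappa(to_scale(human_mean), to_scale(judge))))
```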

Three habits earn the calibration back. Refresh the hold-out quarterly with recent in-brand examples; drop stale ones whose voice no longer matches the current product surface. Track inter-rater agreement between the two humans before judging the judge; disagreement above 1.0 on a 5-point scale on a third of the examples means the rubric is itself ambiguous, and the fix is rubric clarity, not a different judge. Recompute kappa every quarter; if it drops below 0.5, ship a rubric version bump with updated few-shot examples. Self-improvement handles incremental tuning between calibration windows; the major shifts need a human decision.
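The inter-rater check is worth automating so it runs before anyone looks at judge agreement. A minimal sketch, assuming "a third of the examples" means a share of at least one third; `rubric_is_ambiguous` is a hypothetical helper.

```python
def rubric_is_ambiguous(rater_a, rater_b, gap=1.0, share=1 / 3) -> bool:
    """True when the two humans differ by more than `gap` on at least `share` of examples."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of examples")
    far_apart = sum(1 for x, y in zip(rater_a, rater_b) if abs(x - y) > gap)
    return far_apart / len(rater_a) >= share

# Made-up ratings: the humans disagree by 2 points on two of six examples.
brand_owner = [5, 4, 2, 5, 3, 1]
product_writer = [3, 4, 2, 3, 3, 1]
print(rubric_is_ambiguous(brand_owner, product_writer))
```

If this returns True, fix the rubric wording first; a kappa computed against two humans who disagree with each other measures little.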

LLM-as-judge best practices covers the kappa-and-hold-out pattern across other rubric families.

Per-trajectory CI scoring

The CI gate runs the tone rubric per turn AND the persona-stability rubric per trajectory. Most teams set up the first and forget the second; that is the gap most personas ship through.

from fi.evals import Evaluator
from fi.evals.templates import (
    Tone, AnswerRefusal, CustomerAgentPromptConformance,
)
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

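# voice_rubric and trajectory_drift come from the earlier snippets.
# persona_stability_rubric (a transcript-level CustomLLMJudge) and aggregate()
# are not shown in this guide; aggregate() is assumed to return a flat dict
# keyed by the names in TRAJECTORY_FLOORS, plus "slope".
evaluator = Evaluator()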

TRAJECTORY_FLOORS = {
    "tone": 0.85,                                     # per-turn min
    "brand_voice_per_turn": 4.0,                      # per-turn min, 1-5
    "customer_agent_prompt_conformance": 0.88,        # trajectory min
    "persona_stability": 4.0,                         # trajectory min, 1-5
    "embedding_slope_max": 0.0015,                    # drift slope cap
    "refusal_tone_min": 4.0,                          # 1-5 on refusal turns
}

def test_trajectories(trajectory_dataset):
    failed = []
    for traj in trajectory_dataset:
        assistant_turns = [t["content"] for t in traj.history if t["role"] == "assistant"]
        per_turn = evaluator.evaluate(
            eval_templates=[Tone(), voice_rubric],
            inputs=[TestCase(input=u, output=a) for u, a in traj.turn_pairs],
        )
        trajectory_tc = TestCase(
            input=traj.transcript,
            expected_output=traj.persona_description,
            conversation=traj.history,
        )
        trajectory = evaluator.evaluate(
            eval_templates=[CustomerAgentPromptConformance(), persona_stability_rubric],
            inputs=[trajectory_tc],
        )
        drift = trajectory_drift(assistant_turns)
        scores = aggregate(per_turn, trajectory, drift)
        for metric, floor in TRAJECTORY_FLOORS.items():
            if metric == "embedding_slope_max":
                # The slope entry is a cap, not a floor: fail when the slope exceeds it.
                # Checked first so scores["embedding_slope_max"] is never looked up.
                if scores["slope"] > floor:
                    failed.append((traj.id, "slope", scores["slope"], floor))
            elif scores[metric] < floor:
                failed.append((traj.id, metric, scores[metric], floor))
    assert not failed, f"persona failures: {failed[:5]}"

Three habits separate a working persona gate from theatre.

  • Score per trajectory, not just per turn. The persona-stability rubric input is the transcript; the per-turn rubric input is the pair. Both run.
  • Stratify the dataset. Give equal weight to each persona-by-scenario cell, so a single class of failure (pushback fold, register flip) cannot hide behind happy-path volume and pass review.
  • Diff against a moving baseline. Alarm on a 2-point sustained drop on the persona-stability rubric or a 30 percent jump in the embedding slope. Trajectory rubrics are noisier per-trajectory than per-turn scores; over-alarming on the noise floor kills the habit of looking at the gate.
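Stratification is easy to get wrong by pooling, so here is a minimal sketch of the equal-weight-per-cell idea. The row shape (persona, scenario, score) and the `stratified_scores` helper are assumptions for illustration, and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean

def stratified_scores(rows):
    """rows: iterable of (persona, scenario, score). Every cell counts equally."""
    cells = defaultdict(list)
    for persona, scenario, score in rows:
        cells[(persona, scenario)].append(score)
    cell_means = {cell: mean(scores) for cell, scores in cells.items()}
    return {
        "cell_means": cell_means,
        "macro_mean": mean(cell_means.values()),           # equal weight per cell
        "worst_cell": min(cell_means, key=cell_means.get),  # gate on this, not the mean
    }

# Made-up persona-stability scores: happy path dominates by volume.
rows = [("companion", "happy_path", 4.8)] * 8 + [("companion", "pushback", 2.0)] * 2
pooled = mean(score for _, _, score in rows)
report = stratified_scores(rows)
print(round(pooled, 2), round(report["macro_mean"], 2), report["worst_cell"])
```

The pooled mean clears a 4.0 floor on happy-path volume alone, while the per-cell view shows the pushback scenario failing.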

Production observability and Error Feed

The CI gate catches the trajectory bugs you can think of. Production catches everything else. The bridge that makes both possible is the same: every span carries session.id and user.id, so per-turn LLM spans roll up into a conversation root span. The Tone rubric attaches per turn; the persona-stability rubric attaches to the conversation root.

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType, EvalSpanKind,
    EvalName, ModelChoices,
)
from openinference.instrumentation.openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="companion-agent",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.TONE,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.CONVERSATION,
            eval_name=EvalName.PROMPT_INSTRUCTION_ADHERENCE,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
    ],
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

EvalSpanKind.LLM runs the per-turn tone pass on every turn; EvalSpanKind.CONVERSATION runs the trajectory-level persona check when the session closes. Sample the trajectory rubric at 5 to 10 percent of live traffic to keep judge cost bounded. Deterministic checks (banned phrases, exclamation marks, reading level, embedding drift) run on 100 percent of turns at sub-millisecond cost.
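If the sampling decision is made in application code, hashing the session id keeps it stable: every turn of a session lands on the same side of the cut, and a replay samples the same sessions. This is a generic sketch with a hypothetical `sample_session` helper, not a traceAI feature.

```python
import hashlib

def sample_session(session_id: str, rate: float = 0.05) -> bool:
    """Deterministically keep roughly `rate` of sessions, keyed on the session id."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

# Made-up session ids: the kept share should sit near the requested rate.
kept = sum(sample_session(f"session-{i}", rate=0.05) for i in range(10_000))
print(kept / 10_000)
```

Hashing beats `random.random()` here because a trajectory rubric only makes sense on a whole session; a per-turn coin flip would leave most sampled conversations incomplete.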

Four production-only persona signals to alarm on.

  • Mean embedding distance per session length bucket. A rising mean on 20-plus-turn sessions while short sessions stay stable is the persona-relax signature.
  • Persona-stability rubric drop on the long-session cohort. Same diagnosis, qualitative side.
  • Refusal-tone score over time. Refusals tend to drift colder under pressure; a 2-point sustained drop is the early warning before the screenshot lands.
  • Cross-language persona-stability gap. English holds, Spanish collapses to formal-default; the gap shows up as soon as the per-language rubrics share the same scale.
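The first signal, mean embedding distance per session-length bucket, reduces to a small aggregation. In this sketch only the 20-plus-turn cohort comes from the text; the other bucket edges, the helper names, and the distances are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical bucket edges; only the 20-plus cohort is named in the text.
BUCKETS = [(1, 5, "1-5"), (6, 19, "6-19"), (20, None, "20+")]

def bucket_for(n_turns: int) -> str:
    """Return the label of the session-length bucket that contains n_turns."""
    for low, high, label in BUCKETS:
        if n_turns >= low and (high is None or n_turns <= high):
            return label
    raise ValueError(f"no bucket for a session with {n_turns} turns")

def mean_distance_by_bucket(sessions):
    """sessions: iterable of per-session lists of per-turn cosine distances."""
    grouped = defaultdict(list)
    for distances in sessions:
        if distances:  # skip sessions with no assistant turns
            grouped[bucket_for(len(distances))].append(mean(distances))
    return {label: mean(values) for label, values in grouped.items()}

# Made-up distances: short sessions sit near the centroid, long ones do not.
sessions = [[0.10] * 3, [0.12] * 4, [0.30] * 25]
print(mean_distance_by_bucket(sessions))
```

Tracking each bucket against its own baseline, rather than one global mean, is what lets the long-session relax show up while short sessions stay flat.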

The Error Feed closes the loop. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing trajectories into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, prompt-cache hit near 90 percent) writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). The clusters that show up on persona work read like a confessional:

  • persona_relax_after_turn_8 / 47 sessions / contractions disappear, hedging picks up, exclamation marks leak; immediate fix: restate persona constraints in-context every 6 turns instead of only at system-prompt time.
  • pushback_fold / 23 sessions / agent caves into apologetic register when user challenges; immediate fix: harden system prompt to restate reasoning rather than agree.
  • refusal_corporate / 18 sessions / refusals lose warmth and offer no alternative; immediate fix: add refusal exemplar to few-shot anchors.
  • language_collapse_es / 12 sessions / Spanish replies sit at formal-default; immediate fix: ship per-language calibration set, currently English-only.

Each cluster summary feeds the Platform’s self-improving evaluators, so the same rubric that scored persona adherence gets sharper as drifted conversations are observed and labelled. Engineers cannot promote a failing trace into the dataset on their own; the gold persona-adherent end state needs domain-lead review.

How Future AGI fits persona and style evaluation

Future AGI ships the eval stack as a package. Start with the SDK for code-defined tone and persona rubrics. Graduate to the Platform for self-improving evaluators tuned by domain-lead feedback.

  • ai-evaluation SDK (Apache 2.0). Tone (eval_id=16) for per-turn voice. AnswerRefusal, IsPolite, IsInformalTone, IsHelpful, IsConcise as orthogonal axes. CustomerAgentPromptConformance ships the trajectory-level persona-adherence template; the broader 11-template CustomerAgent family covers conversation quality, clarification seeking, objection handling, termination handling, and human escalation. CustomLLMJudge with grading_criteria and few_shot_examples handles the persona-specific rubric in a few lines. 50+ evaluators total, 20+ heuristic metrics locally.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, and C#; 14 span kinds with first-class CONVERSATION, AGENT, and LLM; persona rubrics attach via EvalTag with EvalSpanKind.LLM per turn and EvalSpanKind.CONVERSATION per trajectory. PII redaction is built in.
  • simulate-sdk. Persona plus Scenario plus TestRunner drives the adversarial side of the persona test set: pushback ladders, register-flip probes, language switches across OpenAIAgentWrapper, LangChainAgentWrapper, GeminiAgentWrapper, and AnthropicAgentWrapper. ScenarioGenerator auto-produces variants for the persona-pressure family.
  • Future AGI Platform. Self-improving evaluators tuned by domain-lead thumbs feedback. In-product authoring agent writes persona rubrics from natural-language voice descriptions. Classifier-backed style scoring at lower per-eval cost than Galileo Luna-2.
  • Error Feed (inside the eval stack). HDBSCAN clustering plus a Sonnet 4.5 Judge writes the immediate_fix; domain-lead-reviewed promotions feed the dataset and the self-improving evaluators. Linear OAuth wired today; Slack, GitHub, Jira, and PagerDuty on the roadmap.
  • Agent Command Center. OpenAI-compatible AI gateway in a single Go binary (Apache 2.0); 100+ providers; 18+ built-in guardrail scanners with EvalSpanKind.CONVERSATION carrying persona scores back into the same trace tree. SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 in active audit.

Three honest tradeoffs. Trajectory rubrics cost more per call than turn-level rubrics because the LLM judge reads the whole transcript; the discipline is to gate CI on the curated regression set and sample production trajectories at 5 to 10 percent while letting per-turn tone and deterministic checks run on 100 percent. Voice rubrics are calibration-heavy; inter-rater reliability matters more for style than for correctness, and a rubric without a quarterly human-labelled hold-out drifts out of brand inside a quarter. Self-improving evaluators need a pinned hold-out and quarterly calibration review or the drift moves with the data and no one notices.

Ready to score persona drift on your first trajectories? Wire Tone, CustomerAgentPromptConformance, a CustomLLMJudge for the persona-specific rubric, and the cosine-distance drift slope into a pytest fixture this afternoon against the ai-evaluation SDK, then attach the same rubrics to live conversation root spans via EvalSpanKind.CONVERSATION when production traces start asking questions the CI gate missed.

Frequently asked questions

Why does per-turn tone scoring miss persona drift?
A per-turn rubric scores one reply against the voice description. The cross-turn failure most teams ship hides in the gap between turn 3 and turn 27: the agent stays sharp and warm for the first few turns, then slowly relaxes into the base model's default register. Per-turn scores stay green on every single turn while the conversation as a whole has drifted into generic-helpful-assistant. Averaging the turn scores does not catch this; the mean stays high because every floor passes. The unit that catches persona drift is the trajectory, not the turn. You need a tone rubric per turn AND a persona-stability score that takes the full transcript as input and measures voice variance across the sequence.
How do you design a tone rubric for a persona?
Three parts. A 5-7 sentence voice description with adjectives and contrasts (sharp and warm, not chirpy; direct without being curt; uses contractions; never says 'I'm sorry to hear that.'). Ten to twenty in-brand and out-of-brand example pairs the judge calibrates against. A small set of hard rules the heuristic layer catches before the judge runs (no exclamation marks, no 'I'm just an AI' disclaimers). Future AGI's Tone template (eval_id=16) handles the per-turn pass; CustomLLMJudge wraps the persona-specific rubric with grading_criteria and few_shot_examples. Calibrate against human raters every quarter or the rubric drifts with the data and no one notices.
How do you measure persona drift across a long conversation?
Two complementary signals. First, score the full transcript against a PersonaStability rubric (CustomLLMJudge taking the turn sequence as input, scoring voice variance across the trajectory). Second, embed each turn's response and measure cosine distance from a centroid of in-brand exemplars; rising mean distance across the session is a quantitative drift signal that pairs with the qualitative rubric. CustomerAgentPromptConformance ships as the trajectory-level persona-adherence template in the ai-evaluation SDK. The two together catch drift the per-turn rubric cannot see: the voice that holds on turn 3 and quietly collapses by turn 27.
How do you calibrate a voice rubric against human raters?
Pin a 100-200 example hold-out set. Have two domain reviewers (brand owner plus a senior product writer) score each example 1-5 on the voice rubric independently. Run the LLM judge on the same set. Compute Cohen's kappa between judge and the human mean; target above 0.6 to ship, above 0.75 to trust. Refresh the hold-out quarterly and recompute kappa. If kappa drops below 0.5 the rubric has drifted out of sync with the brand and needs a version bump with updated few-shot examples. Without calibration the judge averages toward generic helpfulness and the team loses trust in the scores within a quarter.
How does production observability catch persona failures the CI gate missed?
Every span carries session.id so per-turn LLM spans roll up into a conversation root. The Tone template attaches per-turn via EvalTag with EvalSpanKind.LLM; the PersonaStability rubric attaches at the trajectory level via EvalSpanKind.CONVERSATION, sampled at 5 to 10 percent of live sessions to keep judge cost bounded. Deterministic checks (banned phrases, exclamation marks, reading level) run on 100 percent of turns. Error Feed clusters failing trajectories with HDBSCAN and a Sonnet 4.5 Judge writes the immediate_fix. The clusters that surface consistently: persona drops the no-emoji rule by turn 8, refusal goes cold and corporate, agent flips to formal register when the user is frustrated.
What's the cheapest way to run persona evaluation at scale?
A three-stage cascade composed inside one Evaluator call. Stage one runs deterministic heuristics (banned phrases, exclamation marks, sentence length, reading level) on every output in microseconds. Stage two runs a small classifier trained on 200-500 in-brand and out-of-brand examples for the items that pass the heuristic gate. Stage three reserves CustomLLMJudge for the 5-10 percent the classifier scored as ambiguous. The persona-stability rubric runs on a 5-10 percent sample of full trajectories rather than every conversation. Cost lands at a small fraction of judge-on-everything while keeping the trajectory drift signal visible.
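The cascade logic itself is short. The sketch below shows only the routing in plain Python, independent of how the SDK composes stages; `cascade_score`, the three callables, and the 0.4 to 0.6 ambiguity band are all hypothetical.

```python
from typing import Callable

def cascade_score(
    text: str,
    heuristics: Callable[[str], list],   # returns a list of rule violations
    classifier: Callable[[str], float],  # returns P(in-brand) in [0, 1]
    judge: Callable[[str], float],       # expensive LLM judge, e.g. a 1-5 score
    ambiguous_band: tuple = (0.4, 0.6),  # assumption: tune on the hold-out set
) -> tuple:
    """Return (stage, score), stopping at the cheapest stage that can decide."""
    if heuristics(text):
        return ("heuristic", 0.0)        # a hard-rule violation fails outright
    p_in_brand = classifier(text)
    low, high = ambiguous_band
    if p_in_brand < low or p_in_brand > high:
        return ("classifier", p_in_brand)  # confident either way, skip the judge
    return ("judge", judge(text))          # only ambiguous items pay for the judge

# Stub stages for illustration only.
stage, score = cascade_score(
    "That's fixed. Try the export again.",
    heuristics=lambda t: ["exclamation_mark"] if "!" in t else [],
    classifier=lambda t: 0.92,
    judge=lambda t: 4.5,
)
print(stage, score)
```

Note that the stages return scores on different scales here; a real pipeline would normalise them before applying one floor.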
How does Future AGI fit persona and style evaluation?
Future AGI ships the eval stack as a package. The ai-evaluation SDK (Apache 2.0) gives you Tone (eval_id=16) for per-turn voice, CustomerAgentPromptConformance for trajectory-level persona adherence, and CustomLLMJudge with grading_criteria and few_shot_examples for the persona-specific rubric. traceAI groups spans into sessions and attaches the same rubrics to live conversation root spans via EvalTag with EvalSpanKind.CONVERSATION. Error Feed clusters failing trajectories and writes immediate_fix per cluster. The Platform layers self-improving evaluators tuned by domain-lead feedback at lower per-eval cost than Galileo Luna-2.