
Evaluating Agentic Workflows Orchestrated With Temporal

Temporal turns agent workflows into replayable state machines. The eval that matches: per-activity correctness, workflow outcome, retry budget enforcement, signal-handler correctness. Replay is the superpower.


A procurement agent on Temporal sleeps four days waiting for a vendor reply. The reply lands at 2 a.m. Sunday as a signal, the workflow resumes, the agent compares quotes, places the order. On the dashboard the run looks clean: TaskCompletion 0.93, cost $0.41, no errors. A week later, finance flags it. The vendor list excluded one supplier the user named in the original prompt. The signal handler dropped the exclusion when it merged the late reply into state. Nothing in the per-turn rubric moved. Everything broke in the seam between the original prompt, the four-day sleep, and the signal in the middle.

Temporal turns agent workflows into deterministic, replayable state machines. Workflow code is deterministic by contract, non-determinism lives inside activities, every run reconstructs from history. That flips eval from a flaky black-box exercise into a reproducible one — and changes what you need to score. The eval that matches Temporal is four layers: per-activity correctness, workflow-level outcome, retry-budget enforcement, and signal-handler correctness. Replay is the superpower. A regression that surfaced last Tuesday replays identically in CI today, and bisects to the activity that moved.

This post is the working pattern: what to instrument, what to score at each layer, where the eval lives, and how to feed failures back through Error Feed into the activity prompts that produced them.

TL;DR: the Temporal-aware rubric set

| Layer | What it scores | FAGI surface |
| --- | --- | --- |
| Per-activity correctness | Template per activity kind, attached to the activity span | EvalTag + EvalSpanKind |
| Workflow outcome | One TaskCompletion at the workflow’s terminal signal | ai-evaluation SDK |
| Retry-budget correctness | RetryBudgetCorrectness CustomLLMJudge over the retry sequence | ai-evaluation SDK |
| Signal-handler correctness | SignalHandlerCorrectness CustomLLMJudge per signal | ai-evaluation SDK |
| Replay equivalence | Two-replay Groundedness delta on the same input | traceAI + ai-evaluation |
| Cost ceiling | Aggregate x-prism-cost across the span tree | Agent Command Center |
| Closed loop | Cluster failures, write immediate_fix, retrain evaluator | Error Feed |

Why Temporal changes the eval game

A normal async agent runs in one process for a few seconds. Latency is bounded, state fits in memory, the eval problem is “was the response good.” A Temporal-orchestrated agent looks nothing like that. Four properties change the eval surface:

Lifetime is unbounded. A workflow can sleep for days waiting on a signal, then resume. The eval has to survive worker restarts, deploys, and version changes between start and end. There is no continuous process to hold eval state in memory.

The execution tree has structure. A parent workflow spawns children; each has its own activities; siblings run in parallel and fail independently. Scoring has to attribute correctness at the activity, child, and parent level — not as one score on the final response.

Retries are first-class. An activity that succeeded did not necessarily succeed on the first try. An agent that retries the same tool call five times before succeeding is technically correct but operationally broken. The eval has to score the retry decision itself.

Replay is deterministic. This is the property that flips the eval problem in your favour. Workflow code replays identically from the history; LLM calls live inside activities, Temporal records their results, so on replay the activity returns the same output. The same regression that surfaced in production is reproducible in CI as long as the workflow history is preserved. A flaky agent rubric becomes a bisectable one.

The catch: replay determinism only holds when non-determinism stays inside activities. An LLM call made directly in workflow code re-executes on replay and returns a different result, so the commands the workflow issues stop matching the recorded history, which Temporal typically surfaces as a non-determinism error. We score this drift directly in Layer 5 below.

If you are new to the broader category, the LLM agent architectures piece covers the vocabulary, and the observability vs evaluation vs benchmarking post draws the lines this article assumes.

Map the workflow to spans before you score it

A retry rubric needs a retry span. A signal-handler rubric needs a signal-handler span. Before any template runs, the workflow has to be visible as a trace.

Two pieces wire it. temporalio.contrib.opentelemetry.TracingInterceptor goes on the client and the worker so workflow and activity spans emit through OpenTelemetry. fi_instrumentation.register configures the FAGI tracer provider and installs the per-framework LLM instrumentor for whichever SDK each activity uses.

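import asyncio

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor
from temporalio.client import Client
from temporalio.contrib.opentelemetry import TracingInterceptor
from temporalio.worker import Worker

# Configure the FAGI tracer provider, then hook the OpenAI SDK used inside activities.
tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="procurement-agent-prod",
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)


async def main() -> None:
    # Client.connect is a coroutine, so it has to run inside an event loop.
    client = await Client.connect(
        "temporal.internal:7233",
        interceptors=[TracingInterceptor()],
    )
    # The worker reuses the traced client; fill in your own workflows and activities.
    worker = Worker(client, task_queue="agent", workflows=[...], activities=[...])
    await worker.run()


asyncio.run(main())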

Inside the workflow, set session.id to the Temporal workflow_id so the multi-day run threads under one session attribute. Each activity opens its own span. The fi.span.kind attribute marks role: CHAIN for workflow orchestration, AGENT for an LLM-driven loop, TOOL for an external API call, RETRIEVER for retrieval. Child workflows nest naturally because parent context carries through the interceptor. The whole tree reconstructs in the FAGI UI as one connected graph, even though execution spanned dozens of worker processes.

LLM calls belong in activities, never in workflow code. A workflow that wraps an LLM call directly will replay non-deterministically and break Temporal’s guarantees — the most common rookie defect. We re-score it in Layer 5 because the eval surface is the cleanest place to catch it.

Layer 1: per-activity correctness

One template per activity kind. Pick the two or three that match what the activity does, score those, attach the result.

  • TOOL activities (external API or function call): LLMFunctionCalling on call shape and arguments.
  • RETRIEVER activities (vector, BM25, hybrid): Groundedness (47) and ContextAdherence (5) on retrieved-context-to-output.
  • AGENT activities (multi-step LLM reasoning in one activity): Completeness (10) plus TaskCompletion scoped to the sub-goal.
  • Refusal-eligible activities: AnswerRefusal (88) so over-cautious refusal scores against the agent.

Tag every score with EvalSpanKind so queries stay cheap. “Show me all TOOL activities in the last 24h with LLMFunctionCalling below 0.7” is one filter, not a join across three tables. The agent evaluation frameworks 2026 post covers the broader template set.
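As a rough illustration of the kind-to-template mapping and the single-filter query described above, the sketch below keeps everything in plain Python. The `ActivityScore` record and the `templates_for` and `low_scores` helpers are hypothetical names for illustration and are not part of the ai-evaluation SDK; only the template and span-kind names come from the list above.

```python
from dataclasses import dataclass

# Template names per span kind, taken from the list above.
TEMPLATES_BY_KIND = {
    "TOOL": ["LLMFunctionCalling"],
    "RETRIEVER": ["Groundedness", "ContextAdherence"],
    "AGENT": ["Completeness", "TaskCompletion"],
}


@dataclass(frozen=True)
class ActivityScore:
    """Hypothetical record: one template score attached to one activity span."""
    activity: str
    span_kind: str
    template: str
    score: float


def templates_for(span_kind: str, refusal_eligible: bool = False) -> list[str]:
    """Return the templates to run for an activity of the given span kind."""
    templates = list(TEMPLATES_BY_KIND.get(span_kind, []))
    if refusal_eligible:
        templates.append("AnswerRefusal")
    return templates


def low_scores(scores, span_kind: str, template: str, threshold: float):
    """Filter on span kind, template, and threshold in a single pass."""
    return [
        s for s in scores
        if s.span_kind == span_kind and s.template == template and s.score < threshold
    ]
```

Because the span kind travels with every score, the "TOOL activities below 0.7" question reduces to one call to `low_scores` over the last 24 hours of records.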

Layer 2: workflow-level outcome

One eval per workflow, run at completion. The question is whether the workflow achieved the user-visible goal: sourced the part, closed the ticket, returned the brief. TaskCompletion (99) handles this. Run it as a follow-up worker subscribed to the workflow’s terminal signal so the eval never blocks the workflow itself.

from fi.evals import Evaluator
from fi.evals.templates import TaskCompletion
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env
result = evaluator.evaluate(
    eval_templates=[TaskCompletion()],
    inputs=[TestCase(input=workflow_input, output=workflow_output, context=workflow_summary)],
)

The score attaches to the parent workflow span via EvalTag. The pairing rule: passing TaskCompletion with failing Layer-1 scores says the agent covered for an upstream bug; the next harder query will fail. Failing TaskCompletion with passing Layer-1 scores says every step was clean but the orchestration assembled them wrong. Both layers run because either one alone hides the other’s failure mode.
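The pairing rule can be written down as a small decision function. This is a minimal sketch: the `diagnose` name and the four label strings are made up for illustration, and the pass/fail inputs are assumed to come from whatever thresholds you apply to the Layer-1 and Layer-2 scores.

```python
def diagnose(task_completion_passed: bool, activity_passes: list[bool]) -> str:
    """Apply the Layer-1 / Layer-2 pairing rule to one workflow run."""
    all_activities_passed = all(activity_passes)
    if task_completion_passed and all_activities_passed:
        return "healthy"
    if task_completion_passed:
        # Outcome looks fine, but at least one activity scored badly.
        return "masked upstream bug"
    if all_activities_passed:
        # Every step was clean; the assembly of steps went wrong.
        return "orchestration fault"
    return "activity and outcome failure"
```

Routing on the label rather than on either score alone keeps the two middle cases visible, which are the ones a single-layer eval hides.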

Layer 3: retry-budget correctness

This is the eval most teams skip and most teams need. Two activities with the same final output are not the same activity if one succeeded on attempt 1 and the other consumed four of five retries to get to attempt 5. The cost and reliability graphs reflect it. The standard rubric does not. A RetryBudgetCorrectness CustomLLMJudge:

from fi.evals.judge import CustomLLMJudge

retry_budget = CustomLLMJudge(
    name="RetryBudgetCorrectness",
    rubric=(
        "Given the activity input, the sequence of failure reasons, the retry "
        "policy, and the eventual outcome, score whether the retry consumption "
        "was warranted. "
        "1.0 = retries warranted, resolved a transient failure on the expected "
        "attempt. "
        "0.5 = succeeded but consumed more budget than the failure type "
        "warranted (wasted compute on a permanent failure, or thrashed against "
        "a downstream rate limit). "
        "0.0 = gave up on a recoverable failure, or looped the full budget on "
        "a permanent one."
    ),
    input_mapping={
        "activity_input": "activity.input",
        "failure_sequence": "activity.attempt_failure_reasons",
        "retry_policy": "activity.retry_policy",
        "final_outcome": "activity.final_output",
    },
)

Score every activity that retried at least once. Aggregate at the workflow level: three activities each retrying four times scores worse than everything succeeding first try, even when the final output is identical.

Activities consistently consuming four-plus of five retries are the signal that the prompt, tool spec, or downstream API needs work, not the retry policy. A workflow-level retry-budget aggregate that drifts over a week is an early indicator of model-checkpoint drift. A refreshed gpt-4o that behaves more cautiously may retry more on the same tool spec; the budget rubric can catch that before the cost graph does.
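The aggregation itself needs no judge. The sketch below assumes each activity run has already been reduced to a count of retries used and a retry budget; `ActivityRun`, `flagged`, and `budget_consumption` are hypothetical names, and the four-retry threshold mirrors the "four-plus of five" rule of thumb above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityRun:
    """Hypothetical summary of one activity execution."""
    name: str
    retries_used: int  # attempts after the first one
    max_retries: int   # retries allowed by the retry policy


def flagged(runs, min_retries: int = 4) -> list[str]:
    """Names of activities that used at least `min_retries` retries."""
    return [r.name for r in runs if r.retries_used >= min_retries]


def budget_consumption(runs) -> float:
    """Fraction of the workflow's total retry budget that was consumed."""
    budget = sum(r.max_retries for r in runs)
    if budget == 0:
        return 0.0
    return sum(r.retries_used for r in runs) / budget
```

Tracking `budget_consumption` per workflow type over time gives the week-over-week series the paragraph above refers to; the judge score then explains why a given activity burned its budget.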

Layer 4: signal-handler correctness

Signal handlers are the most under-evaluated surface in Temporal agents. They fire asynchronously, sometimes hours or days after the original input, and produce no obvious response to score against — the signal just updates state. The standard rubric has nothing to point at.

The fix: open a SIGNAL_HANDLER span on signal arrival, capture name, payload, and a snapshot of state before and after the handler runs. A SignalHandlerCorrectness CustomLLMJudge then scores three things on the next workflow turn:

  • Interpretation. Did the handler parse the signal payload correctly?
  • Invariant preservation. Did it preserve constraints from prior state — exclusion lists, budget caps, user-named entities?
  • Branch correctness. Did the agent continue on the branch the signal implies? “Exclude vendor X” should suppress vendor X for the rest of the workflow, not just the next decision.

This is the rubric that catches the procurement bug at the top of this post. The handler parsed the vendor reply correctly but dropped the user's exclusion while merging the reply into state, so interpretation passes and invariant preservation fails. The sibling failure, where the handler updates state correctly and the agent ignores it on the next decision, shows up as a branch-correctness failure instead. Score every signal received during the run, attach the verdict to the SIGNAL_HANDLER span, aggregate at workflow completion. The evaluating LLM agent handoffs post covers the related handoff-rubric pattern across frameworks.
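Part of the invariant-preservation check can be pre-screened deterministically from the before and after snapshots, before any judge runs. The sketch below is an assumption-heavy illustration: `dropped_invariants` is a hypothetical helper, and the `excluded_vendors` key is an invented state field standing in for whatever constraint lists your workflow state carries.

```python
def dropped_invariants(
    state_before: dict,
    state_after: dict,
    keys: tuple[str, ...] = ("excluded_vendors",),
) -> dict[str, list[str]]:
    """Return entries that existed before the handler ran and are gone afterwards."""
    dropped: dict[str, list[str]] = {}
    for key in keys:
        before = set(state_before.get(key, ()))
        after = set(state_after.get(key, ()))
        missing = before - after
        if missing:
            dropped[key] = sorted(missing)
    return dropped
```

A non-empty result is a hard failure that can be attached to the SIGNAL_HANDLER span directly; the judge is still needed for interpretation and branch correctness, which a set difference cannot see.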

Layer 5: replay equivalence

Drift in a Temporal workflow has a specific shape: output A on the first run, output B on replay or version-bumped re-execution. The structural guard is to keep non-determinism inside activities. The eval-side guard is a replay-equivalence rubric.

Trigger two replays of the same input on the registered worker. Diff the resulting span trees. Run Groundedness and Completeness over both terminal outputs against a shared reference. If the outputs disagree beyond a small threshold, flag the workflow drift-prone and surface the diverging activity. The usual culprit is an LLM call leaked into workflow code, or workflow code that calls the standard library's uuid.uuid4() or random instead of the replay-safe workflow.uuid4() and workflow.random(). Replay equivalence is the rubric that turns Temporal’s biggest reliability feature into the eval’s biggest debugging feature. The agent reliability metrics post covers the broader surface.
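The span-tree diff can start as something very small. The sketch below assumes each replay has been flattened into an ordered list of `(activity_name, output_digest)` pairs; that representation and the `first_divergence` helper are assumptions made for illustration, not a traceAI export format.

```python
from itertools import zip_longest


def first_divergence(run_a, run_b):
    """Return (index, span_a, span_b) for the first differing span, or None.

    A missing span on either side (one replay ran longer) counts as a difference.
    """
    for index, (span_a, span_b) in enumerate(zip_longest(run_a, run_b)):
        if span_a != span_b:
            return index, span_a, span_b
    return None
```

The index points at the first activity whose output moved, which is where the Groundedness and Completeness comparison is worth spending judge calls.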

Run evaluation asynchronously, not in the workflow path

Calling Evaluator().evaluate(...) directly from inside an activity works in a demo and falls over in production: it adds eval latency to every run, ties workflow durability to eval-stack uptime, and inflates the workflow history with payloads that do not belong there.

The cleaner pattern: emit a Temporal signal at activity completion with the inputs an eval needs, drain the queue from a separate worker pool.

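from temporalio import activity
from temporalio.client import Client

# Assumptions: `temporal_client` is a connected Client set at worker startup,
# `llm` is your own summarization wrapper, and "eval-collector" is a hypothetical
# ID of a long-running workflow owned by the eval worker pool.
temporal_client: Client


@activity.defn
async def summarize_findings(context: list[str]) -> str:
    output = await llm.summarize(context)
    # Signal the eval workflow, not the production one, so eval payloads
    # stay out of the production workflow history.
    eval_handle = temporal_client.get_workflow_handle("eval-collector")
    await eval_handle.signal(
        "eval_request",
        {
            "activity": "summarize_findings",
            "workflow_id": activity.info().workflow_id,
            "input": context, "output": output,
            "templates": ["Completeness", "Groundedness"],
        },
    )
    return output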

Scores attach back to the original activity span through EvalTag, so the trace tree carries both execution and verdict even though they were produced minutes or hours apart in different processes. The four distributed runners in the ai-evaluation SDK (Celery, Ray, Temporal, Kubernetes) include Temporal explicitly, so the eval suite itself runs as a durable workflow alongside the production workload. One operational surface, two roles.

Cost ceiling and promotion gating

Every LLM call inside an activity routes through Agent Command Center. The gateway response carries x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, x-prism-routing-strategy. Attach them as activity-span attributes; sum x-prism-cost across the tree at workflow completion. Five-level budgets (org, project, workflow type, user, request) stop a runaway workflow before a retry loop burns a month’s budget on one bad input.
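Summing cost across the tree is a short recursion once the headers are stored as span attributes. In this sketch the nested-dict span shape (`attributes`, `children`) is an assumption for illustration, as are the `total_cost` and `over_budget` helpers; only the `x-prism-cost` attribute name comes from the text, and its value is assumed to parse as a number of dollars.

```python
def total_cost(span: dict) -> float:
    """Sum x-prism-cost over a span and all of its descendants."""
    own = float(span.get("attributes", {}).get("x-prism-cost", 0.0))
    return own + sum(total_cost(child) for child in span.get("children", []))


def over_budget(span: dict, ceiling_usd: float) -> bool:
    """True when the whole tree costs more than the ceiling."""
    return total_cost(span) > ceiling_usd
```

Running `over_budget` at workflow completion gives the per-workflow check; enforcing the ceiling mid-run is the gateway's job through the budget levels described above.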

For new prompt or model selections, the gateway’s shadow and mirror routes are the promotion gate. Shadow runs the new version against real inputs without surfacing the output; mirror runs both and serves the old. Both feed the same eval batch. The new version promotes when its Layer-1 through Layer-4 scores meet or exceed the baseline on a dataset frozen weekly from production traces. The LLM evaluation playbook covers dataset cadence and threshold logic.
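The promotion rule is a per-layer comparison against the baseline. The sketch below assumes each version has already been reduced to one aggregate score per layer on the frozen dataset; the `regressions` and `promotes` helpers and the layer key names are hypothetical.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Layers where the candidate scores below the baseline or has no score."""
    return sorted(
        layer
        for layer, score in baseline.items()
        if candidate.get(layer, float("-inf")) < score
    )


def promotes(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    """Promote only when every baseline layer is met or exceeded."""
    return not regressions(baseline, candidate)
```

Treating a missing layer score as a regression is a deliberate choice here: a candidate that was never scored on signal handling should not promote by default.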

Closed loop: Error Feed clusters by failure layer

Failed workflows flow into Error Feed. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failures into named clusters. For Temporal agents the clusters are layer-shaped, because that is the level the rubrics scored at:

  • Per-activity. “Vendor-lookup TOOL activity returns ambiguous match, agent picks the first hit instead of asking for disambiguation.”
  • Retry-budget. “Quote-request activity consumes four of five retries before timing out; budget should fail fast.”
  • Signal-handler. “User clarification signal updates exclusion list but next compare_quotes turn ignores it.”
  • Replay-equivalence. “Workflow calling uuid.uuid4() instead of workflow.uuid4() inside the workflow body produces different vendor ordering on replay.”

A Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation per cluster across eight span-tools (Haiku Chauffeur for spans over 3000 characters; prompt-cache hit near 90 percent). It writes a 5-category 30-subtype taxonomy entry, a 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; each scored 1 to 5), and an immediate_fix naming the activity-prompt edit, retry-policy adjustment, or signal-handler tightening to ship today.

Honest scope. The trace-stream-to-agent-opt connector is roadmap — it would auto-promote a failing replay trace into the agent-opt dataset and re-run BayesianSearchOptimizer on the offending prompt. Eval-driven prompt optimization on individual activities ships today through the six optimizers in agent-opt (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) plus EarlyStoppingConfig, if the dataset is assembled from Error Feed clusters by hand. Linear is the only ticket integration today; Slack, GitHub, Jira, PagerDuty are on the roadmap.

How Future AGI ships this

Five surfaces, one loop, no separate products to glue.

  • ai-evaluation SDK (Apache 2.0). Evaluator, 60-plus EvalTemplate classes (TaskCompletion, LLMFunctionCalling, AnswerRefusal, Groundedness, ContextAdherence, Completeness). CustomLLMJudge carries RetryBudgetCorrectness, SignalHandlerCorrectness, and any activity-specific rubric. Four distributed runners (Celery, Ray, Temporal, Kubernetes) so the eval suite itself runs as a durable workflow alongside production.
  • traceAI (Apache 2.0). register(...) configures the tracer provider; per-framework instrumentors (OpenAIInstrumentor, AnthropicInstrumentor, LangChainInstrumentor, plus 50-plus more) hook the LLM SDK each activity uses. Combined with TracingInterceptor on the client and worker, the workflow + activity + LLM-call tree reconstructs natively. EvalTag attaches rubrics server-side at zero inline latency.
  • Future AGI Platform. Self-improving evaluators tuned by thumbs-up/down feedback from production traces, in-product custom-rubric authoring, classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside with HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, and the immediate_fix artifact.
  • Agent Command Center. Go binary self-hosts in your VPC for the LLM calls underneath every activity. 100-plus providers, 18-plus guardrail scanners, exact and semantic caching, five-level budgets, shadow and mirror routes for promotion gating. SOC 2 Type II, HIPAA, GDPR, CCPA certified; ISO/IEC 27001 in active audit.
  • agent-opt (Apache 2.0). Six optimizers consuming the layer rubrics as the objective. Each activity prompt is a separate study target, with shared EarlyStoppingConfig so a winning tweak on compare_quotes is not masked by a flat identify_part.

Honest tradeoff: if the workload is one activity and a tool registry, a lighter tracer plus a hand-rolled TaskCompletion rubric is enough. The four-layer stack earns its weight when the workflow is real — child workflows, multi-day signals, retry policies that matter, production traffic.

What to do this week

One workflow, end to end. Five steps.

  1. Instrument one worker. TracingInterceptor() on the client; register(...) plus the per-framework LLM instrumentor on the worker. Set session.id to workflow_id. Confirm the tree reconstructs in the FAGI UI.
  2. Score the workflow outcome. One TaskCompletion at the terminal signal, run by an async eval worker subscribed to a signal queue.
  3. Add one per-activity template per activity kind. Two or three rubrics matched to the activity kind is enough to start.
  4. Add RetryBudgetCorrectness and SignalHandlerCorrectness. The highest-impact rubrics most teams aren’t running. The first surfaces silent cost burn; the second catches the procurement-bug pattern at the top of this post.
  5. Turn on Error Feed. Read the immediate_fix text for the top three clusters. That is your prompt-optimization backlog for the week.

Temporal gives you replay. The four-layer rubric set turns replay into a debugging tool, one activity at a time.

Frequently asked questions

Why does Temporal change the agent eval game?
Temporal turns an agent workflow into a deterministic, replayable state machine. Workflow code is deterministic by contract, non-determinism lives inside activities, and every run is reconstructible from the workflow history. That property flips eval from a flaky black-box exercise into a reproducible one. The same input that triggered a bad outcome last Tuesday replays identically in CI today, the same activity that retried five times before succeeding replays with the same retry trace, and a regression on a rubric can be bisected to the activity that moved. The eval surface that matches Temporal is four-layered: per-activity correctness, workflow-level outcome, retry-budget enforcement, and signal-handler correctness. Anything less leaves the most expensive failures unscored.
What is the four-layer rubric set for a Temporal workflow?
First, per-activity correctness scored with the template that matches the activity kind: LLMFunctionCalling for TOOL activities, Groundedness and ContextAdherence for RETRIEVER activities, Completeness for AGENT activities, AnswerRefusal where refusal is a valid action. Second, workflow-level outcome scored with TaskCompletion at the workflow's terminal signal. Third, retry-budget correctness scored with a CustomLLMJudge that reads the retry sequence and decides whether each retry was warranted. Fourth, signal-handler correctness scored when an external signal arrives mid-workflow: did the handler interpret the signal correctly, did it preserve the prior state, did the agent continue on the right branch. The four layers run as independent rubrics so a regression points at the specific layer that moved.
How do you instrument a Temporal worker for traceAI?
Two pieces. The temporalio.contrib.opentelemetry TracingInterceptor goes on the client and the worker so workflow and activity spans emit through OpenTelemetry. The fi_instrumentation register call configures the FAGI tracer provider and installs the per-framework instrumentor for whichever LLM SDK each activity uses (OpenAIInstrumentor, AnthropicInstrumentor, LangChainInstrumentor, etc.). After both are wired, the workflow span becomes the root, every activity opens its own child span, the LLM calls inside activities attach as grandchildren, and session.id is set to workflow_id so a multi-day run reconstructs in one tree. Replay produces the same span tree.
How do you score retry-budget correctness?
Define a RetryBudgetCorrectness rubric as a CustomLLMJudge. The rubric reads the activity input, the sequence of failure reasons, the retry policy, and the eventual outcome. It scores 1.0 when retries were warranted and resolved a transient failure on the expected attempt, 0.5 when the activity eventually succeeded but consumed more of the retry budget than the failure type warranted (wasted compute on a permanent failure that should have surfaced earlier), and 0.0 when the agent gave up on a recoverable failure or looped through the full budget on a permanent one. Score every activity that retried at least once. An activity that consistently consumes four or more of its five retries is the signal that the underlying prompt, tool spec, or downstream API needs work, not the retry policy.
How do you evaluate signal handlers specifically?
Signal handlers are the most under-evaluated surface in Temporal agents because they fire asynchronously and produce no obvious response to score. The pattern: when a signal reaches its @workflow.signal handler, open a SIGNAL_HANDLER span with the signal name, payload, and a snapshot of workflow state before and after the handler runs. A SignalHandlerCorrectness rubric (CustomLLMJudge) then scores three things on the next workflow turn: did the handler interpret the signal payload correctly, did it preserve invariants from the prior state, and did the agent continue on the branch the signal implies. A user clarification signal that says exclude vendor X should suppress vendor X for the rest of the workflow, not just the next turn. The rubric catches both handlers that drop prior constraints and agents that ignore updated state on the next decision.
Should evaluation block the workflow or run asynchronously?
Asynchronously, almost always. Blocking a workflow on an LLM-as-judge call adds latency and cost to every run and ties workflow durability to eval-stack durability. The cleaner pattern is to emit the eval inputs as a Temporal signal at activity completion and have a separate eval worker pool consume the queue. Scores attach back to the original activity span through EvalTag so the trace tree carries both execution and verdict. Gating decisions (promoting a new prompt version, allowing a high-cost activity to run) stay synchronous on a shadow path before promotion. Production runs stay asynchronous. The four distributed runners in the ai-evaluation SDK include Temporal explicitly, so the eval suite itself can run as a durable workflow.
What does Future AGI ship for Temporal eval today versus what is roadmap?
Shipping today: the temporalio.contrib.opentelemetry TracingInterceptor wired through traceAI's tracer provider so workflow, activity, and signal spans flow into FAGI; the ai-evaluation SDK with TaskCompletion, LLMFunctionCalling, Groundedness, ContextAdherence, Completeness, AnswerRefusal, plus CustomLLMJudge for retry-budget and signal-handler rubrics; four distributed runners (Celery, Ray, Temporal, Kubernetes) so the eval suite can itself run on Temporal; Agent Command Center routing every LLM call inside an activity with five-level budgets and x-prism-cost headers; Error Feed inside the eval stack with HDBSCAN clustering and Sonnet 4.5 Judge writing immediate_fix per cluster; Linear OAuth as the ticket sink. Roadmap: the trace-stream-to-agent-opt connector that would auto-promote failing replay traces into the prompt-optimization dataset; Slack, GitHub, Jira, and PagerDuty integrations for Error Feed.