What Is a Monitor Threshold?
A monitor threshold is the configured value an observed metric must cross to trigger an alert, page, or automated action. In an LLM or agent observability stack it gates signals like p99 latency, eval-fail-rate, drift score, cost-per-trace, or refusal rate. The threshold has three parts: the metric, the comparator (above, below, delta), and the action (notify, page, rollback). In a FutureAGI workflow you configure thresholds per evaluator (Groundedness < 0.7) and per OTel attribute (llm.token_count.completion > 4000), with each one wired to a notification channel and a runbook owner.
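A minimal sketch of that three-part anatomy in Python; the Threshold class and field names here are illustrative, not FutureAGI's schema:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Threshold:
    metric: str                          # e.g. "groundedness" or "llm.token_count.completion"
    crossed: Callable[[float], bool]     # the comparator: above, below, delta
    action: str                          # "notify", "page", or "rollback"

# Groundedness < 0.7 notifies; completion tokens > 4000 page
thresholds = [
    Threshold("groundedness", lambda v: v < 0.7, "notify"),
    Threshold("llm.token_count.completion", lambda v: v > 4000, "page"),
]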
Why It Matters in Production LLM and Agent Systems
A dashboard without thresholds is a museum exhibit — pretty, but no one is going to act on it at 2am. The whole point of monitoring is to convert observability data into a stream of signals that a human or an automated pipeline can act on. Thresholds are the gate. Set them too tight and the team learns to ignore the pager. Set them too loose and a regression sits in production for a week before anyone notices users churning over it.
The pain is concrete. An ML engineer ships a prompt change at 6pm. The eval-fail-rate climbs from 2% to 8% over the next two hours, but no one looks at the dashboard until the next morning; by then 40K bad responses are out. An SRE chasing a latency spike finds the p99 alarm was set at 5s while the actual SLO is 1.5s, so the alert never fired. A safety lead is asked why a jailbreak attempt was not caught and discovers the PromptInjection evaluator was emitting scores above 0.9, but the threshold was set at 0.99.
In 2026-era stacks the threshold count explodes. Each agent step, each tool, each retrieval source, each model variant, each user cohort can have its own threshold. The discipline is no longer “set a few good alerts” — it is “manage a threshold catalog as an artifact”. FutureAGI’s monitoring layer treats every threshold as a versioned config tied to an OTel attribute or evaluator class, so it can be reviewed, tuned, and retired like any other code.
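What a threshold-as-versioned-artifact might look like as reviewable config; every field name here is hypothetical, but the shape follows the description above:

# Hypothetical catalog entry, version-controlled and reviewed like code
catalog_entry = {
    "id": "rag-groundedness-page",
    "version": 3,
    "metric": "eval.groundedness.score",  # tied to an evaluator class or OTel attribute
    "comparator": "<",
    "value": 0.5,
    "action": "page",
    "channel": "#oncall-rag",
    "runbook": "https://runbooks.example.com/groundedness-drop",  # placeholder URL
    "owner": "rag-team",
}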
How FutureAGI Handles Monitor Thresholds
FutureAGI’s approach wires thresholds directly into the surfaces that emit the metric:
- Evaluator thresholds: when you attach an evaluator like Groundedness to a production trace stream, you also attach a threshold (< 0.7 → warning, < 0.5 → page). The evaluator score is written back as a span_event, and the monitoring layer compares it on every trace.
- Attribute thresholds: any OTel attribute can carry a threshold. llm.token_count.completion > 4000 flags overspend, agent.trajectory.step > 15 flags potential infinite loops, tts_first_token_latency > 800ms flags voice-agent regressions (approximated in the sketch after this list).
- Cohort-aware thresholds: thresholds can be sliced by user.cohort, model.name, or route, so a threshold tightens for a critical route and loosens for an experimental one.
- Notification fan-out: each threshold maps to a notification channel and a runbook URL, so the alert lands with context, not as a dashboard mystery.
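Outside the platform, the attribute-threshold idea can be approximated with a plain OpenTelemetry span processor. This is a sketch of the comparison logic only, not FutureAGI's internal wiring; the print stands in for a real notification channel:

from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

class AttributeThresholdProcessor(SpanProcessor):
    # Illustrative thresholds on the attributes named above
    THRESHOLDS = {
        "llm.token_count.completion": (4000, "overspend"),
        "agent.trajectory.step": (15, "possible infinite loop"),
    }

    def on_end(self, span: ReadableSpan) -> None:
        # Compare every finished span's attributes against the limits
        for attr, (limit, reason) in self.THRESHOLDS.items():
            value = span.attributes.get(attr)
            if value is not None and value > limit:
                print(f"ALERT [{reason}]: {attr}={value} > {limit} on span '{span.name}'")

Register it via your tracer provider's add_span_processor and every span is checked as it ends.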
Concretely: a RAG team running on traceAI-langchain attaches Groundedness and ContextRelevance evaluators to 5% of production traces. They configure a Groundedness < 0.65 threshold to warn and < 0.5 to page. When a vector-store reindex breaks chunking on a Saturday morning, the page fires within 20 minutes, the on-call engineer opens the trace view, and the broken cohort is visible — the threshold did the job that a dashboard alone cannot.
How to Measure or Detect It
Thresholds turn metrics into signals; pick the ones that match your reliability budget:
- Per-evaluator threshold (config): a numeric cutoff on the score returned by an fi.evals evaluator.
- Per-attribute threshold (config): a numeric cutoff on an OTel attribute like llm.token_count.completion or latency.ms.
- Threshold-fire-rate (dashboard): the count of threshold breaches per hour by metric; a rising rate means a noisy threshold or an actual regression (a tuning sketch follows this list).
- False-positive ratio (dashboard): the fraction of fires that turned out to be benign — used to retune thresholds.
- Mean-time-to-acknowledge (operational): how fast on-call responds to a fired threshold; tracks alert quality and runbook clarity.
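A back-of-the-envelope tuning loop over fired alerts, assuming you log each fire with a timestamp, the metric, and the on-call triage verdict; the record shape is illustrative:

from collections import Counter
from datetime import datetime

# Each record: (fired_at, metric, was_benign) as triaged by on-call
fires = [
    (datetime(2026, 1, 10, 14, 5), "groundedness", False),
    (datetime(2026, 1, 10, 14, 40), "groundedness", True),
    (datetime(2026, 1, 10, 15, 2), "latency.p99", True),
]

# Threshold-fire-rate: breaches per metric per hour
per_hour = Counter((m, t.replace(minute=0, second=0)) for t, m, _ in fires)

# False-positive ratio: fraction of fires triaged as benign
fp_ratio = sum(benign for *_, benign in fires) / len(fires)
print(per_hour, f"false-positive ratio: {fp_ratio:.0%}")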
Minimal Python:
from fi.evals import Groundedness

g = Groundedness(threshold=0.7)  # scores < 0.7 are marked failed

# Sample inputs so the snippet runs standalone
q, ans, ctx = "What is the refund window?", "Refunds within 30 days.", "Policy: refunds accepted within 30 days."

result = g.evaluate(input=q, output=ans, context=ctx)
print(result.score, result.passed)
Common Mistakes
- Setting thresholds on raw mean. A mean masks tail failures; set thresholds on p95/p99 or on per-cohort scores so you catch regressions in slices that matter.
- Forgetting to retune after model swaps. A threshold calibrated on gpt-4o may not hold on claude-sonnet; rebaseline after every model change.
- One threshold per metric, no severities. A single line creates pager fatigue; split into warn / page / rollback so the response matches the severity.
- No runbook attached to the threshold. An alert without a runbook is a guessing game at 3am — link the threshold to the steps the on-call should take.
- Setting thresholds without a baseline. “Latency > 1s” only means something if you know the p99 baseline; always anchor on the observed distribution (a quick sketch follows this list).
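A quick way to anchor on the observed distribution before committing to a number; nearest-rank percentile in pure Python, with invented sample values:

# Latency samples (ms) from the route you are about to guard
latencies_ms = sorted([220, 240, 255, 260, 270, 290, 300, 310, 980, 1450])

def percentile(sorted_values, p):
    k = max(0, round(p / 100 * len(sorted_values)) - 1)  # nearest-rank index
    return sorted_values[k]

p99 = percentile(latencies_ms, 99)
threshold_ms = p99 * 1.2  # page only on a 20% deviation from baseline
print(p99, threshold_ms)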
Frequently Asked Questions
What is a monitor threshold?
A monitor threshold is the value at which an observed metric — latency, eval-fail-rate, drift, cost — triggers an alert, page, or automated rollback action.
How is a monitor threshold different from a metric threshold?
A metric threshold is the more general term: any cutoff on a metric. A monitor threshold is the same idea wired into an observability tool's alerting layer, with a notification channel, owner, and runbook attached.
How do you set a good monitor threshold?
Anchor on production baseline (p99 of last 30 days), pick a deviation that would matter, and tune for a low-but-non-zero false-positive rate. FutureAGI lets you set thresholds per evaluator and per OTel attribute.