What Is Mean Absolute Percentage Error?

Mean absolute percentage error reports average absolute prediction error as a percentage of actual values.

Mean absolute percentage error (MAPE) is a model-evaluation metric that measures average absolute prediction error as a percentage of the actual value. It is used for numeric model outputs, forecasts, estimates, and agent-generated quantities in an eval pipeline or production trace. FutureAGI teams treat MAPE as a scale-aware reliability signal, not a universal accuracy score, because it becomes undefined at zero actual values and can overstate errors for low-volume rows.

Why it matters in production LLM/agent systems

Numeric misses fail loudly when model outputs drive business action. A support-planning agent that predicts ticket backlog, a finance copilot that estimates cloud spend, or a logistics assistant that forecasts delivery time can look accurate in prose while producing numbers that miss by 35 percent. If that miss is averaged away or reported only in raw units, product teams may ship a model that works for large accounts and fails for smaller cohorts.

The pain usually lands on three groups. Developers see brittle evals that pass on qualitative correctness but fail once a customer checks the number. SRE teams see escalations, retries, and manual overrides clustered around low-denominator cases. Product and finance teams see trust erosion because a percentage error is easier for users to notice than a phrasing issue.

Common symptoms include rising absolute-percentage-error p95, high eval-fail-rate-by-cohort for small accounts, thumbs-down comments that mention “wrong estimate,” and traces where tool outputs are correct but the final answer rounds, converts units, or copies the wrong value. In multi-step agent systems, MAPE is especially useful because error can enter through retrieval, tool selection, unit conversion, summarization, or final formatting. Unlike mean absolute error, MAPE shows whether a five-dollar miss matters more for a ten-dollar forecast than for a ten-thousand-dollar forecast.

How FutureAGI handles Mean Absolute Percentage Error

There is no dedicated MAPE anchor or named MAPE evaluator in the FutureAGI inventory, so the clean pattern is to attach it as a custom numeric metric to a dataset evaluation. For example, a SaaS finance agent predicts next-month invoice totals from CRM context, usage events, and a pricing tool. The dataset stores actual_value, predicted_value, account_tier, and trace_id. The evaluation computes abs(actual_value - predicted_value) / abs(actual_value) * 100 per row, then aggregates median, p90, and p95 MAPE by cohort.
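
The per-row computation and cohort rollup are plain dataframe work. A minimal sketch, assuming the dataset is exported to a pandas DataFrame with the column names above (the sample values are illustrative, not real eval data):

import pandas as pd

# Sample rows in the shape described above; values are illustrative.
df = pd.DataFrame({
    "trace_id": ["t1", "t2", "t3", "t4"],
    "account_tier": ["smb", "smb", "enterprise", "enterprise"],
    "actual_value": [120.0, 95.0, 10400.0, 8800.0],
    "predicted_value": [162.0, 88.0, 10150.0, 9050.0],
})

# Exclude zero actuals instead of averaging over them; MAPE is
# undefined there and silent inclusion hides a data-quality problem.
valid = df[df["actual_value"] != 0].copy()
valid["ape"] = (
    (valid["actual_value"] - valid["predicted_value"]).abs()
    / valid["actual_value"].abs() * 100
)

# Median, p90, and p95 absolute percentage error per cohort.
summary = valid.groupby("account_tier")["ape"].quantile([0.5, 0.9, 0.95]).unstack()
print(summary)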

FutureAGI’s approach is to treat MAPE as a business-scale metric beside built-in numeric and trace signals. CustomEvaluation is the right conceptual fit for the exact formula. NumericSimilarity is useful as the nearest built-in numeric-output check when percentage error is not the contract. GroundTruthMatch can catch exact target failures for deterministic outputs, while traceAI-langchain traces show where the number entered the chain through spans such as agent.trajectory.step and token fields such as llm.token_count.prompt.
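
Whichever evaluator carries the formula, the per-row record that makes failures debuggable is small. A sketch of one row's payload, with field names taken from the dataset and span names above (the values are illustrative):

row_eval = {
    "trace_id": "t1",                   # joins the score back to the agent trace
    "span": "agent.trajectory.step",    # where the number entered the chain
    "actual_value": 1250.0,
    "predicted_value": 1188.0,
    "ape": 4.96,                        # abs(1250.0 - 1188.0) / 1250.0 * 100
}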

The engineer’s next action depends on the failure shape. If p95 MAPE exceeds 40 percent for small accounts but median MAPE stays under 8 percent, the fix is not a model swap. It is a denominator guard, a cohort-specific threshold, or a fallback to the deterministic pricing tool. Unlike Ragas faithfulness, which checks whether text is supported by retrieved context, MAPE asks whether a numeric answer is close enough relative to the true value.
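
A sketch of that decision logic, with cohort names and thresholds treated as hypothetical values drawn from the example numbers above, not FutureAGI defaults:

# Hypothetical cohort thresholds; tune these against your own eval runs.
COHORT_P95_LIMITS = {"small": 40.0, "enterprise": 15.0}
DENOMINATOR_FLOOR = 1.0  # below this magnitude, percentage error is unreliable

def route(expected_magnitude: float, cohort: str, cohort_p95_mape: float) -> str:
    # Denominator guard: near-zero quantities make MAPE explode, so those
    # requests go straight to the deterministic pricing tool.
    if abs(expected_magnitude) < DENOMINATOR_FLOOR:
        return "deterministic_tool"
    # Cohort-specific threshold: fall back where the tail error is too high.
    if cohort_p95_mape > COHORT_P95_LIMITS.get(cohort, 25.0):
        return "deterministic_tool"
    return "model_forecast"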

How to measure or detect it

Use MAPE only when actual values are positive and the business cares about percentage error. For zero or near-zero actuals, switch to mean absolute error, symmetric MAPE, weighted absolute percentage error, or a custom denominator floor.
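
Minimal sketches of two of those alternatives, using their standard textbook formulas (no FutureAGI API assumed):

def smape(actual: float, predicted: float) -> float:
    # Symmetric MAPE: the denominator averages both magnitudes, so it stays
    # defined whenever either value is nonzero, and is bounded at 200.
    denom = (abs(actual) + abs(predicted)) / 2
    return abs(actual - predicted) / denom * 100 if denom else 0.0

def wape(actuals: list[float], predicteds: list[float]) -> float:
    # Weighted APE: total absolute error over total actual volume, so
    # tiny denominators cannot dominate the aggregate.
    return (
        sum(abs(a - p) for a, p in zip(actuals, predicteds))
        / sum(abs(a) for a in actuals) * 100
    )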

  • Formula: mean(abs(actual - predicted) / abs(actual)) * 100 across examples with valid actual values.
  • Dataset signal: report median, p90, and p95 MAPE by cohort, not only the global average.
  • Trace signal: attach trace_id, agent.trajectory.step, and the tool result that produced each numeric value.
  • Dashboard signal: alert on MAPE regression by release, prompt version, account tier, or tool route.
  • Feedback proxy: track thumbs-down rate and escalation rate for responses containing numeric forecasts.
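
A minimal version of the computation in code, with the denominator floor and the mean made explicit: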
from fi.evals import NumericSimilarity

def ape(actual, predicted):
    # Per-row absolute percentage error; the 1e-9 floor avoids
    # division by zero when the actual value is near zero.
    return abs(actual - predicted) / max(abs(actual), 1e-9) * 100

def mape(rows):
    # Mean of per-row errors over (actual, predicted) pairs.
    return sum(ape(a, p) for a, p in rows) / len(rows)

score = mape([(1250.0, 1188.0)])  # 4.96
nearest_builtin = NumericSimilarity  # nearest built-in numeric-closeness check

The snippet keeps the per-row and aggregate formulas explicit and imports NumericSimilarity only as the nearby FutureAGI evaluator class for numeric closeness checks.

Common mistakes

  • Averaging MAPE across rows with zero actual values. The metric is undefined there, so the report hides a data-quality problem.
  • Comparing models by global MAPE only. A good average can mask high p95 error for low-volume or regulated cohorts.
  • Treating MAPE as better than MAE in every case. Percentage error can overweight small denominators and distort model selection.
  • Using MAPE for signed quantities where negative actual values are meaningful. The percentage interpretation becomes hard to defend.
  • Letting an agent round intermediate values before evaluation. Compute MAPE from raw numeric outputs, then evaluate display formatting separately.

Frequently Asked Questions

What is mean absolute percentage error?

Mean absolute percentage error, or MAPE, is a numeric model-evaluation metric that reports average absolute prediction error as a percentage of actual values. It helps compare forecast error across different scales, but it is unstable when actual values are zero or very small.

How is MAPE different from mean absolute error?

Mean absolute error reports error in the original unit, such as dollars or minutes. MAPE reports error as a percentage, which is easier to compare across cohorts but more sensitive to small denominators.

How do you measure MAPE?

Compute absolute percentage error per example, average it over the dataset, and log the result as a custom evaluation in FutureAGI. Use NumericSimilarity or GroundTruthMatch alongside it when you also need built-in numeric closeness or exact target checks.