What Is Mean Absolute Error (MAE)?
Mean absolute error averages absolute numeric prediction errors, keeping the result in the same unit as the target value.
Mean absolute error (MAE) is an evaluation metric that averages the absolute difference between a numeric prediction and the expected numeric value. In LLM and agent evaluation, it belongs to numeric-output evals: price estimates, durations, risk scores, extracted quantities, or routed confidence scores. MAE shows up in offline eval pipelines and production trace dashboards when teams need error in the original unit, not squared units. FutureAGI tracks it as a custom scalar beside task evaluators and cohort slices.
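As a quick worked example with illustrative numbers: if a model predicts 102 USD and 47 USD where the labeled values are 100 USD and 52 USD, the absolute errors are 2 and 5, so MAE = (2 + 5) / 2 = 3.5 USD.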
Why Mean Absolute Error matters in production LLM and agent systems
MAE catches numeric failures that text-only evals can miss. A benefits assistant may explain a policy correctly but estimate the reimbursement amount 18 dollars too low. A sales agent may choose the right pricing tool but copy the wrong quantity into the quote. A planning agent may predict a delivery window that is consistently off by two days. The response can pass answer relevancy and even sound grounded while the numeric output still breaks the workflow.
Ignoring MAE creates two common failure modes. First, teams ship numeric drift because each miss looks small in isolation, but the average error moves after a model, prompt, retriever, or tool change. Second, teams use pass/fail tolerances too early and lose signal about how wrong the system was. A 1-dollar miss and a 90-dollar miss both fail a strict exact check, but they should trigger different debugging paths.
The pain shows up differently by owner. Developers see flaky extractions and unstable scoring. SREs see more manual overrides, retries, and longer agent runs. Product teams see users correcting numbers in downstream forms. Compliance teams care when the numeric value drives eligibility, pricing, medical triage, or financial advice.
In the multi-step systems of 2026, MAE should be traced at the step that produced the number. One bad extraction can feed a calculator, a router, and a final explanation, so a final-answer score alone is too late.
How FutureAGI handles mean absolute error
FutureAGI’s approach is to treat MAE as a custom numeric eval, not as a generic quality score. The FutureAGI evaluator inventory does not expose a dedicated MeanAbsoluteError class, so engineers usually implement MAE with CustomEvaluation and store row-level abs_error plus aggregate mae in the dataset evaluation result. That keeps the metric explicit and prevents it from being confused with semantic quality, groundedness, or task completion.
A practical workflow starts with a numeric-output dataset: user request, model output, expected value, unit, model version, prompt version, and route. The engineer parses the numeric response, computes abs(response_value - expected_value), then records both the row error and cohort-level MAE through the eval pipeline. If the run came from a LangChain agent instrumented through the langchain traceAI integration, the same trace can carry llm.token_count.prompt, model name, tool span, and agent.trajectory.step, which makes it clear whether the bad number came from retrieval, extraction, calculation, or final formatting.
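A minimal sketch of that per-slice bookkeeping, assuming the numeric response has already been parsed and normalized; the field names and cohort keys are illustrative, not a FutureAGI schema:
from collections import defaultdict

def cohort_mae(rows):
    # rows: dicts carrying a parsed numeric output, the labeled value, and a slice key
    errors_by_cohort = defaultdict(list)
    for row in rows:
        abs_error = abs(row["response_value"] - row["expected_value"])
        errors_by_cohort[row["cohort"]].append(abs_error)
    return {cohort: sum(errs) / len(errs) for cohort, errs in errors_by_cohort.items()}

rows = [
    {"cohort": "invoice/us", "response_value": 101.5, "expected_value": 100.0},
    {"cohort": "invoice/intl-tax", "response_value": 47.0, "expected_value": 54.5},
]
print(cohort_mae(rows))  # {'invoice/us': 1.5, 'invoice/intl-tax': 7.5}
Keeping the row-level abs_error values next to the aggregate means a later regression can be traced back to specific examples instead of a single moving average.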
Unlike scikit-learn’s mean_absolute_error, which is usually called as an offline aggregate, FutureAGI workflows keep MAE attached to prompts, datasets, traces, and release gates. A team might set mae <= 2.00 USD for invoice extraction, then block a release when the global MAE is acceptable but the “international tax” slice jumps from 1.20 to 7.80. The next action is not a broad rollback by default; it is a targeted regression eval on that slice, followed by prompt repair, tool-schema cleanup, or human annotation.
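A release gate over slice-level MAE could look like the following sketch; the thresholds and slice names are assumptions, not platform defaults:
GATE = {"global": 2.00, "invoice/intl-tax": 2.00}  # max acceptable MAE in USD (illustrative)

def release_blocked(mae_by_slice):
    # Block when any gated slice exceeds its threshold, even if the global number looks fine.
    return [s for s, limit in GATE.items() if mae_by_slice.get(s, 0.0) > limit]

print(release_blocked({"global": 1.40, "invoice/intl-tax": 7.80}))  # ['invoice/intl-tax']
The check names only the regressed slice, which matches the targeted-regression path described above rather than a broad rollback.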
How to measure or detect mean absolute error
Measure MAE only after normalizing units, currencies, timestamps, and numeric formats. Then keep both row-level and aggregate signals:
- Row absolute error — abs(response_value - expected_value); inspect this before averaging so outlier examples stay visible.
- MAE-by-cohort dashboard — segment by model, prompt version, tool route, language, unit, tenant, and dataset slice.
- CustomEvaluation — FutureAGI evaluator surface for recording the custom scalar and row metadata in an eval run.
- Trace signals — compare MAE with llm.token_count.prompt, p99 latency, retry rate, and tool-error rate when numeric drift appears.
- User-feedback proxy — corrected-number rate, manual override rate, refund adjustment rate, or escalation rate for rows without labels.
Minimal Python (the rows are illustrative; in a FutureAGI run the resulting scalar would be recorded through CustomEvaluation):
from collections import namedtuple

EvalRow = namedtuple("EvalRow", ["predicted", "expected"])

def mae(rows):
    # Average absolute error, in the same unit as the target value.
    errors = [abs(row.predicted - row.expected) for row in rows]
    return sum(errors) / len(errors)

eval_rows = [EvalRow(101.5, 100.0), EvalRow(47.0, 52.0)]  # parsed numeric outputs vs labels
print(mae(eval_rows))  # 3.25
Pair MAE with metric thresholds that match the business unit. A two-cent miss may be noise in invoice estimation and unacceptable in tax calculation.
Common mistakes
Engineers usually misuse MAE by losing the unit, the slice, or the failure magnitude:
- Averaging mixed units. Minutes, dollars, percentages, and risk scores need separate MAE series or normalized targets.
- Dropping row-level errors. The aggregate hides whether one extreme miss or many small misses caused the regression.
- Using MAE for categorical outputs. Labels, enum values, and tool names need exact match, confusion matrix, precision, or recall.
- Ignoring signed error. MAE hides bias direction; also track mean error when overestimation and underestimation have different costs, as in the sketch after this list.
- Comparing after prompt-format changes. A new extraction instruction can change parsing behavior even when the underlying model is unchanged.
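A small sketch of the signed-error point above, with illustrative error lists: two systems can share the same MAE while one is consistently biased low.
def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def mean_error(errors):
    # Signed average: a negative value means the system tends to underestimate.
    return sum(errors) / len(errors)

underestimating = [-2.0, -3.0, -1.0]  # always low
mixed = [2.0, -3.0, 1.0]              # misses in both directions
print(mae(underestimating), mean_error(underestimating))  # 2.0 -2.0
print(mae(mixed), mean_error(mixed))                      # 2.0 0.0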
Frequently Asked Questions
What is mean absolute error?
Mean absolute error is an eval metric that averages the absolute difference between numeric predictions and expected values. It reports error in the original unit, which makes it easy to explain to engineers, product teams, and reviewers.
How is MAE different from RMSE?
MAE averages absolute errors, while RMSE squares errors before averaging and then takes the square root. RMSE penalizes large misses more strongly; MAE treats each extra unit of error linearly.
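A quick worked comparison with illustrative errors of 1, 1, 1, and 9 in the same unit: MAE = (1 + 1 + 1 + 9) / 4 = 3.0, while RMSE = sqrt((1 + 1 + 1 + 81) / 4) = sqrt(21) ≈ 4.58, so the single large miss dominates RMSE but not MAE.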
How do you measure MAE in FutureAGI?
FutureAGI teams usually implement MAE as a CustomEvaluation over numeric response and expected_response fields. They monitor MAE by dataset slice, model version, prompt version, and trace cohort.