What Is the F1 Score?
A classifier metric defined as the harmonic mean of precision and recall, returning a single 0–1 value that penalises imbalanced trade-offs.
F1 score is a foundational classification metric defined as the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall). It returns a single 0–1 value that penalises imbalanced trade-offs — a model with 0.95 precision and 0.20 recall scores about 0.33, not the 0.575 that a simple average would suggest. F1 is the right summary metric whenever you care about both false positives and false negatives, including span-level extraction, retrieval relevance, intent classification, and LLM safety classifiers in production.
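To see the penalty concretely, the harmonic mean from the definition above can be checked in a few lines of plain Python (the f1 helper below is just the formula, not a library call):
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.95, 0.20))       # ~0.33, the harmonic mean
print((0.95 + 0.20) / 2)    # 0.575, the misleading arithmetic mean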
Why It Matters in Production LLM and Agent Systems
Accuracy lies on imbalanced data. A safety classifier with 99.2% accuracy on a stream where 0.5% of inputs are toxic might be detecting nothing — predicting “safe” for everything gets 99.5% accuracy. F1 is the metric that exposes that. It collapses to zero when either precision or recall collapses, so you cannot hide a broken model behind a tilted class distribution.
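A minimal sketch of that failure mode, assuming a synthetic stream with a 0.5% positive rate and a classifier that predicts the negative class for everything:
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stream: 1,000 inputs, 5 of them toxic (0.5% positive rate).
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000   # "safe" for everything

print(accuracy_score(y_true, y_pred))             # 0.995, looks excellent
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0, the model detects nothing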
The pain shows up across LLM use cases that involve a positive class. A PII-detection classifier that flags 100% of inputs has perfect recall and terrible precision — F1 will be near zero. A retrieval reranker that returns one extremely relevant chunk out of ten possible has perfect precision and 10% recall — F1 will be 0.18. A function-name extractor that misses half of the calls in a transcript looks fine on accuracy because most of the transcript is non-call text — F1 anchored to the call class shows the real picture.
In 2026 LLM stacks, F1 is the right default for guardrail evaluation, intent routing, span-level NER on agent transcripts, and any classifier whose negative class dwarfs the positive. Multi-class settings extend it to micro-F1 (aggregate over all instances) and macro-F1 (average per class) — the latter is the right choice when you want every class to count equally even if one is rare. Without an F1-anchored regression, you ship classifiers that look improved on accuracy and have actually regressed on the class your users care about.
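The micro-versus-macro difference is easy to demonstrate with sklearn on a made-up three-class intent set where the rare class is handled badly (the labels below are illustrative only):
from sklearn.metrics import f1_score

y_true = ["faq"] * 8 + ["order"] * 8 + ["refund"] * 2
y_pred = ["faq"] * 8 + ["order"] * 8 + ["order"] * 2   # every rare "refund" intent missed

print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # ~0.89, dominated by the frequent classes
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.63, the refund failure drags it down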
How FutureAGI Handles F1 Score
FutureAGI’s approach is to expose F1 as an aggregation over per-row classifier evaluators rather than as a separate metric class — because F1 is meaningful only across a cohort, never on a single sample. The path is: pick a classifier-style evaluator from fi.evals that returns a per-row label or 0/1 score, attach it to a Dataset via Dataset.add_evaluation(), and aggregate the dataset results into precision, recall, and F1 per class. The canonical evaluators are fi.evals.Equals (strict label equality), fi.evals.FuzzyMatch (graded label match, useful when references are noisy), and fi.evals.GroundTruthMatch (the cloud-template classifier evaluator with built-in normalisation). For ranked retrieval, fi.evals.PrecisionAtK and fi.evals.RecallAtK cover the ranked-list variants directly.
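Once the per-row results are exported as (reference label, predicted label) pairs, the aggregation step is a few lines of plain Python. The helper below is a sketch; the function name and row format are illustrative, not part of the SDK:
from collections import Counter

def per_class_f1(pairs):
    # pairs: iterable of (true_label, predicted_label) from per-row evaluator results.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in pairs:
        if pred == truth:
            tp[truth] += 1
        else:
            fp[pred] += 1    # the predicted class collects a false positive
            fn[truth] += 1   # the true class collects a false negative
    scores = {}
    for cls in set(tp) | set(fp) | set(fn):
        p = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
        r = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
        scores[cls] = 2 * p * r / (p + r) if p + r else 0.0
    return scores   # macro-F1 is the mean of these per-class values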
Concretely: a customer-intent-routing team using traceAI-openai runs GroundTruthMatch on a 12k-row labelled dataset across 24 intent classes. They compute per-class precision and recall, derive macro-F1, and alert when any class drops below 0.78. When a prompt change shifts macro-F1 from 0.86 to 0.81, they drill into per-class F1, find the regression on one rare class (“escalate-to-supervisor”), and ship a few-shot example targeting that class. Compared with sklearn’s f1_score() (offline-only, no traceability), the FutureAGI workflow stores the per-row labels in the dataset, so a regression in F1 is one click away from the offending traces. FutureAGI’s approach is to make F1 a derived view over reproducible per-row evaluations, not a black-box scalar.
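A sketch of the per-class alerting step in that workflow, reusing the per_class_f1 helper above; the 0.78 floor comes from the example, the function name is illustrative:
def classes_below_floor(per_class_scores, floor=0.78):
    # Name the regressed classes instead of averaging them away in macro-F1.
    return sorted(cls for cls, score in per_class_scores.items() if score < floor)

scores = per_class_f1(pairs)                  # pairs exported from the labelled dataset
macro_f1 = sum(scores.values()) / len(scores)
print(macro_f1, classes_below_floor(scores))  # e.g. 0.81, ["escalate-to-supervisor"]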
How to Measure or Detect It
Measurement signals tied to F1:
- Per-class F1 from fi.evals.Equals or GroundTruthMatch — aggregate per-row labels into TP/FP/FN, derive precision, recall, and F1 per class, and threshold per class.
- Macro-F1 dashboard signal — the average of per-class F1 across all classes; the right summary when class importance is uniform regardless of frequency.
- Confusion matrix — the per-cell TP/FP/FN/TN counts F1 is derived from; pair it with the F1 alert so you see which class is dropping.
- PrecisionAtK / RecallAtK — the ranked-retrieval cousins; combine them into F1@K when reranker quality is the target (see the short F1@K sketch after the Python snippet below).
Minimal Python (aggregate F1 from per-row evaluator results):
from fi.evals import GroundTruthMatch
from sklearn.metrics import f1_score

metric = GroundTruthMatch()
y_true, y_pred = [], []
for row in dataset:  # rows of the labelled Dataset: model output plus reference label
    res = metric.evaluate(response=row.output,
                          expected_response=row.label)
    y_true.append(row.label)
    # A match counts as the reference class; any non-match falls into a catch-all bucket,
    # so it registers as a false negative for the true class.
    y_pred.append(row.label if res.score == 1.0 else "OTHER")
print(f1_score(y_true, y_pred, average="macro"))
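For the ranked-retrieval signal in the list above, F1@K is the same harmonic mean applied to the Precision@K and Recall@K values. The numbers below reuse the reranker example; how you obtain P@K and R@K from fi.evals.PrecisionAtK and RecallAtK is up to your pipeline:
p_at_k, r_at_k = 1.0, 0.10   # the reranker example: perfect precision, 10% recall
f1_at_k = 2 * p_at_k * r_at_k / (p_at_k + r_at_k)
print(round(f1_at_k, 2))     # 0.18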
Common Mistakes
- Reporting micro-F1 when classes are imbalanced. Micro-F1 collapses to accuracy on imbalanced datasets and hides regression on minority classes — use macro-F1 instead.
- Computing F1 on a single sample. F1 is a cohort metric; per-row scores are precision/recall components, not F1 values themselves.
- Confusing the F1 of an LLM judge with the F1 of the underlying classifier. When the judge is itself imperfect, your F1 is bounded by the judge’s agreement with humans; calibrate the judge first.
- Treating F1 as universally applicable. For ranked retrieval, use Precision@K, Recall@K, NDCG; for open-ended generation, use embedding similarity or judge-model rubrics. F1 is a classifier metric.
- Skipping the per-class breakdown. A flat macro-F1 hides which class regressed; always emit per-class F1 alongside the aggregate.
Frequently Asked Questions
What is the F1 score?
F1 score is the harmonic mean of precision and recall — F1 = 2 · (precision · recall) / (precision + recall) — returning a 0–1 classifier metric that penalises imbalanced trade-offs and is the default summary metric when you care about both false positives and false negatives.
How is F1 score different from accuracy?
Accuracy measures the fraction of all predictions that are correct, which is misleading on imbalanced datasets — a 99% accurate model on a 1% positive class might just be predicting the negative class. F1 ignores true negatives and focuses on the positive class, so a trivial majority-class predictor scores near zero instead of looking excellent.
How do you measure F1 score?
Compute precision and recall from the confusion matrix, then F1 = 2 · (precision · recall) / (precision + recall). FutureAGI evaluators like Equals, GroundTruthMatch, and FuzzyMatch produce per-row labels you can aggregate into precision, recall, and F1 across cohorts.