
F1 Score in 2026: Formula, Macro vs Micro vs Weighted, When to Use It, and Working Sklearn Code

F1 Score for classification in 2026: harmonic mean of precision and recall, the math, macro vs micro vs weighted, when to use it, and a sklearn code example.


TL;DR: F1 Score at a glance

| What | Formula | Use it when |
| --- | --- | --- |
| Precision | TP / (TP + FP) | Cost of false positive is high |
| Recall | TP / (TP + FN) | Cost of false negative is high |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Both costs matter and classes are imbalanced |
| F-beta | (1 + beta^2) * P * R / (beta^2 * P + R) | You need to weight precision vs recall |
| Macro F1 | Mean of per-class F1 | Every class matters equally |
| Micro F1 | F1 on pooled TP, FP, FN | Overall correctness; equals accuracy on multi-class single-label |
| Weighted F1 | Per-class F1 weighted by class frequency | Larger classes should count more |

F1 is one metric in a broader confusion-matrix toolbox. Use it for imbalanced classification and report it next to precision, recall, the confusion matrix, and where probabilities matter, AUC-ROC or AUC-PR.

Why F1 still matters in 2026

The F1 score has been around since the early years of information retrieval. The math hasn’t changed; the surrounding stack has. In 2026, classification problems show up in three new contexts that all need a balanced precision-recall metric:

  1. LLM-as-classifier. Prompt an LLM to label intents, route tickets, gate content, or score retrieval relevance. Score with F1 against a labeled test set.
  2. Imbalanced-by-default. Fraud, abuse, safety incidents, customer churn, RAG retrieval relevance. All commonly imbalanced. Accuracy can be misleading; F1 is more informative when both error types matter.
  3. Calibrated thresholds. Many production classifiers benefit from calibration methods such as Platt scaling or isotonic regression before thresholding. F1 at the calibrated threshold is the production-realistic number.

This post is the working reference. The formula, the variants, when each applies, the Sklearn code, and where it fits into a modern LLM-classification pipeline.

Confusion matrix building blocks

Every classification metric reduces to four counts:

  • True Positives (TP): model predicted positive, ground truth is positive.
  • False Positives (FP): model predicted positive, ground truth is negative. Type I error.
  • False Negatives (FN): model predicted negative, ground truth is positive. Type II error.
  • True Negatives (TN): model predicted negative, ground truth is negative.

Precision and recall are ratios of these counts.
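
A minimal sketch of those counts in plain Python, using small hand-made label lists (the labels here are illustrative, not tied to the examples later in the post); scikit-learn's confusion_matrix returns the same four numbers as a 2x2 matrix.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Count the four cells by pairing each prediction with its ground-truth label.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
print(tp, fp, fn, tn)  # 3 1 1 3

# scikit-learn lays the same counts out as [[TN, FP], [FN, TP]].
tn2, fp2, fn2, tp2 = confusion_matrix(y_true, y_pred).ravel()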

Precision

Precision = TP / (TP + FP)

Precision answers: of everything you flagged positive, what fraction was correct. High precision means few false alarms. Spam filtering, content moderation, and any user-facing classification where wrong flags cause real friction prioritize precision.

Recall

Recall = TP / (TP + FN)

Recall (also called sensitivity or true positive rate) answers: of all the actual positives, what fraction did you catch. High recall means few misses. Disease screening, fraud detection, and any safety-critical pipeline prioritize recall.

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is the overall fraction of correct predictions. On imbalanced data it is misleading because a degenerate classifier that always predicts the majority class gets a high accuracy without doing anything useful. A 99% accurate fraud detector that says “not fraud” to everyone has 0% recall.
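
A minimal demonstration of that claim, assuming a toy fraud dataset with 1% positives and a classifier that always predicts the majority class:

from sklearn.metrics import accuracy_score, recall_score, f1_score

# 990 legitimate transactions, 10 fraudulent ones, and a model that always says "not fraud".
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))             # 0.99
print(recall_score(y_true, y_pred))               # 0.0 -- every fraud case is missed
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0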

Specificity

Specificity = TN / (TN + FP)

Specificity is the true-negative rate. Useful in safety-critical settings (alert systems, medical screening) where false alarms are costly. F1 does not summarize specificity; report specificity alongside F1 when it matters.
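
Continuing the small sketch from the confusion-matrix section, all four ratio metrics fall straight out of those counts (tp, fp, fn, tn are the illustrative values computed above):

precision = tp / (tp + fp)                  # of everything flagged positive, the fraction that was correct
recall = tp / (tp + fn)                     # of all actual positives, the fraction caught
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct
specificity = tn / (tn + fp)                # true-negative rate, which F1 does not capture

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f} specificity={specificity:.2f}")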

The F1 formula

F1 is the harmonic mean of precision and recall.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

The harmonic mean has a useful property: it is dominated by the smaller of the two inputs. A model with precision 0.99 and recall 0.10 has an F1 of about 0.18, not 0.55 (the arithmetic mean). That is the point. F1 forces you to be good at both.

Worked example

Given:

  • TP = 40
  • FP = 10
  • FN = 20

Then:

  • Precision = 40 / (40 + 10) = 0.80
  • Recall = 40 / (40 + 20) = 0.667
  • F1 = 2 * (0.80 * 0.667) / (0.80 + 0.667) = 1.067 / 1.467 = 0.727

An F1 of 0.73 says the model is reasonably strong on both precision and recall for the positive class. Whether 0.73 is good depends entirely on the baseline and the use case.
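
A quick check of the worked example, plus the earlier harmonic-mean claim, using nothing beyond the formulas above:

def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Worked example: TP = 40, FP = 10, FN = 20.
p = 40 / (40 + 10)          # 0.80
r = 40 / (40 + 20)          # 0.667
print(round(f1(p, r), 3))   # 0.727

# The harmonic mean is dominated by the smaller input.
print(round(f1(0.99, 0.10), 3))  # 0.182, far below the arithmetic mean of 0.545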

When to use F1, and when not to

Use F1 when

  • Classes are imbalanced (one class is much rarer than the other). Accuracy can be misleading; F1 is often more useful when precision and recall both matter.
  • Both false positives and false negatives have real cost. F1 forces you to optimize for both.
  • You want a single number to compare models. F1 is more discriminating than accuracy on the cases that matter.
  • You’re scoring an LLM classifier. Prompted classifiers produce categorical outputs; F1 is useful, but choose binary, macro, weighted, or per-class F1 based on the label structure.

Do not use F1 when

  • Classes are balanced and the costs are roughly symmetric. Accuracy is simpler.
  • Business clearly cares about one side of the error. Report precision or recall directly, or pick F-beta with the appropriate beta.
  • You are choosing a threshold. AUC-PR or AUC-ROC summarize ranking quality across thresholds. F1 only summarizes one threshold.
  • You’re doing ranking, retrieval, or recommendation. Use MAP, NDCG, MRR, or recall@k instead.

F1 variants

F-beta

The general form weights precision and recall asymmetrically:

F_beta = (1 + beta^2) * Precision * Recall / (beta^2 * Precision + Recall)

  • beta = 1 recovers F1.
  • beta > 1 (for example F2) gives recall more weight. Use for fraud detection, screening, abuse classification.
  • beta < 1 (for example F0.5) gives precision more weight. Use for spam filtering, content moderation gates, legal-review precision targets.

F-beta is more honest than picking F1 and pretending it is symmetric when the business cost is not.
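
A short scikit-learn sketch of the beta knob, on illustrative labels where precision (0.75) is higher than recall (0.50) so the three scores separate:

from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]

print(f1_score(y_true, y_pred))               # 0.600 -- beta = 1, equal weight
print(fbeta_score(y_true, y_pred, beta=2))    # 0.536 -- recall-heavy, punished by the misses
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.682 -- precision-heavy, rewarded for few false alarms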

Macro F1

Compute F1 separately for each class, then average:

Macro F1 = mean(F1_class_1, F1_class_2, ..., F1_class_K)

Every class counts equally. Use macro F1 when minority classes matter as much as majority classes (rare-disease detection, low-frequency intent classification).

Micro F1

Pool TP, FP, FN across all classes, then compute one F1:

Micro F1 = F1(sum_TP, sum_FP, sum_FN)

On multi-class single-label problems, micro F1 equals accuracy. It weights each example equally rather than each class. Use micro F1 when overall correctness matters more than per-class fairness.

Weighted F1

Compute per-class F1, then average weighted by the number of true instances of each class:

Weighted F1 = sum(n_class_k * F1_class_k) / total_n

Use weighted F1 when larger classes are more important (revenue-weighted classification, frequency-weighted routing).

Picking among macro, micro, weighted

  • Every class matters equally: macro F1.
  • Bigger classes matter more: weighted F1.
  • Per-example correctness matters: micro F1 (equals accuracy on single-label multi-class).

Most production teams report all three plus per-class precision and recall in the model card.
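
A quick numerical check of the claims above on illustrative single-label multi-class data, with the per-class breakdown alongside:

from sklearn.metrics import accuracy_score, f1_score

# Three classes; class 2 is the smallest and the hardest for this toy prediction.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 0, 0]

print(f1_score(y_true, y_pred, average=None))        # per-class: [0.6, 0.667, 0.5]
print(f1_score(y_true, y_pred, average="macro"))     # 0.589 -- dragged down by class 2
print(f1_score(y_true, y_pred, average="weighted"))  # 0.59  -- class frequency as weights
print(f1_score(y_true, y_pred, average="micro"))     # 0.6   -- equals accuracy here
print(accuracy_score(y_true, y_pred))                # 0.6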

F1 in production: thresholds and calibration

The F1 of a probabilistic classifier depends on the threshold you use to convert scores into class labels. A classifier with great AUC-PR can still have a poor F1 if the threshold is chosen badly.

Two things matter in production:

  • Calibrate the probabilities before thresholding. Use Platt scaling (logistic calibration) or isotonic regression. In a calibrated classifier, predicted probabilities match the empirical positive rates.
  • Pick the threshold on a held-out calibration set, not on the test set. Search thresholds in [0.01, 0.99] and pick the one maximizing F1 (or F-beta) on the calibration set. Then report F1 on the test set at that fixed threshold, as in the sketch below.

Skipping calibration and threshold selection is one of the most common reasons production F1 lags offline F1.
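
A sketch of this recipe, assuming a scikit-learn workflow: synthetic imbalanced data and a logistic-regression base model stand in for your real classifier, isotonic regression handles calibration, and the threshold is searched on a held-out calibration split before F1 is reported on the untouched test split.

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary data (roughly 10% positives) standing in for real data.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# Train / calibration / test: the threshold is chosen on the calibration split
# and F1 is only ever reported on the test split.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Isotonic calibration over internal cross-validation folds.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
clf.fit(X_train, y_train)

# Search thresholds in [0.01, 0.99] for the one maximizing F1 on the calibration split.
cal_probs = clf.predict_proba(X_cal)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)
best_t = max(thresholds, key=lambda t: f1_score(y_cal, (cal_probs >= t).astype(int), zero_division=0))

# Report F1 on the test split at that fixed threshold.
test_probs = clf.predict_proba(X_test)[:, 1]
print(f"Chosen threshold: {best_t:.2f}")
print(f"Test F1:          {f1_score(y_test, (test_probs >= best_t).astype(int)):.3f}")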

Implementing F1 in Python

Scikit-learn (BSD-3 license, scikit-learn.org) is the standard reference implementation.

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")

For multi-class problems, sklearn.metrics.f1_score takes an average argument: 'binary' (default for two-class), 'macro', 'micro', 'weighted', or None (returns one F1 per class).

from sklearn.metrics import f1_score, classification_report

y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 2, 0, 1, 2]

print("Macro F1:    ", f1_score(y_true, y_pred, average="macro"))
print("Micro F1:    ", f1_score(y_true, y_pred, average="micro"))
print("Weighted F1: ", f1_score(y_true, y_pred, average="weighted"))
print()
print(classification_report(y_true, y_pred, digits=3))

The classification_report output gives per-class precision, recall, F1, and support in one block. This is what most production model cards include.

F1 for LLM classification tasks

LLMs prompted as classifiers (intent classification, content moderation, support routing, retrieval relevance, label assignment) are evaluated with the same precision-recall-F1 toolbox as traditional classifiers, plus an LLM-specific concern: the LLM may produce malformed outputs.

A pragmatic LLM-classification eval pipeline:

import json
from sklearn.metrics import f1_score, classification_report

def call_llm_classifier(prompt: str, text: str) -> str:
    """Stub. Wire to your real LLM call (OpenAI, Anthropic, etc.)."""
    return "{}"  # placeholder

def parse_label(raw: str, classes: list[str]) -> str | None:
    """Extract a valid class label from raw LLM output, or None on failure."""
    try:
        obj = json.loads(raw)
        label = obj.get("label")
        return label if label in classes else None
    except json.JSONDecodeError:
        return None

CLASSES = ["billing", "technical", "account", "other"]
texts = [
    "I was charged twice for the same month.",
    "My password reset email never arrived.",
    "How do I delete my account?",
    "Can you ship to Canada?",
]
y_true = ["billing", "account", "account", "other"]

PROMPT = (
    "Classify the support message into one of: billing, technical, account, other. "
    "Respond with JSON: {\"label\": \"<class>\"}."
)

y_pred = []
parse_failures = 0
for text in texts:
    raw = call_llm_classifier(PROMPT, text)
    label = parse_label(raw, CLASSES)
    if label is None:
        parse_failures += 1
        label = "other"  # or some default fallback
    y_pred.append(label)

print(f"Parse failures: {parse_failures}/{len(texts)}")
print(classification_report(y_true, y_pred, labels=CLASSES, digits=3))
print("Macro F1:    ", f1_score(y_true, y_pred, average="macro"))

Two things to track for an LLM classifier that you do not track for a traditional classifier:

  1. Format-failure rate. What fraction of outputs failed to parse into a valid class. A classifier with a 5% format-failure rate has hidden metric degradation (across precision, recall, or both, depending on how you handle the fallback) before you even compare predictions.
  2. Instruction-adherence score. Did the model follow the prompt format, refuse cases it was not supposed to refuse, or pick a class outside the allowed list. Future AGI’s evaluate(eval_templates="prompt_adherence", ...) is one way to score this alongside F1.

Pairing format checks with classical F1 is the difference between “the LLM is bad” and “the LLM is good but the prompt is wrong.”

Common mistakes when using F1

  1. Reporting F1 in isolation. Always report precision, recall, the confusion matrix, and class support alongside F1. The four together tell a story F1 alone cannot.
  2. Using accuracy on imbalanced data. A 99%-accurate fraud detector with 0% recall is not a useful model. Switch to F1 or recall.
  3. Comparing F1 across different thresholds. F1 depends on the threshold. Pick the threshold on a calibration set, hold it fixed, then compare.
  4. Confusing micro F1 with accuracy. On multi-class single-label problems they are equal. On multi-label problems they differ.
  5. Ignoring per-class F1. A macro F1 of 0.80 across five classes can hide one class at 0.40. Always look at per-class F1 before declaring victory.
  6. Optimizing for F1 when the business is asymmetric. If false negatives are 10x more expensive than false positives, F1 is the wrong loss. Use F2 or report recall as the primary metric.

Where Future AGI fits

F1 is a classical metric and you do not need a managed platform to compute it. Where Future AGI helps is the layer above: when an LLM is the classifier and you need to score not just whether the label is right but whether the model followed the prompt, refused appropriately, leaked PII, or hallucinated a class outside the allowed list.

from fi.evals import evaluate

result = evaluate(
    eval_templates="prompt_adherence",
    inputs={
        "input": "Classify the support message into: billing, technical, account, other.",
        "output": '{"label": "billing"}',
    },
    model_name="turing_flash",
)

print(result.eval_results[0].metrics[0].value, result.eval_results[0].reason)

Pair prompt_adherence with a per-class F1 score from scikit-learn and you cover both the labeling correctness and the output-shape correctness in one evaluation pass. The Future AGI Agent Command Center at /platform/monitor/command-center surfaces failing live runs so you can pull them into the next regression set.

Frequently asked questions

How is the F1 Score calculated?
F1 is the harmonic mean of precision and recall. First compute precision as TP divided by (TP + FP), and recall as TP divided by (TP + FN). Then F1 equals 2 times precision times recall, divided by the sum of precision and recall. The harmonic mean penalizes models that are strong on one of the two and weak on the other, so a high F1 requires both precision and recall to be high.
When should you use the F1 Score?
Use F1 when both false positives and false negatives are costly and the classes are imbalanced. Fraud detection, medical diagnosis, intrusion detection, and minority-class document classification are canonical use cases. Avoid F1 alone when the classes are balanced and accuracy already tells the right story, or when the business clearly cares about one side of the error far more than the other (in which case report precision or recall directly, or pick an asymmetric F-beta).
What are macro, micro, and weighted F1?
Macro F1 computes F1 per class and averages the results, weighting each class equally. Micro F1 pools true positives, false positives, and false negatives across all classes before computing one F1; on a multi-class single-label problem it equals accuracy. Weighted F1 averages per-class F1 weighted by the number of true instances of each class. Pick macro when minority classes matter equally, weighted when class frequency reflects business value, and micro when overall correctness matters more than per-class fairness.
Can the F1 Score be misleading?
Yes. F1 ignores true negatives, so it does not summarize specificity. It also collapses precision and recall into a single number, hiding tradeoffs. On heavily imbalanced data the macro F1 can be dragged down by a single underperforming class, while weighted F1 can hide poor minority-class performance. Always report F1 alongside precision, recall, the confusion matrix, and where probabilities matter, AUC-ROC or AUC-PR.
What is the F-beta score?
F-beta is a generalization of F1 where a parameter beta shifts the weight between precision and recall. F1 is F-beta with beta equal to 1. When beta is greater than 1, recall counts more, useful for fraud detection or medical diagnosis where missing a positive is expensive. When beta is less than 1, precision counts more, useful for spam filtering or any case where false positives have a high cost.
What changed in F1 reporting since 2025?
The math is unchanged because F1 is a fixed formula. What changed is the surrounding stack. Imbalanced-learning toolkits like imbalanced-learn are commonly used alongside scikit-learn, calibration of probabilities through Platt scaling or isotonic regression is now common before thresholding into F1, and LLM-classification pipelines started reporting F1 against held-out, contamination-checked test sets rather than the public benchmark splits because public splits are increasingly suspected of leakage.
How does F1 apply to LLM classification tasks?
When an LLM is prompted to label inputs (intent classification, content moderation, support routing, retrieval relevance), you treat the output as a classifier and compute precision, recall, and F1 against a labeled test set. Use macro F1 across classes when minority labels matter equally, and use a Future AGI evaluator (instruction adherence plus a custom LLM-as-judge metric) to check whether the LLM is failing because of a labeling mistake or a non-compliant output format.
What is the difference between F1, accuracy, and AUC-ROC?
Accuracy is the fraction of correct predictions across all classes; it is misleading on imbalanced data. F1 is the harmonic mean of precision and recall and ignores true negatives, so it focuses on positive-class performance. AUC-ROC is threshold-independent and summarizes the ranking quality of the probabilistic classifier across thresholds. Use F1 when you have a fixed decision threshold, AUC-ROC when you are choosing a threshold, and accuracy when the classes are balanced.