F1 Score in 2026: Formula, Macro vs Micro vs Weighted, When to Use It, and Working Sklearn Code
TL;DR: F1 Score at a glance
| What | Formula | Use it when |
|---|---|---|
| Precision | TP / (TP + FP) | Cost of false positive is high |
| Recall | TP / (TP + FN) | Cost of false negative is high |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Both costs matter and classes are imbalanced |
| F-beta | (1 + beta^2) * P * R / (beta^2 * P + R) | You need to weight precision vs recall |
| Macro F1 | Mean of per-class F1 | Every class matters equally |
| Micro F1 | F1 on pooled TP, FP, FN | Overall correctness; equals accuracy on multi-class single-label |
| Weighted F1 | Per-class F1 weighted by class frequency | Larger classes should count more |
F1 is one metric in a broader confusion-matrix toolbox. Use it for imbalanced classification and report it next to precision, recall, the confusion matrix, and where probabilities matter, AUC-ROC or AUC-PR.
Why F1 still matters in 2026
The F1 score has been around since the early years of information retrieval. The math hasn’t changed; the surrounding stack has. In 2026, classification problems show up in three new contexts that all need a balanced precision-recall metric:
- LLM-as-classifier. Prompt an LLM to label intents, route tickets, gate content, or score retrieval relevance. Score with F1 against a labeled test set.
- Imbalanced-by-default. Fraud, abuse, safety incidents, customer churn, RAG retrieval relevance. All commonly imbalanced. Accuracy can be misleading; F1 is more informative when both error types matter.
- Calibrated thresholds. Many production classifiers benefit from calibration methods such as Platt scaling or isotonic regression before thresholding. F1 at the calibrated threshold is the production-realistic number.
This post is the working reference. The formula, the variants, when each applies, the Sklearn code, and where it fits into a modern LLM-classification pipeline.
Confusion matrix building blocks
Every classification metric reduces to four counts:
- True Positives (TP): model predicted positive, ground truth is positive.
- False Positives (FP): model predicted positive, ground truth is negative. Type I error.
- False Negatives (FN): model predicted negative, ground truth is positive. Type II error.
- True Negatives (TN): model predicted negative, ground truth is negative.
Precision and recall are ratios of these counts.
Precision
Precision = TP / (TP + FP)
Precision answers: of everything you flagged positive, what fraction was correct. High precision means few false alarms. Spam filtering, content moderation, and any user-facing classification where wrong flags cause real friction prioritize precision.
Recall
Recall = TP / (TP + FN)
Recall (also called sensitivity or true positive rate) answers: of all the actual positives, what fraction did you catch. High recall means few misses. Disease screening, fraud detection, and any safety-critical pipeline prioritize recall.
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is the overall fraction of correct predictions. On imbalanced data it is misleading because a degenerate classifier that always predicts the majority class gets a high accuracy without doing anything useful. A 99% accurate fraud detector that says “not fraud” to everyone has 0% recall.
Specificity
Specificity = TN / (TN + FP)
Specificity is the true-negative rate. Useful in safety-critical settings (alert systems, medical screening) where false alarms are costly. F1 does not summarize specificity; use it alongside F1 when it matters.
The F1 formula
F1 is the harmonic mean of precision and recall.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean has a useful property: it is dominated by the smaller of the two inputs. A model with precision 0.99 and recall 0.10 has an F1 of about 0.18, not 0.55 (the arithmetic mean). That is the point. F1 forces you to be good at both.
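Checking those numbers directly, using nothing beyond the formula above:

```python
p, r = 0.99, 0.10

f1 = 2 * p * r / (p + r)   # harmonic mean: dominated by the smaller input
arith = (p + r) / 2        # arithmetic mean, for contrast

print(f"F1 = {f1:.3f}, arithmetic mean = {arith:.3f}")  # F1 = 0.182
```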
Worked example
Given:
- TP = 40
- FP = 10
- FN = 20
Then:
Precision = 40 / (40 + 10) = 0.80
Recall = 40 / (40 + 20) = 0.667
F1 = 2 * (0.80 * 0.667) / (0.80 + 0.667) = 1.067 / 1.467 = 0.727
An F1 of 0.73 says the model is balanced on the positive class. Whether 0.73 is good depends entirely on the baseline and the use case.
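The same arithmetic as a runnable snippet, straight from the counts above:

```python
TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```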
When to use F1, and when not to
Use F1 when
- Classes are imbalanced (one class is much rarer than the other). Accuracy can be misleading; F1 is often more useful when precision and recall both matter.
- Both false positives and false negatives have real cost. F1 forces you to optimize for both.
- You want a single number to compare models. F1 is more discriminating than accuracy on the cases that matter.
- You’re scoring an LLM classifier. Prompted classifiers produce categorical outputs; F1 is useful, but choose binary, macro, weighted, or per-class F1 based on the label structure.
Do not use F1 when
- Classes are balanced and the costs are roughly symmetric. Accuracy is simpler.
- Business clearly cares about one side of the error. Report precision or recall directly, or pick F-beta with the appropriate beta.
- You are choosing a threshold. AUC-PR or AUC-ROC summarize ranking quality across thresholds. F1 only summarizes one threshold.
- You’re doing ranking, retrieval, or recommendation. Use MAP, NDCG, MRR, or recall@k instead.
F1 variants
F-beta
The general form weights precision and recall asymmetrically:
F_beta = (1 + beta^2) * Precision * Recall / (beta^2 * Precision + Recall)
- beta = 1 recovers F1.
- beta > 1 (for example F2) gives recall more weight. Use for fraud detection, screening, abuse classification.
- beta < 1 (for example F0.5) gives precision more weight. Use for spam filtering, content moderation gates, legal-review precision targets.
F-beta is more honest than picking F1 and pretending it is symmetric when the business cost is not.
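scikit-learn ships this as fbeta_score. A small sketch on made-up labels where precision exceeds recall, so the ordering F0.5 > F1 > F2 falls out directly:

```python
from sklearn.metrics import fbeta_score, f1_score

# Made-up labels: precision = 0.667, recall = 0.500 (precision > recall).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)      # recall-heavy: lowest here
f05 = fbeta_score(y_true, y_pred, beta=0.5)   # precision-heavy: highest here

print(f"F0.5={f05:.3f}  F1={f1:.3f}  F2={f2:.3f}")
```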
Macro F1
Compute F1 separately for each class, then average:
Macro F1 = mean(F1_class_1, F1_class_2, ..., F1_class_K)
Every class counts equally. Use macro F1 when minority classes matter as much as majority classes (rare-disease detection, low-frequency intent classification).
Micro F1
Pool TP, FP, FN across all classes, then compute one F1:
Micro F1 = F1(sum_TP, sum_FP, sum_FN)
On multi-class single-label problems, micro F1 equals accuracy. It weights each example equally rather than each class. Use micro F1 when overall correctness matters more than per-class fairness.
Weighted F1
Compute per-class F1, then average weighted by the number of true instances of each class:
Weighted F1 = sum(n_class_k * F1_class_k) / total_n
Use weighted F1 when larger classes are more important (revenue-weighted classification, frequency-weighted routing).
Picking among macro, micro, weighted
- Every class matters equally: macro F1.
- Bigger classes matter more: weighted F1.
- Per-example correctness matters: micro F1 (equals accuracy on single-label multi-class).
Most production teams report all three plus per-class precision and recall in the model card.
F1 in production: thresholds and calibration
The F1 of a probabilistic classifier depends on the threshold you use to convert scores into class labels. A classifier with great AUC-PR can still have a poor F1 if the threshold is chosen badly.
Two things matter in production:
- Calibrate the probabilities before thresholding. Use Platt scaling (logistic calibration) or isotonic regression. For a calibrated classifier, a predicted probability of 0.8 corresponds to an empirical positive rate of roughly 80%.
- Pick the threshold on a held-out calibration set, not on the test set. Search thresholds in [0.01, 0.99] and pick the one maximizing F1 (or F-beta) on the calibration set. Then report F1 on the test set at that fixed threshold.
Skipping calibration and threshold selection is one of the most common reasons production F1 lags offline F1.
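A minimal sketch of that selection loop. The scores here are synthetic (a hypothetical calibrated model whose positives tend to score higher); in a real pipeline they would come from your calibrated classifier on the held-out calibration set:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 2000
y_cal = rng.binomial(1, 0.2, size=n)  # 20% positive rate

# Synthetic calibration scores: positives shifted upward relative to negatives.
scores = np.clip(0.35 * y_cal + rng.normal(0.3, 0.15, n), 0.0, 1.0)

# Sweep thresholds on the calibration set, keep the F1-maximizing one.
thresholds = np.arange(0.01, 1.0, 0.01)
f1s = [f1_score(y_cal, (scores >= t).astype(int), zero_division=0) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]

print(f"best threshold={best_t:.2f}, calibration F1={max(f1s):.3f}")
# Freeze best_t here, then report F1 on the untouched test set at this threshold.
```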
Implementing F1 in Python
Scikit-learn (BSD-3 license, scikit-learn.org) is the standard reference implementation.
```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")
```
For multi-class problems, sklearn.metrics.f1_score takes an average argument: 'binary' (default for two-class), 'macro', 'micro', 'weighted', or None (returns one F1 per class).
```python
from sklearn.metrics import f1_score, classification_report

y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 2, 0, 1, 2]

print("Macro F1: ", f1_score(y_true, y_pred, average="macro"))
print("Micro F1: ", f1_score(y_true, y_pred, average="micro"))
print("Weighted F1: ", f1_score(y_true, y_pred, average="weighted"))
print()
print(classification_report(y_true, y_pred, digits=3))
```
The classification_report output gives per-class precision, recall, F1, and support in one block. This is what most production model cards include.
F1 for LLM classification tasks
LLMs prompted as classifiers (intent classification, content moderation, support routing, retrieval relevance, label assignment) are evaluated with the same precision-recall-F1 toolbox as traditional classifiers, plus an LLM-specific concern: the LLM may produce malformed outputs.
A pragmatic LLM-classification eval pipeline:
```python
import json
from sklearn.metrics import f1_score, classification_report

def call_llm_classifier(prompt: str, text: str) -> str:
    """Stub. Wire to your real LLM call (OpenAI, Anthropic, etc.)."""
    return "{}"  # placeholder

def parse_label(raw: str, classes: list[str]) -> str | None:
    """Extract a valid class label from raw LLM output, or None on failure."""
    try:
        obj = json.loads(raw)
        label = obj.get("label")
        return label if label in classes else None
    except json.JSONDecodeError:
        return None

CLASSES = ["billing", "technical", "account", "other"]
texts = [
    "I was charged twice for the same month.",
    "My password reset email never arrived.",
    "How do I delete my account?",
    "Can you ship to Canada?",
]
y_true = ["billing", "account", "account", "other"]

PROMPT = (
    "Classify the support message into one of: billing, technical, account, other. "
    "Respond with JSON: {\"label\": \"<class>\"}."
)

y_pred = []
parse_failures = 0
for text in texts:
    raw = call_llm_classifier(PROMPT, text)
    label = parse_label(raw, CLASSES)
    if label is None:
        parse_failures += 1
        label = "other"  # or some default fallback
    y_pred.append(label)

print(f"Parse failures: {parse_failures}/{len(texts)}")
print(classification_report(y_true, y_pred, labels=CLASSES, digits=3))
print("Macro F1: ", f1_score(y_true, y_pred, average="macro"))
```
print("Macro F1: ", f1_score(y_true, y_pred, average="macro"))
Two things to track for an LLM classifier that you do not track for a traditional classifier:
- Format-failure rate. What fraction of outputs failed to parse into a valid class. A classifier with a 5% format-failure rate has hidden metric degradation (across precision, recall, or both, depending on how you handle the fallback) before you even compare predictions.
- Instruction-adherence score. Did the model follow the prompt format, refuse cases it was not supposed to refuse, or pick a class outside the allowed list? Future AGI’s `evaluate(eval_templates="prompt_adherence", ...)` is one way to score this alongside F1.
Pairing format checks with classical F1 is the difference between “the LLM is bad” and “the LLM is good but the prompt is wrong.”
Common mistakes when using F1
- Reporting F1 in isolation. Always report precision, recall, the confusion matrix, and class support alongside F1. The four together tell a story F1 alone cannot.
- Using accuracy on imbalanced data. A 99%-accurate fraud detector with 0% recall is not a useful model. Switch to F1 or recall.
- Comparing F1 across different thresholds. F1 depends on the threshold. Pick the threshold on a calibration set, hold it fixed, then compare.
- Confusing micro F1 with accuracy. On multi-class single-label problems they are equal. On multi-label problems they differ.
- Ignoring per-class F1. A macro F1 of 0.80 across five classes can hide one class at 0.40. Always look at per-class F1 before declaring victory.
- Optimizing for F1 when the business is asymmetric. If false negatives are 10x more expensive than false positives, F1 is the wrong loss. Use F2 or report recall as the primary metric.
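One of these is easy to verify directly: on the single-label multi-class example used earlier in the post, micro F1 and accuracy coincide.

```python
from sklearn.metrics import f1_score, accuracy_score

y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 2, 0, 1, 2]

micro = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

print(f"micro F1 = {micro:.3f}, accuracy = {acc:.3f}")  # both 0.667
```

On a multi-label problem (where one example can carry several labels), the two would diverge.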
Where Future AGI fits
F1 is a classical metric and you do not need a managed platform to compute it. Where Future AGI helps is the layer above: when an LLM is the classifier and you need to score not just whether the label is right but whether the model followed the prompt, refused appropriately, leaked PII, or hallucinated a class outside the allowed list.
```python
from fi.evals import evaluate

result = evaluate(
    eval_templates="prompt_adherence",
    inputs={
        "input": "Classify the support message into: billing, technical, account, other.",
        "output": '{"label": "billing"}',
    },
    model_name="turing_flash",
)
print(result.eval_results[0].metrics[0].value, result.eval_results[0].reason)
```
Pair prompt_adherence with a per-class F1 score from scikit-learn and you cover both the labeling correctness and the output-shape correctness in one evaluation pass. The Future AGI Agent Command Center at /platform/monitor/command-center surfaces failing live runs so you can pull them into the next regression set.