What Is F-Score?
F-score is the harmonic mean of precision and recall, used to summarize a classifier’s performance with a single number when both false positives and false negatives matter. The most common variant is F1 (equal weight on precision and recall). F-beta is the generalization: F2 weights recall higher (when missing a positive is worse), F0.5 weights precision higher (when a false positive is worse). It is the default scalar for binary and multi-class classification, including LLM-era tasks: prompt-injection detection, intent classification, content-safety, and entity extraction. In FutureAGI, F1 sits alongside precision and recall as a per-label metric attached to a Dataset.
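A minimal sketch of the underlying math, using only the standard F-beta definition (nothing here is FutureAGI-specific):
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 favors recall.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Same precision/recall pair, three weightings of the trade-off.
p, r = 0.90, 0.60
print(f_beta(p, r, beta=1.0))   # F1   ~0.72, balanced
print(f_beta(p, r, beta=2.0))   # F2   ~0.64, pulled toward recall (0.60)
print(f_beta(p, r, beta=0.5))   # F0.5 ~0.82, pulled toward precision (0.90)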
Why It Matters in Production LLM and Agent Systems
Reporting accuracy alone on imbalanced data is the canonical wrong move. A prompt-injection classifier that always says “safe” can hit 99% accuracy on production traffic where 1% of inputs are actual injections — and miss every real attack. F-score forces the conversation about which class the system is actually catching. The pain shows up across roles. A security team sees “99% accuracy” headlines and discovers later that recall on the attack class is 4%. A product manager picks a content-safety model on accuracy and discovers it under-flags toxic content because the toxic class is rare. An ML engineer tunes a classifier threshold without checking that precision and recall move in opposite directions and ships the wrong trade-off.
Common production symptoms trace back to class imbalance: high accuracy with poor recall on the rare-but-important class; precision collapsing as the team chases recall; aggregate F1 rising while one minority class gets worse. None of these show up in a single accuracy number.
In 2026-era stacks, F-score still matters wherever an LLM is being used as a classifier — and that is increasingly common. Pre-guardrail injection detection, intent routing, content-safety pre-filters, and tool selection are all classification problems with imbalanced classes. Multi-class F1 (macro or weighted) is the default scalar for these, and per-class precision/recall is the diagnostic for which class is failing.
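A small sketch of why the averaging mode matters, using plain scikit-learn on an illustrative two-class example (the labels and counts are made up to mirror the 99%-accuracy scenario above):
from sklearn.metrics import accuracy_score, f1_score

# 99 safe inputs and 1 injection; the classifier predicts "safe" for everything.
labels = ["safe"] * 99 + ["injection"]
preds = ["safe"] * 100

print(accuracy_score(labels, preds))                                 # 0.99
# Macro-F1 averages per-class F1 equally, so the missed class halves it.
print(f1_score(labels, preds, average="macro", zero_division=0))     # ~0.50
# Weighted-F1 weights by class frequency, so the miss barely registers.
print(f1_score(labels, preds, average="weighted", zero_division=0))  # ~0.98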
How FutureAGI Handles F-Score
FutureAGI’s approach is to make precision, recall, and F1 first-class signals for classifier-style evaluations on Dataset runs. For label-based evaluation, attach RecallScore for retrieval recall, plus standard precision/recall/F1 calculation against ground-truth labels stored in the Dataset; results are stored per-label and per-cohort. For LLM-judge classification, CustomEvaluation wraps a judge prompt that returns a category, and the resulting confusion matrix produces precision, recall, and F1 per class. For security-flavored classification, PromptInjection and ProtectFlash produce per-input scores; thresholding them on a labeled Dataset yields precision/recall/F1 the security team can use to set the production threshold.
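A sketch of that thresholding step, assuming you have exported per-input scores and ground-truth labels from a labeled Dataset run; the arrays below are illustrative and the curve math is plain scikit-learn:
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative per-input injection scores and ground truth (1 = real injection);
# in practice these come from the labeled Dataset run described above.
scores = np.array([0.02, 0.10, 0.35, 0.40, 0.80, 0.95])
y_true = np.array([0, 0, 0, 1, 1, 1])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Highest threshold that still meets the security team's recall floor, and the
# precision paid for it (the curve's final point has no associated threshold).
recall_floor = 0.95
ok = np.where(recall[:-1] >= recall_floor)[0]
i = ok[-1]
print(f"threshold={thresholds[i]:.2f} precision={precision[i]:.2f} recall={recall[i]:.2f}")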
A practical pattern: a content-moderation team is comparing a frontier classifier with a cheaper open-weight option. They run both against a labeled Dataset of 5,000 production-shaped examples (3% toxic), compute per-class precision, recall, and F1, and dashboard the macro-F1 plus per-class recall on the toxic class. The cheaper model has higher accuracy (98.4% vs 98.1%) but lower toxic-class recall (0.61 vs 0.78). They keep the frontier model on high-risk routes and use Agent Command Center's cost-optimized routing to send low-risk routes through the cheaper one, gated by a post-guardrail. Unlike a single accuracy headline, the per-class F-scores made the cost/quality trade-off explicit.
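A sketch of that side-by-side comparison, with tiny stand-in arrays in place of the 5,000-row Dataset and the two models' predictions:
from sklearn.metrics import f1_score, recall_score

# Stand-ins for the labeled Dataset and the two models' predictions over it.
labels = ["ok", "ok", "ok", "toxic", "toxic"]
preds_frontier = ["ok", "ok", "ok", "toxic", "toxic"]
preds_cheap = ["ok", "ok", "ok", "toxic", "ok"]

for name, preds in [("frontier", preds_frontier), ("cheaper", preds_cheap)]:
    macro = f1_score(labels, preds, average="macro", zero_division=0)
    toxic_recall = recall_score(labels, preds, labels=["toxic"],
                                average=None, zero_division=0)[0]
    print(f"{name}: macro-F1={macro:.2f} toxic-class recall={toxic_recall:.2f}")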
How to Measure or Detect It
F-score is itself the measurement; these are the surrounding signals you also need:
- Precision: TP / (TP + FP); reported per class.
- Recall: TP / (TP + FN); reported per class.
- F1 / F-beta: harmonic mean; macro or weighted across classes.
- Confusion matrix: the underlying TP/FP/FN/TN counts that all of these derive from.
- Per-cohort F1 (dashboard signal): F1 sliced by user cohort, locale, or route — global F1 hides cohort-level regressions (see the per-cohort sketch after the minimal Python below).
Minimal Python:
# fi.evals' RecallScore covers retrieval recall; the per-class classification
# metrics below come from standard confusion-matrix math via scikit-learn.
from sklearn.metrics import classification_report

# `dataset` and `classify` are stand-ins for the labeled Dataset rows and the classifier under test.
labels = [row.label for row in dataset]
preds = [classify(row.input) for row in dataset]

# Per-class precision, recall, and F1 for every label in the Dataset.
print(classification_report(labels, preds))
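For the per-cohort signal above, a sketch of the slicing, assuming each Dataset row carries a cohort field (`row.cohort` is an illustrative name, not a documented attribute):
from collections import defaultdict
from sklearn.metrics import f1_score

# Group (label, prediction) pairs by cohort; the cohort key can be locale,
# route, or user tier, depending on what the Dataset stores.
by_cohort = defaultdict(lambda: ([], []))
for row, pred in zip(dataset, preds):
    by_cohort[row.cohort][0].append(row.label)
    by_cohort[row.cohort][1].append(pred)

# Macro-F1 per cohort; a cohort well below the global number is the regression
# the aggregate F1 hides.
for cohort, (y_true, y_pred) in sorted(by_cohort.items()):
    print(cohort, f1_score(y_true, y_pred, average="macro", zero_division=0))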
Common Mistakes
- Reporting accuracy on imbalanced classes. A 99%-accurate classifier that misses every minority-class example is worse than no classifier on the task that matters.
- Optimizing F1 by tuning threshold without rechecking precision and recall separately. F1 can hide a major precision drop with a recall gain that wasn’t worth it.
- Using macro-F1 when class importance differs. Weighted-F1 or per-class minimums make the trade-off explicit.
- Treating F0.5 and F2 as exotic. They exist because real systems care about precision or recall asymmetrically — pick the right beta on purpose.
- No per-cohort breakdown. Aggregate F1 hides the cohorts where the classifier degrades; always slice.
Frequently Asked Questions
What is F-score?
F-score is the harmonic mean of precision and recall, used to summarize a classifier's performance with a single number when both false positives and false negatives matter — most commonly F1, with equal weight.
How is F-score different from accuracy?
Accuracy is correct predictions over total predictions and breaks on imbalanced classes — a 99% always-negative classifier has 99% accuracy on a 1% positive class. F-score penalizes both missed positives and false positives, so it stays meaningful under imbalance.
How do you compute F-score in FutureAGI?
FutureAGI's fi.evals exposes precision, recall, and F1 for label-based evaluations on Datasets. For NLI-style checks, FactualConsistency returns a related score plus reason.