What Is the FinBen Domain-Specific Benchmark?

An open finance-domain LLM benchmark covering extraction, analysis, QA, generation, risk, forecasting, and decision-making tasks across English and Chinese corpora.

FinBen is a public domain-specific benchmark for evaluating large language models on finance tasks. It bundles datasets across information extraction, textual analysis, question answering, text generation, risk management, forecasting, and decision-making in both English and Chinese. Released in 2024 by the FinNLP / PIXIU community and extended through 2025-2026, it gives finance-focused teams a unified yardstick when saturated general benchmarks fail to distinguish whether a model can actually read a 10-K, classify credit risk, or summarize an earnings call.

By May 2026, MMLU, MT-Bench, and GSM8K are all saturated for frontier models. GPT-5.x, Claude Opus 4.7, and Gemini 3 Pro cluster within a couple of points on each. FinBen still discriminates, and that is exactly why finance teams treat it as a domain floor before any private eval work begins.

Why FinBen matters in production LLM and agent systems

A model that scores 88 on MMLU can still get an earnings-call summary catastrophically wrong: confusing GAAP with non-GAAP numbers, dropping a guidance revision, or hallucinating a segment that does not exist in the transcript. Generic benchmarks miss this because they do not test finance-specific reading, table parsing, currency math, or regulatory framing. FinBen exists to close that gap with concrete, reproducible tasks pulled from financial corpora.

The pain shows up across roles. A platform engineer onboards a new “frontier” model and finds that its summaries of 10-Q sections silently lose footnote disclosures. An ML lead picks a cheaper open-weight model based on chatbot-arena rank and watches credit-classification accuracy drop 12 points in offline tests. A compliance owner is asked, mid-audit, “how did you choose this model for risk-flagging?”, and FinBen-style benchmark numbers, paired with private evals, are the cleanest answer.

For 2026 agent stacks that pull SEC filings, internal databases, and live market data through tool calls (increasingly via MCP), the finance-task floor matters more than overall fluency. A FinBen result tells you whether the reasoning core handles the domain at all; trajectory-level evals over your real prompts, retrievers, and guardrails tell you whether the agent works for your users.

How FutureAGI handles FinBen-style evaluation

FutureAGI does not host the FinBen leaderboard; we sit one layer below it, where teams run private FinBen-shaped evals against their own data. The workflow uses three FutureAGI surfaces: a versioned Dataset, the Dataset.add_evaluation() interface, and fi.evals evaluators wired to OpenTelemetry traces.

Concretely: a fintech team imports the FinBen QA subset (or a private extension built from their own filings and analyst notes) as a Dataset, then attaches FactualAccuracy, AnswerRelevancy, and GroundTruthMatch for closed-form questions, plus SummaryQuality and IsFactuallyConsistent for generation tasks. Each row is scored, the run is versioned, and the result is diffable against prior runs. When a new model lands, the same suite runs as a RegressionEval so the team sees per-task deltas (extraction up 4 points, multi-hop QA down 7) instead of a single moved-up-or-down number.
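The per-task comparison described above can be sketched in plain Python. This is a minimal illustration, not the RegressionEval implementation: the function names, the task names, the scores, and the 1-point tolerance are all hypothetical.

```python
from typing import Dict


def per_task_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Return candidate minus baseline for every task present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {task: round(candidate[task] - baseline[task], 2) for task in shared}


def regressions(deltas: Dict[str, float], tolerance: float = 1.0) -> Dict[str, float]:
    """Keep only tasks whose score dropped by more than `tolerance` points."""
    return {task: delta for task, delta in deltas.items() if delta < -tolerance}


# Hypothetical per-task scores (0-100) for two model versions.
baseline = {"extraction": 81.0, "multi_hop_qa": 64.0, "summarization": 72.5}
candidate = {"extraction": 85.0, "multi_hop_qa": 57.0, "summarization": 72.5}

deltas = per_task_deltas(baseline, candidate)
print(deltas)               # per-task movement
print(regressions(deltas))  # tasks that need a closer look
```

With these made-up numbers the mean score moves by only one point, while the per-task view shows a 7-point drop on multi-hop QA that the aggregate would hide.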

For live traffic, the same evaluators run online via traceAI: every span carrying an LLM output on a finance route is sampled, scored, and fed into an eval-fail-rate-by-cohort dashboard sliced by document type, customer tier, and model variant. FutureAGI’s approach is to treat FinBen as a starting kit, not a finish line: the public tasks pin the model floor; the private regression suite proves it works on your portfolio.
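As a rough sketch of what an eval-fail-rate-by-cohort aggregation computes, the snippet below groups scored records by cohort keys and reports the share that fall under a pass threshold. The record fields, the 0.5 threshold, and the function name are assumptions for illustration; this is not the traceAI data model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fail_rate_by_cohort(
    records: Iterable[dict],
    keys: Tuple[str, ...],
    threshold: float = 0.5,
) -> Dict[tuple, float]:
    """Fraction of records per cohort whose eval score is below `threshold`."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for record in records:
        cohort = tuple(record[key] for key in keys)
        totals[cohort] += 1
        if record["score"] < threshold:
            fails[cohort] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}


# Hypothetical scored spans; field names are illustrative only.
records = [
    {"doc_type": "10-K", "model": "model-a", "score": 0.9},
    {"doc_type": "10-K", "model": "model-a", "score": 0.3},
    {"doc_type": "earnings_call", "model": "model-a", "score": 0.8},
    {"doc_type": "earnings_call", "model": "model-b", "score": 0.2},
]

rates = fail_rate_by_cohort(records, keys=("doc_type", "model"))
for cohort, rate in sorted(rates.items()):
    print(cohort, rate)
```

Slicing by more than one key is the point: a model variant can look healthy overall and still fail on a single document type.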

FinBen task families and what they expose

Task family | What it tests | Frontier difficulty in 2026
Information extraction | 10-K, 10-Q fields | Mostly solved, but unit/currency errors persist
QA | Multi-doc finance Q&A | Mid-difficulty; long-context retrieval matters
Generation | Summaries, briefs | Faithfulness regressions easy to miss
Risk classification | Credit, fraud labels | Small-class imbalance; F1 matters
Forecasting | Time-series numeric | Hardest tier; frontier still weak
Decision-making | Allocation, hedge picks | Strongly affected by hallucinations
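The note that F1 matters under class imbalance is easiest to see with a toy example. The sketch below uses invented labels (5 positive cases out of 100) and hand-rolled metrics; nothing here comes from FinBen data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_positive(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for the positive (rare) class; 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 5 fraud cases (1) followed by 95 legitimate cases (0).
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100                       # never flags anything
partial = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93     # 3 hits, 2 misses, 2 false alarms

print(accuracy(y_true, always_negative), f1_positive(y_true, always_negative))
print(accuracy(y_true, partial), f1_positive(y_true, partial))
```

The classifier that never flags fraud is nearly as accurate as the one that catches three of five cases, but its F1 on the positive class is zero, which is why accuracy alone is a poor signal on risk-classification tasks.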

How to measure FinBen-style performance

FinBen-style measurement is a stack of evaluators over a fixed task split:

  • FactualAccuracy: returns 0–1 plus a reason for whether a model’s claim matches the ground-truth filing or numeric answer.
  • GroundTruthMatch: boolean exact-match against canonical reference answers in extraction tasks.
  • AnswerRelevancy: measures whether the response actually addresses the finance question rather than hedging.
  • SummaryQuality: rubric score for earnings-call and report summarization.
  • IsFactuallyConsistent: NLI-based check between summary and source filing.
  • Per-task accuracy (dashboard signal): track each FinBen subtask separately; aggregate scores hide regressions in narrow but high-stakes tasks.

from fi.evals import FactualAccuracy, AnswerRelevancy

# Score one closed-form answer against its reference value.
fact = FactualAccuracy()
result = fact.evaluate(
    input="What was Q3 2024 GAAP EPS?",  # the finance question
    output="Q3 GAAP EPS was $1.42.",     # the model's answer
    expected_response="$1.42",           # ground truth from the filing
)
# score is in the 0–1 range; reason explains the verdict.
print(result.score, result.reason)
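Because unit and currency errors persist on extraction tasks, a strict string match can be both too harsh ("$1,420 million" vs. "1.42 billion") and too lenient. The sketch below shows one way to normalize amounts before comparing. It is a hypothetical helper, not how GroundTruthMatch works, and it only understands the scale words "thousand", "million", and "billion".

```python
import re
from decimal import Decimal
from typing import List

_SCALE = {"thousand": Decimal("1e3"), "million": Decimal("1e6"), "billion": Decimal("1e9")}

# A number not glued to a preceding letter, digit, or dot (so "Q3" is skipped),
# optionally followed by a scale word.
_AMOUNT = re.compile(r"(?<![a-z\d.])(-?\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?")


def parse_amounts(text: str) -> List[Decimal]:
    """Return every numeric amount in `text`, with scale words applied."""
    amounts = []
    for number, scale_word in _AMOUNT.findall(text.lower()):
        value = Decimal(number.replace(",", ""))
        amounts.append(value * _SCALE.get(scale_word, Decimal(1)))
    return amounts


def numeric_match(output: str, expected: str) -> bool:
    """True if the single expected amount appears among the amounts in `output`."""
    wanted = parse_amounts(expected)
    if len(wanted) != 1:
        return False
    return wanted[0] in parse_amounts(output)


print(numeric_match("Q3 GAAP EPS was $1.42.", "$1.42"))
print(numeric_match("Revenue was $1,420 million.", "1.42 billion"))
print(numeric_match("Revenue was $1.42 million.", "$1.42 billion"))
```

A check like this ignores the currency symbol entirely, so a production version would also need to compare currencies and decide how to treat years and other incidental numbers in the output.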

Common mistakes

  • Treating the FinBen aggregate score as your model verdict. Aggregate numbers hide which finance subtask broke; track per-task and per-cohort deltas.
  • Skipping the Chinese subset because your product is English-only. Multilingual gaps often correlate with weaker numeric reasoning even on English text.
  • Benchmarking on FinBen and shipping without private evals. Public benchmarks can leak into model training data; pair FinBen with a held-out internal golden dataset.
  • Comparing models across different prompt templates. Hold the prompt constant or report results per template; FinBen scores are prompt-sensitive.
  • Ignoring forecasting tasks because they look hard. Forecasting subtasks expose calibration weaknesses that hurt downstream risk products.

In our 2026 evals across finance workloads, the cohort that exposes the biggest model gap is forecasting: GPT-5.x, Claude Opus 4.7, and Gemini 3 Pro produce surface-credible narratives but still get directional accuracy wrong roughly a third of the time when probed with FinBen-style time-series prompts. FactualConsistency and judge-model rubrics with G-Eval-style scoring help separate “well-written” from “actually right” on those tasks.
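Directional accuracy, as used above, can be computed as the share of forecasts that move the same way as the realized value relative to the last observed value. The sketch below is one common definition, written under that assumption; the series are invented and are not results from any model.

```python
from typing import Sequence


def _sign(x: float) -> int:
    """Return 1, -1, or 0 for positive, negative, or zero movement."""
    return (x > 0) - (x < 0)


def directional_accuracy(
    previous: Sequence[float],
    actual: Sequence[float],
    predicted: Sequence[float],
) -> float:
    """Share of forecasts whose direction of change matches the realized one."""
    if not previous or not (len(previous) == len(actual) == len(predicted)):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        _sign(act - prev) == _sign(pred - prev)
        for prev, act, pred in zip(previous, actual, predicted)
    )
    return hits / len(previous)


# Invented series: last observed value, realized next value, model forecast.
previous = [100.0, 102.0, 101.0, 105.0]
actual = [102.0, 101.0, 105.0, 104.0]
predicted = [103.0, 103.0, 104.0, 103.0]

print(directional_accuracy(previous, actual, predicted))
```

A forecast can land close to the realized value and still be directionally wrong, so this metric is usually reported alongside an error measure rather than instead of one.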

Frequently Asked Questions

What is FinBen?

FinBen is a public LLM benchmark focused on financial tasks (extraction, QA, generation, risk, forecasting, and decision-making), designed to evaluate how well language models handle finance-specific reasoning over English and Chinese corpora.

How is FinBen different from general benchmarks like MMLU?

MMLU samples encyclopedic knowledge across 57 subjects; FinBen drills into finance-specific reading, calculation, and decision tasks. A model can be strong on MMLU and weak on 10-K extraction or credit-risk classification, and vice versa.

How do you run FinBen-style evaluations in production?

FutureAGI lets you load a FinBen-shaped Dataset, attach evaluators like FactualAccuracy, AnswerRelevancy, and GroundTruthMatch via Dataset.add_evaluation, and re-run the suite as a regression eval on every model or prompt change.