Evaluation

What Is the FinBen Domain-Specific Benchmark?

An open finance-domain LLM benchmark covering extraction, analysis, QA, generation, risk, forecasting, and decision-making tasks across English and Chinese corpora.

FinBen is a public, domain-specific benchmark for evaluating large language models on finance tasks. It bundles datasets across information extraction, textual analysis, question answering, text generation, risk management, forecasting, and decision-making in both English and Chinese. Released in 2024 by the FinNLP / PIXIU community, it gives finance-focused teams a unified yardstick where general benchmarks like MMLU, MT-Bench, or GSM8K cannot tell you whether a model can actually read a 10-K, classify credit risk, or summarize an earnings call. In production stacks it is used as a model-selection filter, not the final word on quality.

Why It Matters in Production LLM and Agent Systems

A model that scores 88 on MMLU can still get an earnings-call summary catastrophically wrong — confusing GAAP vs. non-GAAP numbers, dropping a guidance revision, or hallucinating a segment that does not exist in the transcript. Generic benchmarks miss this because they do not test finance-specific reading, table parsing, currency math, or regulatory framing. FinBen exists to close that gap with concrete, reproducible tasks pulled from financial corpora.

The pain shows up across roles. A platform engineer onboards a new “frontier” model and finds that summarizations of 10-Q sections silently lose footnote disclosures. An ML lead picks a cheaper open-weight model based on chatbot-arena rank and watches credit-classification accuracy drop 12 points in offline tests. A compliance owner is asked, mid-audit, “How did you choose this model for risk-flagging?” — and FinBen-style benchmark numbers, paired with private evals, are the cleanest answer.

For 2026 agent stacks that pull SEC filings, internal databases, and live market data through tool calls, the finance-task floor matters more than overall fluency. A FinBen result tells you whether the reasoning core handles the domain at all; trajectory-level evals over your real prompts, retrievers, and guardrails tell you whether the agent works for your users.

How FutureAGI Handles FinBen-Style Evaluation

FutureAGI does not host the FinBen leaderboard; we sit one layer below it, where teams run private FinBen-shaped evals against their own data. The workflow uses three FAGI surfaces: a versioned Dataset, the Dataset.add_evaluation() interface, and fi.evals evaluators wired to OpenTelemetry traces.

Concretely: a fintech team imports the FinBen QA subset (or a private extension built from their own filings and analyst notes) as a Dataset, then attaches FactualAccuracy, AnswerRelevancy, and GroundTruthMatch for closed-form questions, plus SummaryQuality and IsFactuallyConsistent for generation tasks. Each row is scored, the run is versioned, and the result is diffable against prior runs. When a new model lands, the same suite runs as a RegressionEval so the team sees per-task deltas — extraction up 4 points, multi-hop QA down 7 — instead of a single moved-up-or-down number.
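
A minimal sketch of that batch workflow in Python follows. Only Dataset.add_evaluation() and the fi.evals evaluator names come from the surfaces above; the import path for Dataset, the constructor arguments, and the run() and per-task accessors are illustrative assumptions, not confirmed API.

from fi.evals import FactualAccuracy, AnswerRelevancy, GroundTruthMatch
from fi.datasets import Dataset  # import path assumed for illustration

# Load a FinBen-shaped QA subset as a versioned Dataset
# (constructor arguments are assumptions).
ds = Dataset(name="finben-qa-private", version="2024-10")

# Attach evaluators; add_evaluation() is the documented surface.
ds.add_evaluation(FactualAccuracy())
ds.add_evaluation(GroundTruthMatch())
ds.add_evaluation(AnswerRelevancy())

# Score every row against a candidate model; run() and the per-task
# accessor below are assumed names for the versioned, diffable run.
run = ds.run(model="candidate-model-v2")
for task, score in run.per_task_scores.items():
    print(task, score)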

For live traffic, the same evaluators run online via traceAI: every span carrying an LLM output on a finance route is sampled, scored, and fed into an eval-fail-rate-by-cohort dashboard sliced by document type, customer tier, and model variant. FutureAGI’s approach is to treat FinBen as a starting kit, not a finish line — the public tasks pin the model floor; the private regression suite proves it works on your portfolio.
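
On the online path, one way to wire this up is a custom OpenTelemetry span processor, sketched below. The span-attribute keys ("route", "llm.input", "llm.output") and scoring an output without a reference answer are assumptions for illustration; in practice traceAI's instrumentation handles this wiring.

import random

from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

from fi.evals import FactualAccuracy

class FinanceEvalProcessor(SpanProcessor):
    """Sample finished LLM spans on finance routes and score them."""

    def __init__(self, sample_rate: float = 0.1):
        self.sample_rate = sample_rate
        self.evaluator = FactualAccuracy()

    def on_end(self, span: ReadableSpan) -> None:
        attrs = span.attributes or {}
        # Attribute keys are assumed; match them to your instrumentation.
        if attrs.get("route") != "finance":
            return
        if random.random() > self.sample_rate:
            return
        result = self.evaluator.evaluate(
            input=attrs.get("llm.input", ""),
            output=attrs.get("llm.output", ""),
        )
        # In practice the score feeds the eval-fail-rate dashboard,
        # sliced by document type, customer tier, and model variant.
        print(span.name, result.score)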

How to Measure or Detect It

FinBen-style measurement is a stack of evaluators over a fixed task split:

  • FactualAccuracy — returns 0–1 plus a reason for whether a model’s claim matches the ground-truth filing or numeric answer.
  • GroundTruthMatch — boolean exact-match against canonical reference answers in extraction tasks.
  • AnswerRelevancy — measures whether the response actually addresses the finance question rather than hedging.
  • SummaryQuality — rubric score for earnings-call and report summarization.
  • IsFactuallyConsistent — NLI-based check between summary and source filing.
  • Per-task accuracy (dashboard signal) — track each FinBen subtask separately; aggregate scores hide regressions in narrow but high-stakes tasks.
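
A single-row spot check with FactualAccuracy, mirroring the evaluator list above:
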
from fi.evals import FactualAccuracy

# Score one model answer against a canonical reference answer.
fact = FactualAccuracy()
result = fact.evaluate(
    input="What was Q3 2024 GAAP EPS?",  # the finance question
    output="Q3 GAAP EPS was $1.42.",     # the model's answer
    expected_response="$1.42",           # ground-truth reference
)
print(result.score, result.reason)  # 0-1 score plus a short explanation

Common Mistakes

  • Treating the FinBen aggregate score as your model verdict. Aggregate numbers hide which finance subtask broke; track per-task and per-cohort deltas instead (see the delta sketch after this list).
  • Skipping the Chinese subset because your product is English-only. Multilingual gaps often correlate with weaker numeric reasoning even on English text.
  • Benchmarking on FinBen and shipping without private evals. Public benchmarks leak into model training; pair FinBen with a held-out internal dataset.
  • Comparing models across different prompt templates. Hold the prompt constant or report results per-template; FinBen scores are prompt-sensitive.
  • Ignoring forecasting tasks because they look hard. Forecasting subtasks expose calibration weaknesses that hurt downstream risk products.
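
To make the per-task point concrete, here is a plain-Python delta report between two benchmark runs (the scores are illustrative placeholders, not real results):

old_run = {"extraction": 0.81, "multi_hop_qa": 0.74, "summarization": 0.68}
new_run = {"extraction": 0.85, "multi_hop_qa": 0.67, "summarization": 0.69}

# Report each subtask separately so a 7-point multi-hop QA regression
# is not averaged away by the extraction gain.
for task in old_run:
    delta = (new_run[task] - old_run[task]) * 100
    print(f"{task}: {delta:+.1f} points")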

Frequently Asked Questions

What is FinBen?

FinBen is a public LLM benchmark focused on financial tasks — extraction, QA, generation, risk, forecasting, and decision-making — designed to evaluate how well language models handle finance-specific reasoning over English and Chinese corpora.

How is FinBen different from general benchmarks like MMLU?

MMLU samples encyclopedic knowledge across 57 subjects; FinBen drills into finance-specific reading, calculation, and decision tasks. A model can be strong on MMLU and weak on 10-K extraction or credit-risk classification, and vice versa.

How do you run FinBen-style evaluations in production?

FutureAGI lets you load a FinBen-shaped Dataset, attach evaluators like FactualAccuracy, AnswerRelevancy, and GroundTruthMatch via Dataset.add_evaluation, and re-run the suite as a regression eval on every model or prompt change.