Models

What Is Model Selection?

The process of choosing the best model from a set of candidates based on task-specific evaluation metrics, cost, and latency.

Model selection is the engineering decision of choosing the best model — among algorithms, foundation models, fine-tunes, or prompt variants — for a specific task. In classical ML it relies on cross-validation across hyperparameter and model-class candidates. In LLM applications it relies on running the same task-specific eval suite against every candidate on a fixed golden dataset, then weighing the eval delta against cost-per-trace and p99 latency. The output is a single shipped model plus a documented rationale that future regression evals and FutureAGI dashboards can challenge.
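
For the classical case, a minimal cross-validation sketch with scikit-learn — the candidate list, toy dataset, and F1 scoring below are illustrative choices, not a prescription:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Candidate model classes, all scored with the same folds and the same metric
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")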

Why It Matters in Production LLM and Agent Systems

The wrong model choice is one of the easiest, most expensive failures to ship. A team picks GPT-4o because the demo looked good, deploys it across 10M traces a month, and discovers Sonnet would have done the same job at a third the cost — but switching now means retesting every prompt, every guardrail, every JSON schema. Or the reverse: a team picks the cheap model, ships, and watches eval-fail-rate-by-cohort climb on the long tail of complex inputs. The selection decision compounds over millions of requests.

The pain shows up across roles. A backend engineer gets paged at 3am for a latency incident caused by a model swap nobody benchmarked at p99. A finance lead sees the LLM bill quintuple after an “upgrade”. A product manager ships a feature that scores brilliantly in a sandbox demo and silently degrades on real traffic. Without a structured selection process — same evals, same dataset, same cost lens — these decisions are made on vibes.

In 2026-era stacks the candidate space has exploded. Open-source models (Llama, Mistral), API frontier models (Anthropic, OpenAI, Google), specialised fine-tunes, and prompt variants all compete for the same workload. Selection is no longer a one-shot decision — it runs continuously, often per-route, with the cheapest-but-good-enough model picked dynamically by a router. That makes structured, repeatable model selection a core MLOps capability, not a kickoff-week artifact.

How FutureAGI Handles Model Selection

FutureAGI’s approach to model selection treats every candidate as a row in the same eval matrix:

  • Same dataset: you load a Dataset of representative inputs (golden set + sampled production traces) and use Dataset.add_evaluation() to attach the task-specific evaluators — TaskCompletion, Groundedness, AnswerRelevancy, FactualAccuracy, JsonValidation.
  • Same evaluators: every candidate model runs against the same evaluator stack, so scores are directly comparable.
  • Same cost lens: traceAI captures llm.token_count.prompt, llm.token_count.completion, and end-to-end latency on each run, letting you build a single table of eval-score × cost-per-trace × p99-latency per candidate.
  • Routing handoff: once selection settles, the chosen model goes into the Agent Command Center as a routing policy — cost-optimized or quality-first — with model fallback configured to a backup if the primary fails or hits a quota.
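
A condensed sketch of that setup, reusing the names this section already introduces (Dataset, add_evaluation(), and the evaluator classes); the import path and the load() call are assumptions for illustration, not canonical SDK reference:

from fi.datasets import Dataset                      # assumed import path
from fi.evals import AnswerRelevancy, Groundedness, TaskCompletion

# Golden set plus sampled production traces -- the same rows for every candidate
ds = Dataset.load("support-replies-golden")          # illustrative dataset name and loader

# Attach the task-specific evaluator stack once; each candidate model runs against it
for evaluator in (TaskCompletion(), Groundedness(), AnswerRelevancy()):
    ds.add_evaluation(evaluator)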

Concretely: a customer-support team is choosing between three models for a tier-1 reply assistant. They run a 500-row golden dataset against each, score with AnswerRelevancy and Groundedness, and capture cost and latency. The cheapest model scores 0.81 against the frontier’s 0.89, but it costs a sixth as much per trace and its p99 latency is 40% lower. They send 80% of traffic to the cheap model and route the long-tail, high-stakes intents to the frontier — selection becomes a dynamic per-route decision wired through FutureAGI’s gateway, not a static pick.
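
Stripped of any particular gateway, the per-route decision in that example boils down to something like the following sketch; the intent labels and model identifiers are placeholders:

# Intents the team considers high-stakes enough to justify the frontier model
HIGH_STAKES_INTENTS = {"refund_dispute", "legal_threat", "account_compromise"}

def pick_model(intent: str) -> str:
    # Frontier model: 0.89 eval score, ~6x the cost per trace
    if intent in HIGH_STAKES_INTENTS:
        return "frontier-model"
    # Cheap model: 0.81 eval score, 40% lower p99 latency -- handles ~80% of traffic
    return "cheap-model"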

How to Measure or Detect It

Model selection signals span quality, cost, and latency — pick a balanced set:

  • Eval-score-by-candidate (dashboard): a single table of evaluator scores per model on the same dataset.
  • TaskCompletion: fi.evals.TaskCompletion returns 0–1 for whether each candidate finished the task.
  • Groundedness: fi.evals.Groundedness returns 0–1 anchored to retrieved context — the canonical RAG quality metric.
  • AnswerRelevancy: fi.evals.AnswerRelevancy measures how directly the output answers the input.
  • Cost-per-trace (OTel): derived from llm.token_count.prompt + llm.token_count.completion and per-model prices.
  • p99 latency (OTel): wall-clock latency at the 99th percentile from traceAI spans.

Minimal Python:

from fi.evals import TaskCompletion, Groundedness

# ds: the shared golden Dataset, loaded earlier via the FutureAGI SDK
candidates = ["gpt-4o", "claude-sonnet-4.6", "llama-3.1-70b"]
results = {}
for m in candidates:
    # Same dataset and same evaluators for every candidate, so scores stay comparable
    results[m] = {
        "task": TaskCompletion().evaluate(dataset=ds, model=m).mean,
        "ground": Groundedness().evaluate(dataset=ds, model=m).mean,
    }
print(results)
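
Eval scores fill only one column of the selection table; here is a sketch for deriving the cost and latency columns from the token counts and latencies captured on each trace (the per-1K-token prices below are placeholders to be replaced with the real price sheet):

import math

# Placeholder (input, output) prices in USD per 1K tokens -- substitute real pricing
PRICES = {
    "gpt-4o": (0.0025, 0.010),
    "claude-sonnet-4.6": (0.003, 0.015),
    "llama-3.1-70b": (0.0004, 0.0008),
}

def cost_per_trace(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Token counts correspond to llm.token_count.prompt / llm.token_count.completion
    price_in, price_out = PRICES[model]
    return prompt_tokens / 1000 * price_in + completion_tokens / 1000 * price_out

def p99(latencies_ms: list[float]) -> float:
    # Nearest-rank 99th percentile over per-trace wall-clock latencies
    ranked = sorted(latencies_ms)
    return ranked[max(0, math.ceil(0.99 * len(ranked)) - 1)]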

Common Mistakes

  • Selecting on a benchmark, not your task. MMLU and MT-Bench rankings rarely match what your users actually do — always evaluate on a dataset that mirrors production.
  • Ignoring cost and latency. A model that scores 2% higher and costs 5x is rarely the right pick at scale; balance the trade-off explicitly.
  • No regression eval after selection. The chosen model needs an ongoing eval cohort; “we picked it once” is not a quality strategy.
  • Picking based on a single demo prompt. Demo prompts are unrepresentative; require N >= 200 dataset rows before any selection decision.
  • Skipping model fallback configuration. Without a fallback, the moment your selected model rate-limits or breaks, the whole product breaks too.

Frequently Asked Questions

What is model selection?

Model selection is choosing the best model from a candidate set for a given task by running task-specific evals and weighing accuracy, cost, and latency.

How is model selection different from hyperparameter tuning?

Hyperparameter tuning optimises one model's configuration. Model selection chooses across distinct models or model families. The selection step often consumes the output of tuning runs as candidates.

How do you do model selection for LLM applications?

Run the same task-specific eval suite — TaskCompletion, Groundedness, AnswerRelevancy — against every candidate on the same golden dataset, then weigh the scores against cost-per-trace and p99 latency.