What Is a Model Registry?
A controlled inventory of approved models, providers, versions, capabilities, and routing metadata used to serve production LLM traffic.
A model registry is the controlled inventory of approved models, providers, versions, capabilities, and routing metadata used by an LLM gateway. It is a gateway control-plane concept, not just a training artifact catalog. In production traces, it shows up as the requested model, resolved provider target, cost class, health status, and fallback eligibility. FutureAGI uses this registry idea inside Agent Command Center so teams can route, monitor, test, and retire model targets without scattering model names through application code.
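As a concrete illustration, the registry identity on a single traced model call might look like the dictionary below. Only gen_ai.request.model and llm.token_count.prompt are attribute names this article uses; the remaining keys are hypothetical, not Agent Command Center's actual schema.

# Hypothetical span attributes for one gateway call; only gen_ai.request.model
# and llm.token_count.prompt come from this article's trace fields.
span_attributes = {
    "gen_ai.request.model": "gpt-4o-mini",            # model the application requested
    "gateway.resolved_target": "openai:gpt-4o-mini",  # provider target the gateway called
    "gateway.cost_class": "low",
    "gateway.health_status": "healthy",
    "gateway.fallback_eligible": True,
    "llm.token_count.prompt": 412,
}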
Why model registries matter in production LLM and agent systems
The first failure is usually quiet drift. A team ships gpt-4o-mini in one service, claude-sonnet-4 in another, and a stale vendor alias in a worker. Six weeks later, no one knows which model answered a regulated customer, which fallback fired, or why token spend doubled.
A gateway model registry gives platform engineers a single contract for model identity. Without it, production systems develop three costly symptoms:
- Untracked model changes. A provider alias changes behavior, but traces only show an old string copied from app code.
- Unsafe fallback chains. A low-cost model becomes the backup for a high-risk workflow because nobody encoded capability or policy constraints.
- Messy cost attribution. Finance sees spend by provider key, while developers debug failures by model name and product owners care about task outcome.
Agentic systems make this worse because one user task can call a planner model, a tool-use model, an embedding model, and a verifier. If those targets are not registered consistently, retries, guardrails, and evaluator regressions cannot be compared across the full path. The pain lands on developers during incident review, SREs during provider outages, compliance teams during audits, and users when a degraded model silently handles a task it should never receive.
How FutureAGI handles a model registry
FutureAGI anchors this concept to gateway:models, the Agent Command Center models surface. The registry is the set of model targets the gateway knows how to call, plus the metadata needed to route and evaluate them: provider family, canonical model ID, supported task type, cost tier, context window, streaming support, timeout policy, health state, and whether the target can participate in model fallback, traffic-mirroring, semantic-cache, pre-guardrail, or post-guardrail flows.
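One way to picture a registry entry is as a small record carrying exactly those fields. The sketch below is illustrative; the field names are assumptions, not FutureAGI's schema.

from dataclasses import dataclass, field

@dataclass
class ModelTarget:
    # One registered gateway target; field names are illustrative.
    provider_family: str     # e.g. "openai" or "anthropic"
    canonical_id: str        # e.g. "openai:gpt-4o"
    task_types: set          # supported task types for routing
    cost_tier: str           # e.g. "low", "standard", "premium"
    context_window: int      # max input tokens
    supports_streaming: bool
    timeout_s: float         # per-call timeout policy
    health: str = "healthy"  # updated from runtime probes, not edited by hand
    eligible_flows: set = field(default_factory=set)
    # e.g. {"model-fallback", "traffic-mirroring", "semantic-cache",
    #       "pre-guardrail", "post-guardrail"}

support_answerer = ModelTarget(
    provider_family="openai",
    canonical_id="openai:gpt-4o",
    task_types={"final_answer"},
    cost_tier="premium",
    context_window=128_000,
    supports_streaming=True,
    timeout_s=60.0,
    eligible_flows={"model-fallback", "traffic-mirroring"},
)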
Consider a support agent with three registered targets:
- openai:gpt-4o for high-risk final answers.
- anthropic:claude-sonnet-4 for mirrored comparison traffic.
- openai:gpt-4o-mini for low-risk classification and summarization.
The engineer does not hard-code those names in every service. They attach a routing-policy to the workflow, mark gpt-4o-mini as ineligible for regulated final answers, and mirror 5% of gpt-4o traffic to Claude for an eval run. FutureAGI’s approach is to treat model identity as observable runtime state: each trace carries gen_ai.request.model, token counts such as llm.token_count.prompt, the resolved route, and evaluation outcomes.
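A minimal sketch of that routing decision, assuming a hand-rolled registry dict; the policy logic here is illustrative, not FutureAGI's routing-policy engine.

import random
from typing import Optional

# Illustrative registry for the three targets above; the capability and
# policy metadata are assumptions for this sketch, not real configuration.
REGISTRY = {
    "openai:gpt-4o": {"tasks": {"final_answer"}, "regulated_ok": True},
    "anthropic:claude-sonnet-4": {"tasks": {"final_answer"}, "regulated_ok": True},
    "openai:gpt-4o-mini": {"tasks": {"classification", "summarization"}, "regulated_ok": False},
}

MIRROR_RATE = 0.05  # fraction of gpt-4o traffic mirrored to Claude for eval runs

def resolve_route(task: str, regulated: bool) -> "tuple[str, Optional[str]]":
    # Pick the first registered target that supports the task and, for
    # regulated traffic, is marked eligible for regulated answers.
    candidates = [
        name for name, meta in REGISTRY.items()
        if task in meta["tasks"] and (meta["regulated_ok"] or not regulated)
    ]
    if not candidates:
        raise LookupError(f"no registered target for task {task!r}")
    primary = candidates[0]
    mirror = None
    if primary == "openai:gpt-4o" and random.random() < MIRROR_RATE:
        mirror = "anthropic:claude-sonnet-4"
    return primary, mirror

print(resolve_route("final_answer", regulated=True))     # gpt-4o, sometimes mirrored
print(resolve_route("classification", regulated=False))  # gpt-4o-mini, never regulated work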
Unlike MLflow Model Registry, which is centered on artifact lifecycle, or a plain LiteLLM alias map, Agent Command Center connects registered models to gateway decisions. An engineer can see that a cost-optimized routing policy moved a cohort to a cheaper model, then check Groundedness, TaskCompletion, or user escalation rate before promoting that route.
How to measure or detect a model registry
Measure the registry by checking whether model identity is complete, traceable, and tied to outcomes:
- Registry coverage: percentage of production calls whose requested model resolves to a registered target. Alert on unknown model strings; a computation sketch follows the evaluation example below.
- Trace fields: gen_ai.request.model, resolved provider, route ID, and llm.token_count.prompt should appear on every model span.
- Health by target: 429 rate, 5xx rate, timeout rate, fallback-trigger rate, and p99 latency per registered model.
- Cost by model: token cost per trace and cost per successful task, grouped by model and routing policy.
- Quality by model: eval-fail-rate by target using evaluators such as Groundedness, TaskCompletion, or PromptInjection for the relevant workflow.
from fi.evals import Groundedness

# Example inputs; in production these come from the traced model span.
prompt = "What is the refund window for annual plans?"
response = "Annual plans can be refunded within 30 days of purchase."
retrieved_context = "Refunds: annual plans are refundable within 30 days."

evaluator = Groundedness()
score = evaluator.evaluate(
    input=prompt,               # the user request
    output=response,            # the model answer under test
    context=retrieved_context,  # evidence the answer must stay grounded in
)
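Registry coverage, the first metric in the list above, can be computed directly from exported spans. A minimal sketch, assuming spans arrive as dicts; resolved_target is a hypothetical field name.

REGISTERED = {"openai:gpt-4o", "anthropic:claude-sonnet-4", "openai:gpt-4o-mini"}

def registry_coverage(spans: list) -> float:
    # Flag requested-model strings that did not resolve to a registered target.
    unknown = [
        s.get("gen_ai.request.model", "<missing>")
        for s in spans
        if s.get("resolved_target") not in REGISTERED
    ]
    if unknown:
        print("ALERT: unregistered model strings:", sorted(set(unknown)))
    return 1.0 - len(unknown) / max(len(spans), 1)

spans = [
    {"gen_ai.request.model": "gpt-4o", "resolved_target": "openai:gpt-4o"},
    {"gen_ai.request.model": "gpt-4-turbo"},  # stale alias, never registered
]
print(f"coverage: {registry_coverage(spans):.0%}")  # coverage: 50%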
The registry is healthy when a model rollout can answer three questions quickly: which traffic used the model, what it cost, and whether quality moved outside threshold.
Common mistakes
- Treating a registry as a static spreadsheet. Runtime systems need health, cost, capability, and policy metadata that update with provider behavior.
- Registering provider aliases instead of canonical targets. Aliases hide model upgrades and make regression evals hard to interpret.
- Letting any fallback target serve any workflow. Capability tags should block cheap classifiers from handling regulated final answers; see the sketch after this list.
- Separating model registry and trace data. If traces lack model identity, incident reviews become guesswork.
- Measuring only latency and cost. A model that is fast and cheap can still fail Groundedness or TaskCompletion thresholds.
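For the fallback mistake above, a registration-time check can reject any chain that would let a cheap classifier back a regulated workflow. A sketch with hypothetical capability tags:

# Hypothetical capability tags per registered target.
CAPABILITIES = {
    "openai:gpt-4o": {"final_answer", "regulated"},
    "anthropic:claude-sonnet-4": {"final_answer", "regulated"},
    "openai:gpt-4o-mini": {"classification", "summarization"},
}

def validate_fallback_chain(required_tags: set, chain: list) -> None:
    # Reject the chain if any target lacks a tag the workflow requires.
    for target in chain:
        missing = required_tags - CAPABILITIES.get(target, set())
        if missing:
            raise ValueError(f"{target} lacks required tags: {sorted(missing)}")

try:
    validate_fallback_chain({"final_answer", "regulated"},
                            ["openai:gpt-4o", "openai:gpt-4o-mini"])
except ValueError as err:
    print(err)  # gpt-4o-mini cannot back a regulated final-answer workflow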
Frequently Asked Questions
What is a model registry?
A model registry is the controlled inventory of approved models, providers, versions, capabilities, and routing metadata used by an LLM gateway to serve production traffic safely.
How is a model registry different from a model store?
A model store typically holds artifacts or downloadable model packages. A gateway model registry focuses on runtime metadata: provider, model name, health, cost, capabilities, fallback eligibility, and routing policy compatibility.
How do you measure a model registry?
Measure registry health through trace fields such as gen_ai.request.model, llm.token_count.prompt, model error rate, fallback-trigger rate, cost per model, and eval-fail-rate by registered model.