Future AGI and Portkey in 2026: How to Pair an LLM Gateway with End-to-End Evaluation
Future AGI x Portkey in 2026: combine Portkey's routing and fallback across 250+ models with Future AGI's traceAI eval scores, and set it up in about five minutes of Python.
LLM gateways own routing, retries, and cost. Eval and observability platforms own quality. In 2026 you almost always want both, and you want them stitched together so one trace tells the full story of a request. This post explains how to wire Future AGI’s traceAI plus continuous evaluators onto Portkey’s gateway in about five minutes of Python.
TL;DR: What This Integration Does
| Layer | Owner | What it gives you |
|---|---|---|
| Evaluation and observability | Future AGI | traceAI OpenTelemetry spans, faithfulness / context adherence / toxicity / custom LLM-as-judge scores, prompt optimisation, simulation |
| Gateway and routing | Portkey | Unified API across 250+ models, retries, fallbacks, semantic cache, virtual keys, budget caps, guardrails |
| Glue | traceai-portkey | Auto-instruments Portkey calls and forwards request, response, and routing metadata to Future AGI’s eval engine |
In one trace you see: which provider answered, whether a fallback fired, what it cost, how long it took, and what the evaluators thought of the answer.
Why Pair an Eval Layer with a Gateway in 2026
Production LLM applications in 2026 typically have these moving parts:
- A routing decision (cheap model for short prompts, frontier model for long, fallback when a provider is degraded)
- A retry and fallback policy across providers
- Cost tracking by team, route, and feature
- Quality scoring on the output (groundedness, context adherence, policy compliance)
- A trace per request that ties all of the above together
A gateway like Portkey solves the first three. An eval and observability platform like Future AGI solves the next two. Running them in isolation means every quality regression triggers a manual cross-system investigation: was it the model, the prompt, the retrieval, or a provider degradation that caused the score drop?
The integration removes that hop. Portkey’s routing metadata becomes span attributes in Future AGI, so a single dashboard answers both “which provider answered?” and “how good was the answer?”.
What Each Side Owns
Future AGI
Future AGI is an evaluation and observability platform. The pieces you use here:
- traceAI (Apache 2.0): an OpenTelemetry-compatible Python SDK for emitting spans from LLM calls, tool calls, retrievals, and agent steps.
- fi.evals: cloud and self-hosted evaluators. Built-in evaluators include faithfulness, context adherence, toxicity, summary quality, and several agent-specific metrics. Eval models include turing_flash (about 1 to 2 seconds), turing_small (about 2 to 3 seconds), and turing_large (about 3 to 5 seconds).
- Custom LLM-as-judge via fi.evals.metrics.CustomLLMJudge, including support for fi.evals.llm.LiteLLMProvider so you can run judges on any model.
- Agent Command Center at /platform/monitor/command-center: a BYOK gateway alternative for teams that want to consolidate. This post is about the Portkey path.
Set FI_API_KEY and FI_SECRET_KEY to authenticate the SDK.
Portkey
Portkey is an AI gateway with broad provider coverage. The pieces that matter for this integration:
- A unified API surface for 250+ models across OpenAI, Anthropic, Google, AWS Bedrock, Vertex AI, Groq, Together, self-hosted Ollama, and more.
- Virtual keys that abstract provider credentials per team or route.
- Automatic retries, fallback to a backup provider, and conditional routing (a minimal config sketch follows this list).
- Semantic and simple caching to drop spend on repeat prompts.
- Per-route budget caps and observability dashboards on its own side.
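As a concrete illustration of the retry-and-fallback piece, here is a minimal sketch of a gateway config passed to the Portkey Python client. It assumes Portkey's documented strategy-plus-targets config shape; the virtual key names are placeholders and PORTKEY_API_KEY is expected in the environment, so treat it as a sketch rather than a drop-in config.

from portkey_ai import Portkey

# Sketch only: try the primary virtual key first, then fall back to the backup.
# Virtual key names are placeholders; see portkey.ai/docs for the full config schema.
fallback_config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {"virtual_key": "openai-prod"},
        {"virtual_key": "anthropic-backup"},
    ],
}

# Assumes PORTKEY_API_KEY is set in the environment.
client = Portkey(config=fallback_config)

The same client is then used exactly as in the quick setup below, and the fallback chain shows up in the trace metadata.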
How the Integration Works
The flow is standard OpenTelemetry plus a Portkey-specific instrumentor:
- Your application calls Portkey via the portkey-ai client, exactly as it does today.
- Portkey routes the request, runs retries or fallbacks if configured, and returns the response.
- The traceai-portkey instrumentor sees the call and packages the prompt, response, provider, latency, cost, fallback chain, and any cache metadata into an OpenTelemetry span.
- The span is sent to Future AGI, where EvalTag rules attached at register() time decide which evaluators to run on the response.
- The Future AGI dashboard renders the trace with both routing metadata and quality scores side by side.
Nothing in your business logic changes. The instrumentor patches the Portkey client when you call PortkeyInstrumentor().instrument(...) during startup.
Quick Setup in Python
Step 1: Get Your Keys
- Future AGI: sign in at futureagi.com, open the project settings, and copy FI_API_KEY and FI_SECRET_KEY.
- Portkey: sign in at portkey.ai, create a virtual key for the provider you want to route through, and copy it.
Store both in a local .env file.
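A minimal .env might look like the lines below. The Portkey variable name is just a label used here for illustration, since the quick-setup script in Step 3 passes the virtual key to the client explicitly.

FI_API_KEY=your-future-agi-api-key
FI_SECRET_KEY=your-future-agi-secret-key
PORTKEY_VIRTUAL_KEY=your-portkey-virtual-key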
Step 2: Install the Packages
pip install portkey-ai fi-instrumentation traceai-portkey python-dotenv
Step 3: Wire Tracing and Run
from dotenv import load_dotenv
from portkey_ai import Portkey
from traceai_portkey import PortkeyInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    EvalName,
)

# Load FI_API_KEY, FI_SECRET_KEY, and any Portkey credentials from .env
load_dotenv()

# Register a Future AGI tracer and attach a context-adherence evaluator
# to every LLM span in this project
tracer_provider = register(
    project_name="My-AI-App",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.CONTEXT_ADHERENCE,
            custom_eval_name="Response_Quality",
        )
    ],
)

# Patch the Portkey client so every call emits an OpenTelemetry span
PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)

# Use the virtual key from Step 1; business logic is unchanged from a plain Portkey call
client = Portkey(virtual_key="your-portkey-virtual-key")

completion = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[
        {"role": "user", "content": "Write a 6-word story about a robot who discovers music."}
    ],
)

print(completion.choices[0].message.content)
Run the script. The Future AGI dashboard now shows a trace per request with the Portkey routing metadata and the context-adherence score. The Portkey dashboard continues to show the operational view (provider, latency, cost, cache hits) unchanged.
Step 4: Add More Evaluators
The minimal setup above scores context adherence. To add more, append additional EvalTag entries to the register() call. Built-in eval names include faithfulness, toxicity, summary quality, and several others; the catalog lives in the Future AGI docs.
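As a sketch, a register() call with a second evaluator could look like the following. EvalName.TOXICITY is assumed here to be the enum member for the toxicity evaluator; confirm the exact names against the catalog in the docs.

tracer_provider = register(
    project_name="My-AI-App",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.CONTEXT_ADHERENCE,
            custom_eval_name="Response_Quality",
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            # Assumed enum member; check the built-in eval catalog for exact names
            eval_name=EvalName.TOXICITY,
            custom_eval_name="Safety_Check",
        ),
    ],
)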
For task-specific rubrics, register a CustomLLMJudge:
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
policy_judge = CustomLLMJudge(
name="policy_compliance",
rule="The response must not promise refunds or discounts.",
provider=LiteLLMProvider(model="gpt-5-2025-08-07"),
)
You can then call policy_judge.run(output=...) from your own pipeline, or wire it as an evaluator in the Future AGI dashboard.
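For example, a hedged sketch of calling the judge inline, assuming the returned result exposes its score and reasoning as attributes defined by the SDK:

# Score a single response from your own pipeline; the exact fields on the
# returned result object are defined by the fi.evals SDK, so inspect it first.
verdict = policy_judge.run(output=completion.choices[0].message.content)
print(verdict)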
What You Get End-to-End
Per request, a single Future AGI trace now carries:
- Prompt and response (with redaction if you have PII rules turned on).
- Provider and exact model version that actually answered.
- Fallback and retry chain when Portkey rerouted around a degraded provider.
- Latency and cost for the chosen provider.
- Cache status (semantic or exact-match) when Portkey returned a cached answer.
- Quality scores from each evaluator attached to the project.
- OpenTelemetry context that links the LLM call to upstream and downstream spans in your application.
For agentic systems, the same traces capture tool calls and retrievals (because they go through fi_instrumentation), so a multi-step trace shows the full path from user input to final answer with quality scores at each step.
Common Patterns Once You Have Both
Catch Provider Degradation Before Users Do
Set an alert in Future AGI on rolling p95 context-adherence drops, grouped by Portkey’s actual_provider attribute. When a provider starts answering noticeably worse on the same prompts, the alert fires before your CS team gets the tickets.
Track Cost Per Quality Point
Pair the Portkey cost attribute with the Future AGI quality score in a custom dashboard. A model that costs 80 percent less but scores 5 percent lower is often the right default; the dashboard lets you decide.
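As a toy calculation with made-up numbers (not benchmarks), the sketch below shows what "cost per quality point" means for the 80-percent-cheaper, 5-percent-lower case described above:

# Toy numbers only: cost per 1K requests and average evaluator score per route.
routes = {
    "frontier": {"cost_per_1k": 10.00, "avg_quality": 0.92},
    "cheap": {"cost_per_1k": 2.00, "avg_quality": 0.87},
}

for name, r in routes.items():
    cost_per_quality_point = r["cost_per_1k"] / (r["avg_quality"] * 100)
    print(f"{name}: {cost_per_quality_point:.3f} per quality point")

# frontier: ~0.109 per quality point; cheap: ~0.023 -- roughly 5x cheaper per
# quality point, at the cost of a 5-point drop in the average score.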
Block Bad Outputs Inline
If you need hard blocks (PII, jailbreaks, policy violations), call an evaluator synchronously from your application code (for example evaluate("toxicity", output=response, model="turing_flash") from fi.evals, or a CustomLLMJudge.run(...) call) and decide whether to return the response based on the score. Asynchronous scoring on traces is the default; switch to inline only where the failure mode justifies the added latency.
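A minimal inline gate, following the evaluate() call quoted above, could look like the sketch below. The threshold and the score attribute on the result object are assumptions; adapt both to the result shape fi.evals actually returns.

from fi.evals import evaluate

answer = completion.choices[0].message.content

# Synchronous safety gate: score the response before returning it to the user.
toxicity = evaluate("toxicity", output=answer, model="turing_flash")

# Attribute name and threshold are illustrative, not the SDK's guaranteed shape.
if toxicity.score > 0.8:
    answer = "Sorry, I can't share that response."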
Replay Bad Traces in Simulation
Once a low-scoring trace lands in Future AGI, you can replay it via fi.simulate.TestRunner with AgentInput and AgentResponse to reproduce the issue against a candidate prompt or model swap before promoting a fix.
Documentation and Source
- Integration cookbook: docs.futureagi.com/cookbook/cookbook11/integrate-portkey-and-futureagi
- traceai-portkey source (Apache 2.0): github.com/future-agi/traceAI
- ai-evaluation SDK (Apache 2.0): github.com/future-agi/ai-evaluation
- Portkey docs: portkey.ai/docs
- OpenTelemetry semantic conventions: opentelemetry.io
Closing Notes
Pairing a gateway with an eval and observability layer is the production default in 2026. Portkey owns the routing concern across many providers; Future AGI owns the quality and tracing concern across many evaluators. The traceai-portkey instrumentor is the thin glue that makes one trace tell both stories.
If you would rather consolidate to a single vendor, the Future AGI Agent Command Center is the BYOK gateway path inside the same platform as the eval engine. Either combination is supportable in 2026; the only configuration we would not recommend is direct provider SDK calls inside business logic with no gateway and no traces.
Frequently Asked Questions
Do I need accounts on both Future AGI and Portkey to use this integration?
Does the integration cost extra on top of Future AGI and Portkey?
Does the integration add latency to production LLM requests?
Which languages and frameworks does traceai-portkey support?
Should I use Future AGI's Agent Command Center instead of Portkey?
What quality evaluators run by default once the integration is wired up?
Can I correlate Portkey fallback events with quality drops in Future AGI?
Is traceAI open source and what is the license?