Future AGI and Portkey in 2026: How to Pair an LLM Gateway with End-to-End Evaluation
Future AGI x Portkey in 2026: combine Portkey's routing and fallback across 250+ models with Future AGI's traceAI eval scores, and set it up in about five minutes of Python.
LLM gateways own routing, retries, and cost. Eval and observability platforms own quality. In 2026 you almost always want both, and you want them stitched together so one trace tells the full story of a request. This post explains how to wire Future AGI’s traceAI plus continuous evaluators onto Portkey’s gateway in about five minutes of Python.
TL;DR: What This Integration Does
| Layer | Owner | What it gives you |
|---|---|---|
| Evaluation and observability | Future AGI | traceAI OpenTelemetry spans, faithfulness / context adherence / toxicity / custom LLM-as-judge scores, prompt optimisation, simulation |
| Gateway and routing | Portkey | Unified API across 250+ models, retries, fallbacks, semantic cache, virtual keys, budget caps, guardrails |
| Glue | traceai-portkey | Auto-instruments Portkey calls and forwards request, response, and routing metadata to Future AGI’s eval engine |
In one trace you see: which provider answered, whether a fallback fired, what it cost, how long it took, and what the evaluators thought of the answer.
Why Pair an Eval Layer with a Gateway in 2026
Production LLM applications in 2026 typically have these moving parts:
- A routing decision (cheap model for short prompts, frontier model for long, fallback when a provider is degraded)
- A retry and fallback policy across providers
- Cost tracking by team, route, and feature
- Quality scoring on the output (groundedness, context adherence, policy compliance)
- A trace per request that ties all of the above together
A gateway like Portkey solves the first three. An eval and observability platform like Future AGI solves the next two. Running them in isolation means every quality regression triggers a manual cross-system investigation: was it the model, the prompt, the retrieval, or a provider degradation that caused the score drop?
The integration removes that hop. Portkey’s routing metadata becomes span attributes in Future AGI, so a single dashboard answers both “which provider answered?” and “how good was the answer?”.
What Each Side Owns
Future AGI
Future AGI is an evaluation and observability platform. The pieces you use here:
- traceAI (Apache 2.0): an OpenTelemetry-compatible Python SDK for emitting spans from LLM calls, tool calls, retrievals, and agent steps.
- fi.evals: cloud and self-hosted evaluators. Built-in evaluators include faithfulness, context adherence, toxicity, summary quality, and several agent-specific metrics. Eval models include turing_flash (about 1 to 2 seconds), turing_small (about 2 to 3 seconds), and turing_large (about 3 to 5 seconds).
- Custom LLM-as-judge via fi.evals.metrics.CustomLLMJudge, including support for fi.evals.llm.LiteLLMProvider so you can run judges on any model.
- Agent Command Center at /platform/monitor/command-center: a BYOK gateway alternative for teams that want to consolidate. This post is about the Portkey path.
Set FI_API_KEY and FI_SECRET_KEY to authenticate the SDK.
Portkey
Portkey is an AI gateway with broad provider coverage. The pieces that matter for this integration:
- A unified API surface for 250+ models across OpenAI, Anthropic, Google, AWS Bedrock, Vertex AI, Groq, Together, self-hosted Ollama, and more.
- Virtual keys that abstract provider credentials per team or route.
- Automatic retries, fallback to a backup provider, and conditional routing (a minimal config sketch follows this list).
- Semantic and simple caching to drop spend on repeat prompts.
- Per-route budget caps and observability dashboards on its own side.
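As a concrete illustration of the retry-and-fallback piece, here is a minimal sketch of a gateway config passed to the Portkey Python client. It assumes Portkey's documented strategy-plus-targets config shape; the virtual key names are placeholders and PORTKEY_API_KEY is expected in the environment, so treat it as a sketch rather than a drop-in config.

from portkey_ai import Portkey

# Sketch only: try the primary virtual key first, then fall back to the backup.
# Virtual key names are placeholders; see portkey.ai/docs for the full config schema.
fallback_config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {"virtual_key": "openai-prod"},
        {"virtual_key": "anthropic-backup"},
    ],
}

# Assumes PORTKEY_API_KEY is set in the environment.
client = Portkey(config=fallback_config)

The same client is then used exactly as in the quick setup below, and the fallback chain shows up in the trace metadata.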
How the Integration Works
The flow is standard OpenTelemetry plus a Portkey-specific instrumentor:
- Your application calls Portkey via the portkey-ai client, exactly as it does today.
- Portkey routes the request, runs retries or fallbacks if configured, and returns the response.
- The traceai-portkey instrumentor sees the call and packages the prompt, response, provider, latency, cost, fallback chain, and any cache metadata into an OpenTelemetry span.
- The span is sent to Future AGI, where EvalTag rules attached at register() time decide which evaluators to run on the response.
- The Future AGI dashboard renders the trace with both routing metadata and quality scores side by side.
Nothing in your business logic changes. The instrumentor patches the Portkey client when you call PortkeyInstrumentor().instrument(...) during startup.
Quick Setup in Python
Step 1: Get Your Keys
- Future AGI: sign in at futureagi.com, open the project settings, and copy FI_API_KEY and FI_SECRET_KEY.
- Portkey: sign in at portkey.ai, create a virtual key for the provider you want to route through, and copy it.
Store both in a local .env file.
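A minimal .env might look like the lines below. The Portkey variable name is just a label used here for illustration, since the quick-setup script in Step 3 passes the virtual key to the client explicitly.

FI_API_KEY=your-future-agi-api-key
FI_SECRET_KEY=your-future-agi-secret-key
PORTKEY_VIRTUAL_KEY=your-portkey-virtual-key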
Step 2: Install the Packages
pip install portkey-ai fi-instrumentation traceai-portkey python-dotenv
Step 3: Wire Tracing and Run
from dotenv import load_dotenv
from portkey_ai import Portkey
from traceai_portkey import PortkeyInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    EvalName,
)

# Load FI_API_KEY, FI_SECRET_KEY, and any Portkey credentials from .env
load_dotenv()

# Register a Future AGI tracer and attach a context-adherence evaluator
# to every LLM span in this project
tracer_provider = register(
    project_name="My-AI-App",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.CONTEXT_ADHERENCE,
            custom_eval_name="Response_Quality",
        )
    ],
)

# Patch the Portkey client so every call emits an OpenTelemetry span
PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)

# Use the virtual key from Step 1; business logic is unchanged from a plain Portkey call
client = Portkey(virtual_key="your-portkey-virtual-key")

completion = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[
        {"role": "user", "content": "Write a 6-word story about a robot who discovers music."}
    ],
)

print(completion.choices[0].message.content)
Run the script. The Future AGI dashboard now shows a trace per request with the Portkey routing metadata and the context-adherence score. The Portkey dashboard continues to show the operational view (provider, latency, cost, cache hits) unchanged.
Step 4: Add More Evaluators
The minimal setup above scores context adherence. To add more, append additional EvalTag entries to the register() call. Built-in eval names include faithfulness, toxicity, summary quality, and several others; the catalog lives in the Future AGI docs.
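As a sketch, a register() call with a second evaluator could look like the following. EvalName.TOXICITY is assumed here to be the enum member for the toxicity evaluator; confirm the exact names against the catalog in the docs.

tracer_provider = register(
    project_name="My-AI-App",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.CONTEXT_ADHERENCE,
            custom_eval_name="Response_Quality",
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            # Assumed enum member; check the built-in eval catalog for exact names
            eval_name=EvalName.TOXICITY,
            custom_eval_name="Safety_Check",
        ),
    ],
)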
For task-specific rubrics, register a CustomLLMJudge:
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
policy_judge = CustomLLMJudge(
name="policy_compliance",
rule="The response must not promise refunds or discounts.",
provider=LiteLLMProvider(model="gpt-5-2025-08-07"),
)
You can then call policy_judge.run(output=...) from your own pipeline, or wire it as an evaluator in the Future AGI dashboard.
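For example, a hedged sketch of calling the judge inline, assuming the returned result exposes its score and reasoning as attributes defined by the SDK:

# Score a single response from your own pipeline; the exact fields on the
# returned result object are defined by the fi.evals SDK, so inspect it first.
verdict = policy_judge.run(output=completion.choices[0].message.content)
print(verdict)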
What You Get End-to-End
Per request, a single Future AGI trace now carries:
- Prompt and response (with redaction if you have PII rules turned on).
- Provider and exact model version that actually answered.
- Fallback and retry chain when Portkey rerouted around a degraded provider.
- Latency and cost for the chosen provider.
- Cache status (semantic or exact-match) when Portkey returned a cached answer.
- Quality scores from each evaluator attached to the project.
- OpenTelemetry context that links the LLM call to upstream and downstream spans in your application.
For agentic systems, the same traces capture tool calls and retrievals (because they go through fi_instrumentation), so a multi-step trace shows the full path from user input to final answer with quality scores at each step.
Common Patterns Once You Have Both
Catch Provider Degradation Before Users Do
Set an alert in Future AGI on rolling p95 context-adherence drops, grouped by Portkey’s actual_provider attribute. When a provider starts answering noticeably worse on the same prompts, the alert fires before your CS team gets the tickets.
Track Cost Per Quality Point
Pair the Portkey cost attribute with the Future AGI quality score in a custom dashboard. A model that costs 80 percent less but scores 5 percent lower is often the right default; the dashboard lets you decide.
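As a toy calculation with made-up numbers (not benchmarks), the sketch below shows what "cost per quality point" means for the 80-percent-cheaper, 5-percent-lower case described above:

# Toy numbers only: cost per 1K requests and average evaluator score per route.
routes = {
    "frontier": {"cost_per_1k": 10.00, "avg_quality": 0.92},
    "cheap": {"cost_per_1k": 2.00, "avg_quality": 0.87},
}

for name, r in routes.items():
    cost_per_quality_point = r["cost_per_1k"] / (r["avg_quality"] * 100)
    print(f"{name}: {cost_per_quality_point:.3f} per quality point")

# frontier: ~0.109 per quality point; cheap: ~0.023 -- roughly 5x cheaper per
# quality point, at the cost of a 5-point drop in the average score.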
Block Bad Outputs Inline
If you need hard blocks (PII, jailbreaks, policy violations), call an evaluator synchronously from your application code (for example evaluate("toxicity", output=response, model="turing_flash") from fi.evals, or a CustomLLMJudge.run(...) call) and decide whether to return the response based on the score. Asynchronous scoring on traces is the default; switch to inline only where the failure mode justifies the added latency.
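A minimal inline gate, following the evaluate() call quoted above, could look like the sketch below. The threshold and the score attribute on the result object are assumptions; adapt both to the result shape fi.evals actually returns.

from fi.evals import evaluate

answer = completion.choices[0].message.content

# Synchronous safety gate: score the response before returning it to the user.
toxicity = evaluate("toxicity", output=answer, model="turing_flash")

# Attribute name and threshold are illustrative, not the SDK's guaranteed shape.
if toxicity.score > 0.8:
    answer = "Sorry, I can't share that response."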
Replay Bad Traces in Simulation
Once a low-scoring trace lands in Future AGI, you can replay it via fi.simulate.TestRunner with AgentInput and AgentResponse to reproduce the issue against a candidate prompt or model swap before promoting a fix.
Documentation and Source
- Integration cookbook: docs.futureagi.com/cookbook/cookbook11/integrate-portkey-and-futureagi
- traceai-portkey source (Apache 2.0): github.com/future-agi/traceAI
- ai-evaluation SDK (Apache 2.0): github.com/future-agi/ai-evaluation
- Portkey docs: portkey.ai/docs
- OpenTelemetry semantic conventions: opentelemetry.io
Closing Notes
Pairing a gateway with an eval and observability layer is the production default in 2026. Portkey owns the routing concern across many providers; Future AGI owns the quality and tracing concern across many evaluators. The traceai-portkey instrumentor is the thin glue that makes one trace tell both stories.
If you would rather consolidate to a single vendor, the Future AGI Agent Command Center is the BYOK gateway path inside the same platform as the eval engine. Either combination is supportable in 2026; the only configuration we would not recommend is direct provider SDK calls inside business logic with no gateway and no traces.
Frequently Asked Questions
Do I need accounts on both Future AGI and Portkey to use this integration?
Does the integration cost extra on top of Future AGI and Portkey?
Does the integration add latency to production LLM requests?
Which languages and frameworks does traceai-portkey support?
Should I use Future AGI's Agent Command Center instead of Portkey?
What quality evaluators run by default once the integration is wired up?
Can I correlate Portkey fallback events with quality drops in Future AGI?
Is traceAI open source and what is the license?