OpenAI AgentKit + Future AGI in 2026: Build, Trace, and Evaluate Production Agents End to End

OpenAI AgentKit (Oct 2025) + Future AGI in 2026: visual builder, traceAI auto-instrumentation, fi.evals scoring, BYOK gateway. Real code, real APIs, no hype.

OpenAI AgentKit + Future AGI in 2026: Reliable Production Agents

OpenAI shipped AgentKit on October 6, 2025 to give teams a visual builder, embeddable UI, Python and TypeScript SDKs, and a connector registry for building LLM agents (openai.com/index/introducing-agentkit). Future AGI sits on top of that stack as the reliability layer: auto-instrumentation via the Apache 2.0 traceAI library, evaluation via the Apache 2.0 ai-evaluation SDK, and a BYOK Agent Command Center gateway. This guide walks through the integration, ships runnable code, and shows where each tool earns its keep.

TL;DR

Layer | OpenAI AgentKit | Future AGI
Build | Agent Builder visual canvas, Agents SDK, Connector Registry, ChatKit | Prompt-opt loop, synthetic dataset generator
Run | Responses API runtime | BYOK Agent Command Center gateway at /platform/monitor/command-center
Trace | Built-in trace viewer for AgentKit runs | traceAI-openai-agents (Apache 2.0) emits OpenTelemetry spans
Evaluate | AgentKit Evals: dataset builder, trace grading, prompt optimizer | fi.evals.evaluate() cloud catalog with turing_flash (1-2s), turing_small (2-3s), turing_large (3-5s)
Improve | Auto prompt-rewrite inside Builder | Closed-loop prompt-opt fed by production eval signals
Monitor | Console dashboards for AgentKit-hosted runs | Continuous online evals, alerts, dashboards

Why agents fail in production and how to see it

A typical AgentKit workflow has three failure surfaces:

  1. Reasoning drift. The agent picks the wrong branch on the Agent Builder canvas.
  2. Tool drift. A connector returns a 200 with garbage, and the agent treats it as authoritative.
  3. Cost and latency drift. A loop keeps the model busy at 1.5x the budgeted spend.

Each is a visibility problem before it is a code problem. AgentKit’s built-in trace viewer is enough to debug a single run; production needs continuous traces, eval scores on every run, and alerts when scores regress. That is the gap Future AGI fills.

OpenAI AgentKit: components in one paragraph each

Agent Builder

Visual canvas with state-machine semantics. Each node is a state, each edge is a handoff. The runtime is the Responses API. Versioning, branching, and inline evals are first-class. Use Agent Builder when the workflow shape is predictable (customer support, internal automation) or when non-engineers need to edit flows.

Connector Registry

Centralized service for tools, data sources, and MCP servers. Each connector carries its own OAuth or API-key credentials and access scope. The registry is the central audit point for outbound calls. MCP support lets agents talk to third-party services without custom integrations (modelcontextprotocol.io).
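
To show the SDK side of that MCP story, here is a minimal sketch of attaching an MCP server to an agent with the Agents SDK's agents.mcp helpers. The filesystem server command and the ./docs path are illustrative stand-ins for whatever your Connector Registry actually exposes.

import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio

async def main() -> None:
    # Example MCP server: the reference filesystem server, launched over stdio.
    async with MCPServerStdio(
        params={
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "./docs"],
        }
    ) as fs_server:
        agent = Agent(
            name="Docs Assistant",
            instructions="Answer questions using the mounted docs folder.",
            mcp_servers=[fs_server],
        )
        result = await Runner.run(agent, "Summarize onboarding.md")
        print(result.final_output)

asyncio.run(main())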

ChatKit

Embeddable chat UI with streaming, thread persistence, file upload, and a Python SDK for backend hooks. The UI shows reasoning steps and live tool usage. Use ChatKit when you want a polished interaction surface without owning the front-end stack.

Agents SDK (Python + TypeScript)

Code-first definition of agents, tools, and orchestration. Shares the Responses API runtime with Agent Builder, so flows can be authored in either surface. Use the SDK when logic is custom, when CI/CD pipelines need agent definitions in version control, or when you want to instrument runs programmatically.
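
A minimal code-first sketch of that path, assuming the Python Agents SDK; the lookup_account tool is a stub for illustration, where a real agent would call a connector or your own service.

from agents import Agent, Runner, function_tool

@function_tool
def lookup_account(account_id: str) -> str:
    """Return the plan tier for an account (stubbed for illustration)."""
    return "enterprise"

support_agent = Agent(
    name="Support Agent",
    instructions=(
        "Answer billing questions. "
        "Call lookup_account when the user supplies an account ID."
    ),
    tools=[lookup_account],
)

result = Runner.run_sync(support_agent, "Which plan is account acct_123 on?")
print(result.final_output)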

Built-in AgentKit Evals

Four capabilities: dataset builder for in-workflow test sets, trace grading for end-to-end run quality, prompt optimization for prompt rewrites, and third-party model scoring. Strong for the build phase, less suited to continuous evaluation on live traffic. Future AGI covers the continuous side.

Future AGI: the production reliability layer

Future AGI exposes three open-source surfaces plus a managed cloud catalog.

  • traceAI (pip install traceAI-openai-agents): Apache 2.0 monorepo of OpenTelemetry instrumentors. The OpenAI Agents SDK instrumentor automatically emits spans for Runner.run, tool calls, handoffs, and LLM calls.
  • ai-evaluation (pip install ai-evaluation): Apache 2.0 SDK behind the fi.evals namespace, exposing evaluate for catalog templates, Evaluator and CustomLLMJudge for custom rubrics, and LiteLLMProvider for judge-model routing.
  • fi_instrumentation: helper that calls register() to create a tracer provider linked to your FI_API_KEY and FI_SECRET_KEY.
  • Agent Command Center: BYOK gateway at /platform/monitor/command-center that handles multi-provider routing, guardrails, and budget controls.

Five-step integration walkthrough

Step 1: build the workflow in Agent Builder or the Agents SDK

Define agents, prompts, and tools in Agent Builder (visual) or the Agents SDK (code). Wire connectors through the Connector Registry rather than hard-coding API calls.

Step 2: install traceAI and register a tracer

The traceAI-openai-agents package auto-instruments the OpenAI Agents SDK runtime. The snippet below registers a tracer, then enables instrumentation.

import os

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai_agents import OpenAIAgentsInstrumentor

# Future AGI credentials; in production, set these outside the code.
os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

# Create an OpenTelemetry tracer provider bound to a Future AGI project.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="sales_research_agent",
)

# Wrap Agents SDK runs, handoffs, tool calls, and LLM calls with spans.
OpenAIAgentsInstrumentor().instrument(
    tracer_provider=trace_provider,
)

The instrumentor is published in the Apache 2.0 traceAI repo under python/instrumentation/. Every run, handoff, tool call, and LLM completion now streams to the Future AGI dashboard as OpenTelemetry spans.

Step 3: run the agent

After instrumentation is registered, run the agent the same way you would without it. No further code changes are needed.

from agents import Agent, Runner

researcher = Agent(
    name="Sales Researcher",
    instructions=(
        "Find five qualified fintech leads. "
        "Return JSON with name, company, and pain_point."
    ),
)

result = Runner.run_sync(
    researcher,
    "Find leads for our payments API.",
)
print(result.final_output)

Runner.run_sync and the underlying Runner.run are part of the OpenAI Agents SDK (openai.github.io/openai-agents-python).
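
If your service is already async, the same run can be awaited directly. A small sketch reusing the researcher agent defined above:

import asyncio

from agents import Runner

async def main() -> None:
    # Runner.run is the awaitable counterpart of Runner.run_sync.
    result = await Runner.run(researcher, "Find leads for our payments API.")
    print(result.final_output)

asyncio.run(main())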

Step 4: score the run with fi.evals

Score the agent output with the cloud task_completion template, or any other template in the Future AGI evaluation catalog (docs.futureagi.com).

from fi.evals import evaluate

scored = evaluate(
    eval_templates="task_completion",
    inputs={
        "input": "Find leads for our payments API.",
        "output": str(result.final_output),
        "expected_output": "A JSON list of five fintech leads.",
    },
    model_name="turing_flash",
)

print(scored.eval_results[0].metrics[0].value)

turing_flash returns in roughly 1-2 seconds, turing_small in 2-3 seconds, and turing_large in 3-5 seconds. Pick the tier that matches your latency budget.

Step 5: close the loop with a custom judge

When the catalog templates do not fit the rubric (e.g., a custom JSON schema, brand tone, regulatory clause), define a local judge with CustomLLMJudge and wrap it with Evaluator.

import os

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

# The judge model is routed through LiteLLM; override via the JUDGE_MODEL env var.
judge_model = os.getenv("JUDGE_MODEL", "gpt-4o-mini")

# Binary rubric: 1 if the output matches the lead schema, 0 otherwise.
schema_judge = CustomLLMJudge(
    name="lead_schema_judge",
    grading_criteria=(
        "Score 1 if the agent output is a JSON list of five "
        "items, each with non-empty name, company, and pain_point. "
        "Otherwise score 0."
    ),
    provider=LiteLLMProvider(model=judge_model),
)

evaluator = Evaluator(metric=schema_judge)
score = evaluator.evaluate(
    output=str(result.final_output),
    context="Sales leads schema v1",
)
print(score)

Pipe the judge output back into the prompt-opt loop. When a candidate prompt scores higher, promote it as the new production version.
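
A rough sketch of that promotion step, assuming the Agent, Runner, researcher, and evaluator objects from the earlier snippets are in scope and that evaluator.evaluate returns a numeric score as printed in Step 5; the candidate prompt is made up for illustration.

# Hypothetical promotion loop: re-run the agent with each candidate prompt,
# score the output with the schema judge defined above, and keep the winner.
candidate_prompts = {
    "baseline": researcher.instructions,
    "candidate_v2": (
        "Find five qualified fintech leads. Return strictly valid JSON: "
        "a list of five objects with name, company, and pain_point."
    ),
}

scores = {}
for label, prompt in candidate_prompts.items():
    candidate = Agent(name="Sales Researcher", instructions=prompt)
    run = Runner.run_sync(candidate, "Find leads for our payments API.")
    # Assumes evaluator.evaluate returns a numeric score (0 or 1 here).
    scores[label] = evaluator.evaluate(
        output=str(run.final_output),
        context="Sales leads schema v1",
    )

best = max(scores, key=scores.get)
print(f"Promote prompt: {best} (score={scores[best]})")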

Where the BYOK Agent Command Center fits

Agents in production usually need more than OpenAI. Some teams route long-context tasks to Claude, vision tasks to Gemini, and cheap classification to a small model. The Agent Command Center at /platform/monitor/command-center is the BYOK gateway that does this routing behind one endpoint while keeping the same traceAI spans and fi.evals scoring across providers. Guardrails (PII redaction, prompt injection checks, output filters) run inside the gateway before the request leaves your tenant.

AgentKit Evals vs Future AGI Evals (and where to use each)

Capability | AgentKit Evals | Future AGI Evals
Build-phase test datasets | Native dataset builder in Agent Builder | Synthetic dataset generator scoped to a knowledge base
Single-run trace grading | Yes, inside the canvas | Yes, with multi-template scoring
Continuous online evals on production traffic | Limited | First-class: every span scored against templates
Multi-modal scoring (image, audio, video) | Limited | First-class via multimodal catalog templates
Custom rubrics | Free-text grading | CustomLLMJudge + LiteLLM provider
Closed-loop prompt-opt | Auto prompt-rewrite inside Builder | Prompt-opt informed by production eval scores
Open-source SDK | No | Yes (Apache 2.0 ai-evaluation)

The pattern most teams converge on: AgentKit Evals during build, Future AGI for everything after the first deploy.

Common failure modes the integration catches

  • Loops. traceAI flags spans whose duration exceeds a configured budget.
  • Tool errors swallowed. Custom judge scores the structural validity of every tool response.
  • Hallucinated tool arguments. tool_call_accuracy template compares the called arguments against the schema.
  • Drifted system prompt. Prompt-opt loop alerts when a prompt change drops eval scores below baseline.
  • Cost spikes. Cost spans are rolled up per project; alerts fire when a percentile crosses a threshold.

Wrap-up

OpenAI AgentKit handles the long-standing pain of orchestration, UI, and connectors. Future AGI handles the long-standing pain of observability, evaluation, and continuous improvement. The integration is a few lines of Python (Step 2 above) and works whether the agent was authored visually or in the SDK. Future AGI’s traceAI and ai-evaluation SDKs are Apache 2.0 licensed on GitHub, so teams keep the option to swap pieces as the ecosystem evolves. Refer to the AgentKit documentation for the current OpenAI license terms.

To go deeper, see AI agent cost and observability in 2026, the agent evaluation frameworks comparison, and the agent debugging tools roundup. To start instrumenting today, install traceAI-openai-agents and follow Step 2.

Frequently asked questions

What is OpenAI AgentKit?
OpenAI AgentKit is a toolkit released by OpenAI on October 6, 2025 for building, deploying, and operating LLM agents. It bundles four pieces. Agent Builder is a visual workflow canvas. ChatKit is an embeddable chat UI. The Agents SDK gives Python and TypeScript control over agent logic and tools. The Connector Registry manages OAuth and API-key access to data sources, MCP servers, and built-in tools like web search and code interpreter.
What does Future AGI add on top of AgentKit?
AgentKit handles orchestration and the front-end. Future AGI adds the reliability layer: distributed tracing via the Apache 2.0 traceAI library, online evaluations via `fi.evals` cloud metrics, synthetic data generation for edge cases, prompt-opt loops, and a BYOK Agent Command Center gateway for multi-provider routing. Together they cover build, run, observe, evaluate, and optimize for the full agent lifecycle.
How do I auto-instrument an OpenAI AgentKit agent with Future AGI?
Install the `traceAI-openai-agents` package, register a trace provider with `register(project_type=ProjectType.OBSERVE, project_name=...)`, then call `OpenAIAgentsInstrumentor().instrument(tracer_provider=...)`. Every run, tool call, handoff, and LLM call streams into the Future AGI dashboard as OpenTelemetry spans. No code changes are needed inside the agent itself. The library is Apache 2.0 licensed on GitHub at future-agi/traceAI.
Can I use Future AGI with agents built outside AgentKit?
Yes. Future AGI is provider-agnostic. traceAI ships instrumentors for OpenAI Agents SDK, LangChain, LlamaIndex, CrewAI, AutoGen, LiteLLM, Vertex AI, Anthropic, Mistral, and more. Any framework that emits OpenTelemetry spans can also be ingested directly. The `fi.evals` evaluator works on any output regardless of how the output was generated.
How does Future AGI score AgentKit traces?
Use `evaluate(eval_templates="context_adherence", inputs={...})` or any of the catalog templates such as `faithfulness`, `groundedness`, `tool_call_accuracy`, or `task_completion`. Three judge models are exposed: `turing_flash` for roughly 1-2 second cloud latency, `turing_small` for roughly 2-3 seconds, and `turing_large` for roughly 3-5 seconds. Custom rubrics are built with `CustomLLMJudge` from `fi.evals.metrics`.
Is the integration open source?
traceAI is Apache 2.0 (github.com/future-agi/traceAI/blob/main/LICENSE). The `ai-evaluation` Python SDK that exposes `fi.evals` is Apache 2.0 (github.com/future-agi/ai-evaluation/blob/main/LICENSE). The Future AGI cloud platform, dashboards, and judge models are the proprietary layer that the open SDKs talk to.
What does the Agent Command Center add?
The Agent Command Center is the BYOK gateway and policy plane at `/platform/monitor/command-center`. It routes calls from your agent to OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, and local models through a single endpoint, enforces guardrails and PII filters before the request leaves your tenant, and emits the same OpenTelemetry spans that the auto-instrumentation produces.
Where does AgentKit Evals end and Future AGI Evals begin?
AgentKit Evals are good for in-development scoring of a single workflow run. Future AGI scores live production traffic continuously, supports multi-modal evaluation, and pairs with the prompt-opt loop to push improvements back into the agent. Most teams use AgentKit Evals during the build phase and Future AGI for post-deploy observability and continuous evaluation.