What Is the OpenAI Agent SDK?

OpenAI's framework for building traceable agents with tools, handoffs, guardrails, streaming, and production evaluation hooks.

The OpenAI Agent SDK is OpenAI’s framework for building traceable agents that can call tools, hand off tasks, run guardrails, stream results, and record execution spans. It is an agent framework, not a model or standalone orchestrator. In production, it shows up as agent runs, function-tool calls, guardrail checks, handoff spans, and final outputs. FutureAGI connects through traceAI:openai-agents to evaluate tool choice, task completion, safety failures, latency, and token cost per trace.

Why It Matters in Production LLM and Agent Systems

Wrong SDK assumptions turn ordinary LLM errors into workflow faults. An OpenAI Agent SDK support agent can choose the refund tool for an order-status request, hand off to a billing specialist without the account context, or let a guardrail run only at the first agent boundary while a later tool call does the risky work. The user sees a bad answer. The platform team sees a long trace with the wrong branch.

Developers feel the pain when local examples pass but production runs diverge after a new tool, handoff, model, or instruction prefix ships. SREs see p99 latency rise because a manager agent keeps retrying a slow function. Product owners see task-completion drops in one cohort. Compliance reviewers ask why a sensitive tool was called and need a trace that explains the agent’s decision.

This matters more in 2026 multi-step systems because the SDK is often one layer in a wider stack: OpenAI models (GPT-5.x), hosted tools, custom function tools, MCP servers, vector retrieval, human review, and gateway policies. The built-in tracing surface is useful, but traces alone do not say whether the agent picked the right action. A production team needs evaluation on top of the trace: tool accuracy, handoff quality, prompt-injection exposure, cost per run, and task completion by workflow version.

How FutureAGI Handles the OpenAI Agent SDK

FutureAGI’s approach is to treat the OpenAI Agent SDK as a production execution surface, measurable from /platform/evaluate, not just a development library. With traceAI:openai-agents, an agent run can be captured as a trace containing planner activity, function-tool calls, handoff spans, guardrail checks, model calls, latency, status, and token fields such as llm.token_count.prompt. The key field for agent analysis is the ordered path, usually represented by agent.trajectory.step.
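
To make the ordered path concrete, here is a minimal sketch of rebuilding it from exported spans. The span shape (a dict with name, kind, and attributes) and the integer value of agent.trajectory.step are assumptions made for illustration; a real traceAI:openai-agents export may look different.

```python
from typing import Any

# Hypothetical flattened spans for one agent run (shape assumed, not the
# documented traceAI schema). They arrive out of order on purpose.
spans: list[dict[str, Any]] = [
    {"name": "policy_agent", "kind": "handoff",
     "attributes": {"agent.trajectory.step": 3}},
    {"name": "triage_agent", "kind": "agent",
     "attributes": {"agent.trajectory.step": 1, "llm.token_count.prompt": 412}},
    {"name": "refund_lookup", "kind": "function_tool",
     "attributes": {"agent.trajectory.step": 2}},
    {"name": "final_response", "kind": "agent",
     "attributes": {"agent.trajectory.step": 4, "llm.token_count.prompt": 655}},
]


def ordered_path(run_spans: list[dict[str, Any]]) -> list[str]:
    """Return 'kind:name' labels sorted by the trajectory step attribute."""
    stepped = [s for s in run_spans if "agent.trajectory.step" in s["attributes"]]
    stepped.sort(key=lambda s: s["attributes"]["agent.trajectory.step"])
    return [f'{s["kind"]}:{s["name"]}' for s in stepped]


def prompt_tokens(run_spans: list[dict[str, Any]]) -> int:
    """Sum prompt tokens over every span that reports them."""
    return sum(s["attributes"].get("llm.token_count.prompt", 0) for s in run_spans)


print(ordered_path(spans))
print(prompt_tokens(spans))
```

Sorting by the step attribute rather than by arrival order keeps the path stable even when spans are exported asynchronously.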

Example: a fintech support agent receives “Where is my refund?” The SDK triage agent may call refund_lookup, hand off to a policy agent, and produce a final customer response. FutureAGI evaluates that run with ToolSelectionAccuracy for the chosen tool, TaskCompletion for the final outcome, TrajectoryScore for the path, and PromptInjection when the request contains adversarial instructions. If the tool is wrong or the handoff loops, the failed trace becomes a regression case.
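
One way to picture a failed trace becoming a regression case is the sketch below. The RegressionCase record, the to_regression_case helper, the trace dict keys, and the 0-to-1 score scale are all hypothetical; they are not part of the FutureAGI or OpenAI APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegressionCase:
    """Hypothetical record that pins a failed run to the versions that produced it."""
    trace_id: str
    user_input: str
    expected_tool: str
    failed_evals: tuple
    prompt_version: str
    model: str


def to_regression_case(trace: dict, scores: dict, expected_tool: str,
                       threshold: float = 0.5) -> Optional[RegressionCase]:
    """Return a regression case if any evaluator score is below threshold, else None."""
    failed = tuple(sorted(name for name, score in scores.items() if score < threshold))
    if not failed:
        return None
    return RegressionCase(
        trace_id=trace["trace_id"],
        user_input=trace["input"],
        expected_tool=expected_tool,
        failed_evals=failed,
        prompt_version=trace["prompt_version"],
        model=trace["model"],
    )


# Illustrative values only; scores are assumed to be floats between 0 and 1.
failed_trace = {
    "trace_id": "trace-001",
    "input": "Where is my refund?",
    "prompt_version": "triage-v7",
    "model": "example-model",
}
case = to_regression_case(
    failed_trace,
    {"ToolSelectionAccuracy": 0.0, "TaskCompletion": 1.0},
    expected_tool="refund_lookup",
)
print(case)
```

Storing the prompt version and model next to the input is what lets the same case be replayed after the next change.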

Unlike a LangSmith-style trace review that asks an engineer to inspect the path manually, this workflow turns OpenAI Agent SDK behavior into release criteria. The engineer can set an eval-fail-rate threshold for openai-agents traces, alert when agent.trajectory.step repeats beyond policy, route risky prompts through an Agent Command Center pre-guardrail, or block a deployment when a prompt change lowers task completion. The result is a measurable agent loop instead of a pile of screenshots from staging.
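
The release criteria described above can be sketched as a small gate. The function names, the 5% fail-rate limit, and the repeat limit of 3 are illustrative assumptions, and "repeats" is read here as the same step label appearing several times in one path.

```python
from collections import Counter


def eval_fail_rate(eval_passed: list) -> float:
    """Share of evaluated traces that failed; 0.0 when nothing was evaluated."""
    if not eval_passed:
        return 0.0
    return eval_passed.count(False) / len(eval_passed)


def max_step_repeats(path: list) -> int:
    """Highest number of times any single step label occurs in one path."""
    return max(Counter(path).values(), default=0)


def release_blocked(eval_passed: list, paths: list,
                    max_fail_rate: float = 0.05, max_repeats: int = 3) -> bool:
    """Block when the fail rate or any path's repetition exceeds its limit."""
    too_many_failures = eval_fail_rate(eval_passed) > max_fail_rate
    looping = any(max_step_repeats(p) > max_repeats for p in paths)
    return too_many_failures or looping


# Illustrative batch: 1 failure in 100 runs, but one path loops on a tool.
results = [True] * 99 + [False]
paths = [
    ["triage", "refund_lookup", "policy_agent"],
    ["triage"] + ["refund_lookup"] * 4,
]
print(release_blocked(results, paths))
```

Keeping the two conditions separate makes it easy to report which one blocked the release.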

How to Measure or Detect It

Measure the OpenAI Agent SDK by scoring both final outcomes and the intermediate decisions that produced them.

  • ToolSelectionAccuracy: evaluates whether the SDK agent selected the expected tool for the user’s intent.
  • TaskCompletion: checks whether the run completed the assigned task, not just whether the final text looked plausible.
  • TrajectoryScore: scores the path through agent steps, tools, and handoffs.
  • PromptInjection: detects prompt-injection risk in inputs or agent-visible context.
  • Trace signals: repeated agent.trajectory.step, rising llm.token_count.prompt, handoff count, tool-timeout rate, guardrail trip rate, p99 latency, and token-cost-per-trace.
  • User proxies: thumbs-down rate, escalation rate, reopened-ticket rate, and manual-review rate by SDK workflow version.
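
# Score one captured agent run for tool choice and task outcome.
# Assumes trace_spans already holds the spans of that run, as exported
# through traceAI:openai-agents.
from fi.evals import ToolSelectionAccuracy, TaskCompletion

tool_eval = ToolSelectionAccuracy()
task_eval = TaskCompletion()

# Did the agent pick the tool that matches the user's intent?
tool_score = tool_eval.evaluate(trajectory=trace_spans, expected_tool="refund_lookup")
# Did the run reach the expected outcome, not just a plausible answer?
task_score = task_eval.evaluate(trajectory=trace_spans, expected_outcome="refund status returned")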

| SDK surface | FAGI evaluator | Trace artifact | Common failure |
| --- | --- | --- | --- |
| Agent instructions | PromptAdherence | agent span + system prompt | Agent ignores system prompt |
| Function tool | ToolSelectionAccuracy | tool call span + schema | Wrong tool chosen |
| Handoff | TrajectoryScore + payload check | handoff span | State drops on transfer |
| Guardrail | PromptInjection, Toxicity, PII | guardrail span | Rail not applied past entry agent |
| Run output | TaskCompletion | final response + trajectory | Plausible answer, wrong outcome |
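
The trace signals listed earlier in this section can be aggregated with a few lines of standard-library Python. This is a sketch under assumptions: the per-trace dict keys are hypothetical, the prices are placeholders in arbitrary units rather than real OpenAI pricing, and the percentile uses the nearest-rank method.

```python
def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def cost_per_trace(trace: dict, prompt_price_per_1k: float,
                   completion_price_per_1k: float) -> float:
    """Token cost of one trace given per-1K-token prices."""
    return (trace["prompt_tokens"] * prompt_price_per_1k
            + trace["completion_tokens"] * completion_price_per_1k) / 1000


def tool_timeout_rate(traces: list) -> float:
    """Timed-out tool calls divided by all tool calls across the traces."""
    calls = sum(t["tool_calls"] for t in traces)
    timeouts = sum(t["tool_timeouts"] for t in traces)
    return timeouts / calls if calls else 0.0


# Hypothetical per-trace summaries.
traces = [
    {"latency_ms": 820, "prompt_tokens": 1200, "completion_tokens": 300,
     "tool_calls": 2, "tool_timeouts": 0},
    {"latency_ms": 1430, "prompt_tokens": 2100, "completion_tokens": 450,
     "tool_calls": 3, "tool_timeouts": 1},
    {"latency_ms": 6100, "prompt_tokens": 5400, "completion_tokens": 900,
     "tool_calls": 5, "tool_timeouts": 1},
]

p99_latency = percentile([t["latency_ms"] for t in traces], 99)
costs = [cost_per_trace(t, 0.5, 1.5) for t in traces]  # placeholder prices
print(p99_latency, costs, tool_timeout_rate(traces))
```

With only a handful of traces the p99 is simply the slowest run; the number becomes meaningful once it is computed over a full cohort.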

Public anchors: Sierra’s τ-bench (≈220 multi-turn customer-support tasks, frontier pass^1 ~50-60%) and Berkeley’s BFCL v3 function-calling leaderboard (multi-turn pass rate ~55-65% for frontier models) are the standard 2026 stress tests for SDK-style agent runtimes. Use them as the upper bound when calibrating in-app ToolSelectionAccuracy and TaskCompletion thresholds for an OpenAI Agent SDK workflow. A route scoring below 40% on either benchmark is in the long tail of agentic-AI capability, not the leading edge.

What to evaluate at each SDK surface

The OpenAI Agent SDK exposes four named surfaces, and each one wants a different evaluator. Instruction adherence at the agent boundary needs PromptAdherence: does the agent’s response follow the agent-level system prompt? Tool choice at each function-tool call wants ToolSelectionAccuracy: was that the right tool for this state? Handoff quality at every transfer wants TrajectoryScore plus a check that the handoff payload preserved task state. Guardrail correctness wants PromptInjection, Toxicity, or PII depending on the rail.
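
The split above can be written down as a routing table. In this sketch the evaluator names are plain strings taken from the text, not instantiated FutureAGI classes; the span kinds, the handoff_payload_check label, and the helper functions are hypothetical.

```python
# Surface -> evaluator names, following the paragraph above.
EVALUATORS_BY_SURFACE = {
    "agent": ["PromptAdherence"],
    "function_tool": ["ToolSelectionAccuracy"],
    "handoff": ["TrajectoryScore", "handoff_payload_check"],
    "guardrail": ["PromptInjection", "Toxicity", "PII"],
}


def plan_evaluations(spans: list) -> list:
    """Return (span name, evaluator name) pairs for every recognised surface."""
    plan = []
    for span in spans:
        for evaluator in EVALUATORS_BY_SURFACE.get(span["kind"], []):
            plan.append((span["name"], evaluator))
    return plan


def handoff_payload_preserved(payload: dict, required_keys: set) -> bool:
    """True when the handoff payload carries every required piece of task state."""
    return required_keys.issubset(payload.keys())


# Hypothetical spans from one run.
run = [
    {"name": "triage_agent", "kind": "agent"},
    {"name": "refund_lookup", "kind": "function_tool"},
    {"name": "to_policy_agent", "kind": "handoff"},
]
print(plan_evaluations(run))
print(handoff_payload_preserved({"order_id": "A1"}, {"order_id", "account_id"}))
```

A span kind that is missing from the table is skipped rather than scored with a default evaluator, which keeps attribution unambiguous.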

Without this split, teams attach a single end-to-end evaluator to the run, watch it fail, and have no way to attribute the failure to a specific surface. In our 2026 evals, OpenAI Agent SDK runs that decompose evaluation across surfaces resolve incidents in roughly half the time of runs that only score the final response. The traces look identical; the eval policy is the difference.

A second 2026-specific note: GPT-5.x’s tool-use defaults are aggressive: it will call a tool when a simple completion would suffice. That shows up in trace data as elevated agent.trajectory.step counts and inflated cost-per-trace. The fix is usually in the agent’s instructions field (telling it explicitly when not to use tools), but you only spot the pattern by running StepEfficiency across the agent’s cohort, not by reading individual traces.
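
A rough stand-in for that cohort view is sketched below. It does not call the StepEfficiency evaluator; it only compares median step counts per workflow version against a baseline. The trace fields, the version labels, and the 1.5x tolerance are assumptions.

```python
from collections import defaultdict
from statistics import median


def median_steps_by_cohort(traces: list) -> dict:
    """Median trajectory step count for each workflow version."""
    steps = defaultdict(list)
    for trace in traces:
        steps[trace["workflow_version"]].append(trace["step_count"])
    return {version: median(counts) for version, counts in steps.items()}


def flag_tool_heavy(medians: dict, baseline: str, tolerance: float = 1.5) -> list:
    """Versions whose median step count exceeds the baseline by the tolerance factor."""
    limit = medians[baseline] * tolerance
    return sorted(v for v, m in medians.items() if v != baseline and m > limit)


# Hypothetical cohort: v2 shipped a new model default and takes more steps.
cohort = [
    {"workflow_version": "v1", "step_count": 2},
    {"workflow_version": "v1", "step_count": 3},
    {"workflow_version": "v1", "step_count": 2},
    {"workflow_version": "v2", "step_count": 5},
    {"workflow_version": "v2", "step_count": 6},
    {"workflow_version": "v2", "step_count": 4},
]
medians = median_steps_by_cohort(cohort)
print(medians, flag_tool_heavy(medians, baseline="v1"))
```

The median is used instead of the mean so that one runaway trace does not flag an otherwise healthy version.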

Common Mistakes

The common errors are usually runtime errors, not syntax errors:

  • Evaluating only the final answer. A correct answer can hide an unauthorized tool call, unnecessary handoff, or expensive retry path.
  • Assuming SDK tracing is evaluation. A trace records what happened; it does not prove the chosen tool or path was correct.
  • Putting guardrails only on the entry agent. Handoffs and function tools may need their own checks for sensitive actions.
  • Ignoring workflow versioning. Without prompt, tool, and model versions, a failed trace cannot become a reliable regression test.
  • Treating handoffs like ordinary function calls. A handoff changes ownership of the task, context window, and final-output contract.
  • Pinning to a single GPT-5.x snapshot for too long. OpenAI rotates GPT-5.x defaults more often than the model string suggests; rerun your full regression eval at least monthly even when the version label looks unchanged.
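
The entry-agent-only guardrail mistake can be detected from the ordered spans of a run. This is a minimal sketch, assuming hypothetical span kinds and a policy where a handoff resets guardrail coverage; the function name and the sensitive-tool set are illustrative.

```python
def unguarded_sensitive_calls(spans: list, sensitive_tools: set) -> list:
    """Names of sensitive tool calls with no guardrail check since the last handoff.

    Spans must already be in execution order.
    """
    guarded = False
    unguarded = []
    for span in spans:
        if span["kind"] == "guardrail":
            guarded = True
        elif span["kind"] == "handoff":
            # Ownership of the task changes, so earlier checks no longer count.
            guarded = False
        elif span["kind"] == "function_tool":
            if span["name"] in sensitive_tools and not guarded:
                unguarded.append(span["name"])
    return unguarded


# Hypothetical run: the guardrail runs once at entry, then a handoff happens
# before the sensitive tool is called.
entry_only = [
    {"kind": "guardrail", "name": "input_check"},
    {"kind": "agent", "name": "triage_agent"},
    {"kind": "handoff", "name": "to_billing_agent"},
    {"kind": "function_tool", "name": "issue_refund"},
]
# Same run with a second guardrail after the handoff.
rechecked = entry_only[:3] + [
    {"kind": "guardrail", "name": "billing_check"},
    {"kind": "function_tool", "name": "issue_refund"},
]
print(unguarded_sensitive_calls(entry_only, {"issue_refund"}))
print(unguarded_sensitive_calls(rechecked, {"issue_refund"}))
```

Whether a handoff should reset coverage is a policy decision; the stricter reading is used here because it surfaces the failure described in the list.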

Frequently Asked Questions

What is the OpenAI Agent SDK?

The OpenAI Agent SDK is OpenAI's framework for building traceable agents that call tools, hand off work, run guardrails, stream responses, and emit production traces.

How is the OpenAI Agent SDK different from the OpenAI API?

The OpenAI API gives applications access to models and model-side capabilities. The OpenAI Agent SDK adds agent runtime primitives such as agents, tools, handoffs, guardrails, streaming, and tracing.

How do you measure OpenAI Agent SDK reliability?

FutureAGI measures OpenAI Agent SDK runs through traceAI:openai-agents fields such as agent.trajectory.step and evaluators including ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, and PromptInjection.