What Is AutoGen?

AutoGen is an open-source agent framework for building conversational single-agent and multi-agent LLM applications where agents exchange messages, call tools, run code, and sometimes involve a human reviewer. It belongs to the agent framework family and shows up in production as group-chat turns, routed agent messages, tool calls, termination decisions, and trace spans. FutureAGI instruments AutoGen with traceAI:autogen so teams can score trajectories, tool choices, task completion, cost, and regressions across agent versions.

AutoGen v0.5 (Microsoft, late 2025) sharpened the framework around three tiers: AgentChat (high-level multi-agent conversations), Core (event-driven actor runtime), and Extensions (tools, models, code executors). By May 2026 it sits alongside LangGraph 1.x, CrewAI 0.80+, OpenAI Agents SDK, Agno, and BeeAI as one of the dominant Python multi-agent stacks, and it is frequently chosen by enterprise teams already standardized on Microsoft platforms.

Why It Matters in Production LLM and Agent Systems

AutoGen failures usually appear in the conversation between agents, not only in the final answer. A planner can select the wrong specialist, a coder agent can run unsafe code, a reviewer can be skipped, or a group chat can continue after the task is already complete. Those are agent failure modes: wrong speaker selection, excessive tool use, stale shared context, and runaway conversation loops.

Developers feel the pain first: AutoGen makes it easy to prototype collaboration patterns but hard to inspect every turn after a production incident. SREs see p99 latency, retry count, and token-cost-per-trace rise when one task triggers many agent messages. Product teams see inconsistent task completion by workflow type. Compliance teams need to know which agent invoked a tool, whether a human was required, and what evidence supported the final action.

Common symptoms include:

  • Repeated group-chat turns.
  • Missing termination conditions.
  • Tool calls with empty or conflicting arguments.
  • Code execution errors hidden in intermediate messages.
  • Traces where the final response looks correct but the path was unsafe.
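Two of these symptoms, repeated turns and tool calls with empty arguments, can be flagged directly from a transcript. The following is a minimal sketch only: the message shape (dicts with "speaker", "content", and an optional "tool_call") and the function name find_symptoms are assumptions made for illustration, not AutoGen's own message types.

```python
from collections import Counter


def find_symptoms(messages, max_repeats=2):
    """Return warnings for a transcript given as a list of dicts.

    Assumed (hypothetical) message shape:
    {"speaker": str, "content": str, "tool_call": {"name": str, "arguments": dict}}
    where "tool_call" is optional.
    """
    warnings = []

    # Repeated group-chat turns: the same speaker sending identical content.
    counts = Counter((m["speaker"], m["content"]) for m in messages)
    for (speaker, _content), n in counts.items():
        if n > max_repeats:
            warnings.append(f"{speaker} repeated a turn {n} times")

    # Tool calls whose arguments are missing or empty.
    for i, m in enumerate(messages):
        call = m.get("tool_call")
        if call is not None and not call.get("arguments"):
            warnings.append(
                f"turn {i}: tool '{call.get('name')}' called with empty arguments"
            )

    return warnings
```

The max_repeats default is arbitrary; a real threshold would depend on how the team's agents are expected to converse.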

This matters more in 2026 multi-step systems because AutoGen runs often sit beside MCP tools, RAG retrievers, gateway policies, and human review. A single user request can cross models (GPT-5.x, Claude Opus 4.7), tools, and agents before the user sees one sentence.

How FutureAGI Handles AutoGen

FutureAGI’s approach is to treat AutoGen as a traceable multi-agent runtime, not a black-box conversation transcript. In a FutureAGI workflow, traceAI:autogen instruments AutoGen AgentChat and Core runs so group-chat turns, routed agent messages, tool calls, model spans, and termination checks share one trace id. The key trace field is agent.trajectory.step; token and cost analysis use fields such as llm.token_count.prompt.
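Once spans share a trace id, per-trace step counts and prompt-token totals fall out of a simple aggregation. This is a sketch under stated assumptions: the span dict layout ("trace_id" plus an "attributes" mapping) and the helper summarize_traces are hypothetical; only the attribute names agent.trajectory.step and llm.token_count.prompt come from the text above.

```python
from collections import defaultdict


def summarize_traces(spans):
    """Group spans by trace id and total their steps and prompt tokens.

    Assumed (hypothetical) span shape:
    {"trace_id": str, "attributes": {"agent.trajectory.step": int,
                                     "llm.token_count.prompt": int}}
    Either attribute may be absent on a given span.
    """
    summary = defaultdict(lambda: {"steps": 0, "prompt_tokens": 0})
    for span in spans:
        attrs = span["attributes"]
        row = summary[span["trace_id"]]
        # Count a span as a trajectory step only if it carries the step field.
        if "agent.trajectory.step" in attrs:
            row["steps"] += 1
        # Spans without a prompt-token count contribute zero.
        row["prompt_tokens"] += attrs.get("llm.token_count.prompt", 0)
    return dict(summary)
```

Sorting the result by prompt_tokens is one quick way to find the traces where a single task triggered many agent messages.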

The evaluator surface for AutoGen runs:

Evaluator              What it checks               Threshold
ToolSelectionAccuracy  Right tool per step          ≥ 0.90 on write-actions
TaskCompletion         Final outcome matches goal   ≥ 0.85
TrajectoryScore        Path quality                 ≥ 0.80
StepEfficiency         Wasted turns or tool calls   Median within 1.3x optimal
Groundedness           LLM message support          ≥ 0.80 for cited answers
PromptInjection        Injected tool returns        Hard block on positive
ActionSafety           Code execution / write tool  Hard block on policy violation
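These thresholds can be applied as a release gate. The sketch below is illustrative only: the gate function and the input shapes (a dict of scores and a dict of boolean flags) are assumptions, not FutureAGI's API. It also ignores the scoping in the table ("on write-actions", "for cited answers") and leaves StepEfficiency out, because its threshold is a median across runs rather than a per-run score.

```python
# Minimum scores taken from the table above.
MIN_SCORES = {
    "ToolSelectionAccuracy": 0.90,
    "TaskCompletion": 0.85,
    "TrajectoryScore": 0.80,
    "Groundedness": 0.80,
}
# Evaluators that block outright when they fire.
HARD_BLOCKS = ("PromptInjection", "ActionSafety")


def gate(scores, flags):
    """Return (passed, reasons) for one run.

    scores: evaluator name -> float score (missing evaluators are skipped).
    flags:  evaluator name -> True if a hard-block condition was detected.
    Both shapes are hypothetical, chosen for this sketch.
    """
    reasons = []
    for name in HARD_BLOCKS:
        if flags.get(name, False):
            reasons.append(f"{name}: hard block")
    for name, minimum in MIN_SCORES.items():
        score = scores.get(name)
        if score is not None and score < minimum:
            reasons.append(f"{name}: {score:.2f} below {minimum:.2f}")
    return (not reasons, reasons)
```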

Example: a financial operations team runs an AutoGen team with a planner, a policy analyst, a database tool agent, and a reviewer. A refund request should move through classify_request -> policy_lookup -> account_check -> reviewer_approval -> final_response. FutureAGI records each step, scores the selected tool with ToolSelectionAccuracy, checks the outcome with TaskCompletion, and evaluates the path with TrajectoryScore and StepEfficiency.
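The expected refund path can also be checked mechanically before any scoring. This is a minimal sketch, assuming the trace has already been reduced to an ordered list of step names; check_path and its return shape are hypothetical and are not part of FutureAGI or AutoGen.

```python
EXPECTED_PATH = [
    "classify_request",
    "policy_lookup",
    "account_check",
    "reviewer_approval",
    "final_response",
]


def check_path(observed, expected=EXPECTED_PATH):
    """Compare an observed list of step names against the expected path."""
    # Expected steps that never appeared.
    missing = [step for step in expected if step not in observed]

    # First occurrence of each expected step, in the order it was observed.
    seen = []
    for step in observed:
        if step in expected and step not in seen:
            seen.append(step)
    # The observed order is fine if it matches the expected relative order.
    in_order = seen == [step for step in expected if step in seen]

    # Steps that ran more than once.
    repeated = sorted({step for step in observed if observed.count(step) > 1})

    return {"missing": missing, "in_order": in_order, "repeated": repeated}
```

A run that calls account_check twice and never reaches reviewer_approval would report one repeated step and one missing step while still being in order.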

Unlike LangGraph’s explicit state graph, AutoGen often expresses collaboration through agent conversation. That makes trace quality critical: the engineer needs to know who spoke, why the next speaker was chosen, what tool ran, and why the chat stopped. If a release causes the planner to call the database agent twice and skip the reviewer, FutureAGI can raise an eval-fail-rate-by-agent-version alert. The next action is concrete: tighten speaker rules, add a termination condition, route risky actions through human approval, and run the failed traces as a regression dataset.
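An eval-fail-rate-by-agent-version alert reduces to a grouped failure rate compared against a baseline. The sketch below shows one way to compute it. The result-record shape, the function names, and the 0.05 allowed increase are all assumptions for illustration, not how FutureAGI implements the alert.

```python
from collections import defaultdict


def fail_rate_by_version(results):
    """Return agent_version -> fraction of failed evals.

    Assumed (hypothetical) record shape: {"agent_version": str, "passed": bool}
    """
    totals = defaultdict(int)
    fails = defaultdict(int)
    for r in results:
        totals[r["agent_version"]] += 1
        if not r["passed"]:
            fails[r["agent_version"]] += 1
    return {version: fails[version] / totals[version] for version in totals}


def versions_to_alert(results, baseline_version, max_increase=0.05):
    """List versions whose fail rate exceeds the baseline by more than max_increase."""
    rates = fail_rate_by_version(results)
    baseline = rates[baseline_version]
    return sorted(
        version
        for version, rate in rates.items()
        if version != baseline_version and rate - baseline > max_increase
    )
```

The traces behind any alerted version are natural candidates for the regression dataset mentioned above.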

In our 2026 evals across AutoGen-based fintech and B2B-SaaS deployments, StepEfficiency is the second-most-actionable signal after ToolSelectionAccuracy: group chats routinely add 30-60% more turns than necessary, and pruning them shaves 20%+ off token cost without changing outcome quality. Public function-calling and agent benchmarks set the reference floor here: BFCL v3 (Berkeley Function Calling Leaderboard v3, multi-turn and multi-step categories) has frontier models clearing 90%+ on single tool calls but falling 10-20 points when state must persist across turns, and τ-bench (Sierra, ~165 customer-support tasks) shows the same gap on end-to-end resolution rate.
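The "median within 1.3x optimal" threshold for StepEfficiency can be read as a ratio of actual to optimal step counts, taken at the median across runs. The sketch below makes that reading concrete. How the optimal step count is determined is not specified here, so it is treated as a given input, and the function names are hypothetical.

```python
from statistics import median


def step_ratio(actual_steps, optimal_steps):
    """Ratio of steps taken to the assumed optimal step count (1.0 is ideal)."""
    return actual_steps / optimal_steps


def median_within_budget(runs, budget=1.3):
    """Check whether the median step ratio stays within the budget.

    runs: list of (actual_steps, optimal_steps) pairs, one per run.
    Returns (median_ratio, within_budget).
    """
    ratios = [step_ratio(actual, optimal) for actual, optimal in runs]
    m = median(ratios)
    return m, m <= budget
```

For example, a run that takes 8 steps where 5 would do has a ratio of 1.6, which is the kind of 30-60% overhead described above.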

How to Measure or Detect AutoGen Issues

Measure AutoGen by scoring both the final result and the conversation path that produced it:

  • ToolSelectionAccuracy: whether the agent chose the expected tool for the task and step.
  • TaskCompletion: whether the AutoGen team completed the requested outcome.
  • TrajectoryScore: quality of the ordered sequence of agent decisions, messages, observations, and actions.
  • StepEfficiency: unnecessary turns, repeated work, and avoidable tool calls.
  • Groundedness and Faithfulness: message-level grounding of LLM output in retrieved context.
  • PromptInjection: injected commands in returned tool output.
  • Trace signals: repeated agent.trajectory.step, missing parent spans, high tool-timeout rate, rising llm.token_count.prompt, and token-cost-per-trace by AutoGen team.
  • User proxies: thumbs-down rate, reopened-ticket rate, escalation rate, and manual-review overrides after AutoGen-assisted workflows.

Minimal Python:

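from fi.evals import ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, StepEfficiency

# Placeholder inputs: in practice these come from a traced AutoGen run.
trace_steps = []    # ordered agent steps exported from the trace
user_request = ""   # the original user request
final_answer = ""   # the final response returned to the user

# One evaluator per signal: tool choice, outcome, path quality, wasted steps.
tool_eval = ToolSelectionAccuracy()
done_eval = TaskCompletion()
path_eval = TrajectoryScore()
eff_eval = StepEfficiency()

# Score the tool choice against the tool expected for this task.
tool_result = tool_eval.evaluate(
    trajectory=trace_steps,
    expected_tool="policy_lookup",
)
# Score the final outcome and the path that produced it.
done_result = done_eval.evaluate(input=user_request, output=final_answer)
path_result = path_eval.evaluate(trajectory=trace_steps)
eff_result = eff_eval.evaluate(trajectory=trace_steps)
print(tool_result.score, done_result.score, path_result.score, eff_result.score)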
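Some of the trace signals listed above, such as missing parent spans and a high tool-timeout rate, need no evaluator at all. The following is a rough sketch over exported spans. The span fields used here ("span_id", "parent_id", "kind", "status") and both function names are assumptions for illustration, not the traceAI schema.

```python
def find_orphan_spans(spans):
    """Return ids of spans whose parent_id points at a span that is not present.

    Root spans are expected to have parent_id set to None (or absent).
    """
    known_ids = {s["span_id"] for s in spans}
    return [
        s["span_id"]
        for s in spans
        if s.get("parent_id") is not None and s["parent_id"] not in known_ids
    ]


def tool_timeout_rate(spans):
    """Fraction of tool spans that ended in a timeout (0.0 if there are none)."""
    tool_spans = [s for s in spans if s.get("kind") == "tool"]
    if not tool_spans:
        return 0.0
    timed_out = sum(1 for s in tool_spans if s.get("status") == "timeout")
    return timed_out / len(tool_spans)
```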

Common Mistakes

  • Treating AutoGen as chat transcripts only. Without per-agent step labels, group-chat bugs become final-answer debugging.
  • Letting speaker selection stay implicit. A wrong next speaker can cause loops, skipped review, or tool calls from the wrong role.
  • Scoring only the last assistant message. AutoGen can reach a good answer through unsafe code execution or unauthorized tool use.
  • Sharing memory across agents without ownership. Planner notes, reviewer objections, and tool outputs need scoped state and traceable writes.
  • Migrating without regression traces. The same task may change iteration count, tool retries, termination behavior, and human-review points.
  • No termination condition. Group chats can continue past goal completion until token cost spikes.
  • Skipping code-execution sandboxing. AutoGen’s code executor needs the same safety guardrails as any tool that can write state.
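The termination mistake is the easiest one to guard against in code. The sketch below is framework-independent and does not use AutoGen's own termination mechanisms. The run_chat helper, the next_message callable, and the "TERMINATE" marker are assumptions chosen for illustration.

```python
def run_chat(next_message, max_turns=10, done_marker="TERMINATE"):
    """Drive a chat loop until a done marker appears or the turn budget runs out.

    next_message: callable that takes the history (list of str) and returns
                  the next message as a str. Hypothetical stand-in for an
                  agent team producing its next turn.
    Returns (history, reason), where reason is "completed" or "max_turns_reached".
    """
    history = []
    for _ in range(max_turns):
        message = next_message(history)
        history.append(message)
        # Stop as soon as any agent signals that the goal is complete.
        if done_marker in message:
            return history, "completed"
    # Hard cap: the chat never signalled completion within the budget.
    return history, "max_turns_reached"
```

Combining an explicit completion signal with a hard turn cap means a missed signal costs at most max_turns messages instead of running until token cost spikes.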

Frequently Asked Questions

What is AutoGen?

AutoGen is an open-source framework for building conversational LLM agents and multi-agent applications. Agents can exchange messages, call tools, execute code, and involve a human reviewer.

How is AutoGen different from LangGraph?

AutoGen centers on conversational agents, group chats, routed messages, and agent teams. LangGraph centers on explicit state graphs, nodes, and edges for controllable agent workflows.

How do you measure AutoGen?

FutureAGI measures AutoGen with traceAI:autogen spans such as agent.trajectory.step, then scores runs with ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, and StepEfficiency.