The Open-Source AI Agent Stack in 2026: LangChain, LangGraph, CrewAI, AutoGen, Agents SDK, MS Agent Framework, Mastra Plus FAGI traceAI Apache 2.0
An engineering team builds an AI agent for internal document Q&A. They wire LangGraph, LlamaIndex with Qdrant, Ollama running Llama 3.1, and ship to a small pilot. Six weeks in, an executive asks two questions: which tool calls are failing, and how do we know the answers are not hallucinated. The team has no answer for either. The framework runs; the model serves; the retriever retrieves; there is no trace layer and no eval layer. This is the 2025-vintage open-source AI agent stack: four of the six layers, missing the two that matter most for production. This post is the 2026 picture: the six-layer reference architecture, the OSS picks for each layer, and how FutureAGI’s Apache 2.0 trace and eval libraries close the gap.
TL;DR: The six-layer OSS agent stack in 2026
| Layer | OSS picks (production-grade) | Pick when |
|---|---|---|
| Trace and evaluation | FutureAGI traceAI (Apache 2.0), FutureAGI ai-evaluation (Apache 2.0), OpenInference, Phoenix, Langfuse OSS | Always. Non-optional in 2026. |
| Tooling and integration | MCP servers, LangChain tools, custom function calling | Whenever the agent talks to anything external |
| Retrieval and memory | LlamaIndex, Haystack, Qdrant, Milvus, Weaviate, Chroma, Mem0, Letta | RAG, multi-turn agents, persistent memory |
| Agent framework | LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Microsoft Agent Framework, Mastra, Pydantic AI | Always |
| Model layer | Llama 4.x, Mistral, Qwen 3, DeepSeek-V3, Mixtral | Self-hosted or weight-portable |
| Model serving | vLLM, SGLang, Ollama, TensorRT-LLM, OpenLLM | When you serve the model yourself |
If you only read one row: the trace + eval layer is the one most 2025-era stacks miss. FutureAGI traceAI and ai-evaluation are Apache 2.0 and the recommended OSS pick for that layer regardless of which framework you build on top of.
Layer 1: Model serving
What it does: turns a model checkpoint into a callable HTTP endpoint with batching, KV cache, and latency control.
vLLM
The PagedAttention server. State of the art for throughput; production-grade. Continuous batching, KV-cache management, quantization (AWQ, GPTQ, FP8), tensor parallelism. The default when you serve open models at scale.
Repo: https://github.com/vllm-project/vllm
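Because vLLM speaks the OpenAI API, any OpenAI client can hit it directly. A minimal sketch, assuming a local server running the Llama 4 Scout model from the compose file later in this post:

```python
# Query a self-hosted vLLM server through its OpenAI-compatible endpoint.
# Base URL and model name are assumptions; match them to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```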
SGLang
The fast structured-output server. Strong on JSON-mode and tool-call workloads, often faster than vLLM on tool-using agents thanks to RadixAttention.
Repo: https://github.com/sgl-project/sglang
Ollama
Local-first, friction-free. The pick for dev laptops, edge boxes, and small deployments. Not production-throughput-grade but excellent UX.
Repo: https://github.com/ollama/ollama
TensorRT-LLM
NVIDIA’s heavily optimized stack. The fit when you have NVIDIA hardware and need every last token per second.
Docs: https://github.com/NVIDIA/TensorRT-LLM
OpenLLM
BentoML’s serving framework. Unified deployment surface across cloud providers; weaker on raw throughput but stronger on ops integration.
Repo: https://github.com/bentoml/OpenLLM
Pick: vLLM for scale, Ollama for local, SGLang when tool-call latency matters most.
Layer 2: The model
Open-weight models in 2026 are competitive with frontier closed models on most tasks. The picks:
Llama 4.x (Meta)
Meta’s 2025-2026 release. Scout (17B active / 109B total MoE) and Maverick (17B active / 400B total MoE) variants. Strong on multimodal and long context (Meta claims up to 10M tokens for Scout, 1M for Maverick). The default for general-purpose open agents.
Mistral and Mixtral (Mistral AI)
Mistral Large 2, Mistral Small 3, Mixtral 8x22B. Strong on European-language workloads and tool-calling. Apache 2.0 on Mixtral and Mistral Small; Mistral Large ships under the more restrictive Mistral Research License.
Qwen 3 (Alibaba)
The Qwen 3 family, rolled out through 2025, spans 0.6B dense models up to the 235B-A22B MoE. Strong on coding, multilingual, and Asian-language workloads. Apache 2.0 across the family.
DeepSeek-V3 and DeepSeek-R1 (DeepSeek)
Strong on reasoning and math. DeepSeek-R1 popularized open reasoning-trace models in early 2025. MoE architecture; cost-effective per token.
Gemma 3 (Google)
Google’s open-weight family, 1B to 27B. Strong on instruction following at small scale; the pick for resource-constrained deployments.
Pick: Llama 4.x for general agents, Qwen 3 for coding, DeepSeek-V3 for reasoning-heavy workloads, Mistral for European compliance.
Layer 3: The agent framework
This is where the most movement happened in 2025-2026.
LangGraph (LangChain)
Stateful, graph-based agent runtime. Built on LangChain, but the runtime contract is graph-of-nodes with explicit state. The production default for stateful agents in 2026.
Strengths: native persistence, step-by-step debugging, OTel tracing via traceAI, mature human-in-the-loop primitives.
Repo: https://github.com/langchain-ai/langgraph
CrewAI
Role-based multi-agent. Define a researcher agent, a writer agent, a reviewer agent; CrewAI handles delegation, sequencing, and state. Low ceremony; strong for content-pipeline and analyst-style workflows.
Repo: https://github.com/crewAIInc/crewAI
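A minimal role-based sketch, assuming the stock CrewAI API (the roles, goals, and tasks are illustrative; CrewAI calls your configured LLM, OpenAI by default):

```python
# Two-role crew: a researcher hands its output to a writer.
from crewai import Agent, Task, Crew

researcher = Agent(role="Researcher", goal="Collect facts on the topic",
                   backstory="A meticulous analyst.")
writer = Agent(role="Writer", goal="Turn research notes into a short brief",
               backstory="A concise technical writer.")

research = Task(description="Research open-weight MoE models.",
                expected_output="Five bullet points.", agent=researcher)
write = Task(description="Write a 100-word brief from the research notes.",
             expected_output="A 100-word brief.", agent=writer)

crew = Crew(agents=[researcher, writer], tasks=[research, write])
print(crew.kickoff())  # runs the tasks sequentially by default
```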
AutoGen (Microsoft)
Conversational multi-agent. AutoGen’s pattern is group chat between agents with roles. Still widely deployed, but the Microsoft Agent Framework (2025) is positioned as its modern successor.
Repo: https://github.com/microsoft/autogen
Microsoft Agent Framework
Microsoft’s 2025 unification of AutoGen and Semantic Kernel agent primitives. Stable runtime for multi-agent dispatch with .NET and Python bindings. The fit for Microsoft-stack enterprises.
Repo: https://github.com/microsoft/agent-framework
OpenAI Agents SDK
The OpenAI-native option. Tool-call loop built on the OpenAI Responses API. Lightest-weight; fastest path from notebook to production for OpenAI-only stacks.
Repo: https://github.com/openai/openai-agents-python
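The whole loop is a few lines, assuming the stock SDK (pip install openai-agents; the agent name and instructions are illustrative):

```python
# Minimal Agents SDK loop: one agent, one synchronous run.
from agents import Agent, Runner

agent = Agent(
    name="doc-qa",
    instructions="Answer questions about internal documents concisely.",
)
result = Runner.run_sync(agent, "What is our refund policy?")
print(result.final_output)
```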
Mastra
TypeScript-first agent framework. The pick for Node.js codebases that do not want to bolt onto Python. Strong on workflow primitives and integrations with TypeScript observability tooling.
Repo: https://github.com/mastra-ai/mastra
Pydantic AI
Typed Python agent framework on top of Pydantic. The fit for Python codebases that already use Pydantic for everything; native type safety on agent inputs and outputs.
Repo: https://github.com/pydantic/pydantic-ai
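A minimal typed sketch; the output model is illustrative, and note that the output-type parameter has been renamed across releases:

```python
# Typed agent: the model's answer is validated into a Pydantic model.
from pydantic import BaseModel
from pydantic_ai import Agent

class Answer(BaseModel):
    answer: str
    confidence: float

# output_type is the current parameter name; older releases called it result_type.
agent = Agent("openai:gpt-4o", output_type=Answer,
              system_prompt="Answer internal doc questions.")
result = agent.run_sync("What is our refund policy?")
print(result.output.answer, result.output.confidence)
```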
Pick: LangGraph for stateful production, CrewAI for role-based, Agents SDK for OpenAI-native, Mastra for TS, Pydantic AI for typed Python. The framework is replaceable; the trace + eval layer behind it is the constant.
For a deeper comparison, see Best Multi-Agent Frameworks 2026 and OSS Agent Frameworks 2026.
Layer 4: Retrieval and memory
LlamaIndex
The retrieval-first framework. Loaders for hundreds of data sources, index types beyond plain vector search (knowledge graphs, hierarchical, hybrid), AgentWorkflow for retrieval-led agents.
Repo: https://github.com/run-llama/llama_index
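The core loop in a few lines, assuming a local ./docs folder and the default OpenAI embedder (both are assumptions; swap in your own corpus and embedding model):

```python
# Load a folder, build a vector index, ask one question.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)  # default OpenAI embeddings
query_engine = index.as_query_engine()
print(query_engine.query("What is our refund policy?"))
```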
Haystack
deepset’s framework. Strong on enterprise RAG pipelines, document processing, evaluation primitives baked in.
Repo: https://github.com/deepset-ai/haystack
Vector databases
The picks: Qdrant (fast, Rust-native, strong on filters), Milvus (Zilliz, scale-out indexing), Weaviate (hybrid search + GraphQL), Chroma (lightweight, dev-first), pgvector (Postgres-native, ops-friendly).
For deep comparison see Best Vector Databases for RAG 2026.
Memory frameworks
Mem0, Letta (formerly MemGPT), and Zep are the picks for persistent agent memory. Each exposes memory retrieval through a different API; pair with FutureAGI traceAI so every memory.retrieve span is traced and can be scored for retrieval quality.
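For a flavor of the API surface, a minimal Mem0 sketch (assumes the mem0ai package with its default OpenAI-backed config; the stored fact is illustrative):

```python
# Persistent memory: store a fact for a user, retrieve it in a later turn.
from mem0 import Memory

m = Memory()
m.add("The user prefers answers with citations.", user_id="alice")
hits = m.search("How should answers be formatted?", user_id="alice")
print(hits)  # return shape varies across mem0 versions; inspect before indexing
```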
Pick: LlamaIndex when retrieval is the primary primitive; Haystack for enterprise document pipelines; Qdrant or pgvector for the vector layer; Mem0 or Letta when persistent memory matters.
Layer 5: Tooling and integration
Model Context Protocol (MCP)
The 2026 standard for connecting agents to tools and data sources. Anthropic-originated, now supported by OpenAI, Google, and the major frameworks. Python and TypeScript SDKs at https://github.com/modelcontextprotocol.
The pattern: MCP servers expose tools, resources, and prompts; MCP clients (any LLM agent) call them. The benefit is tool portability: one MCP server works with any framework that speaks MCP.
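A minimal MCP server using the Python SDK's FastMCP helper; the tool itself is an illustrative stub:

```python
# Minimal MCP server: exposes one tool any MCP-speaking agent can call.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-tools")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search internal documents for a query string."""
    # Stub; a real server would call your retriever here.
    return f"Top result for: {query}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```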
LangChain tools
The pre-MCP standard. Hundreds of pre-built integrations. Still widely used because the LangChain ecosystem is the largest; new tools tend to be exposed both as LangChain tools and as MCP servers.
Custom function calling
The lowest-level option. Define a function schema, register it with the framework, the model emits structured calls. Every major framework supports this directly.
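The raw shape, in the OpenAI tools format that most frameworks wrap (tool name and parameters are illustrative):

```python
# A bare function schema; the model emits structured calls against it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",
        "description": "Look up the status of a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticket_id": {"type": "string", "description": "Ticket ID, e.g. TK-1042"},
            },
            "required": ["ticket_id"],
        },
    },
}]
# Pass tools=tools to the chat completion call, then dispatch on the
# returned tool_calls and feed results back as "tool" role messages.
```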
Pick: MCP for tools you want to be portable across frameworks; LangChain tools when the integration already exists; custom function calling for proprietary tools.
Layer 6: Trace and evaluation (the most-skipped, highest-impact layer)
FutureAGI traceAI (Apache 2.0)
The OTel-based instrumentation layer for AI agents. Ships instrumentors for every major framework: LangChain, LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, LlamaIndex, Haystack, OpenAI, Anthropic, Google, Mistral, and more. One line of register() enables tracing; spans flow into FutureAGI’s backend or any OTel-compatible store.
License: Apache 2.0, verified at https://github.com/future-agi/traceAI/blob/main/LICENSE.
FutureAGI ai-evaluation (Apache 2.0)
The eval library: 50+ metric templates (72 local metrics as of the 1.1 release), including faithfulness, context_relevance, hallucination, instruction_following, brand_tone, and custom LLM-as-judge. Use it as a Python import (local metrics, sub-second) or via cloud evaluators (turing_flash ~1-2 s, turing_small ~2-3 s, turing_large ~3-5 s) with different latency-quality tradeoffs.
Repo: https://github.com/future-agi/ai-evaluation
OpenInference (Apache 2.0)
Arize’s instrumentation convention. Compatible with traceAI; the two libraries emit similar span shapes. Pick traceAI for the broader framework coverage and FutureAGI back-end integration; pick OpenInference when you are Phoenix-native.
Repo: https://github.com/Arize-ai/openinference
Phoenix (Arize)
Open-source LLM observability backend. Hosts spans, runs evaluators, serves dashboards. Apache 2.0; self-hostable. The Phoenix + OpenInference stack is the alternative if you do not use the FutureAGI back-end.
Repo: https://github.com/Arize-ai/phoenix
Langfuse (open core)
Open-source LLM observability with a paid cloud tier. Trace store, prompt management, evaluation harness. The fit if you want an OSS-with-managed-option stack comparable to FutureAGI’s surface.
Repo: https://github.com/langfuse/langfuse
Pick: traceAI + ai-evaluation (FutureAGI’s Apache 2.0 stack) for the broadest framework coverage and the most complete eval template library. OpenInference + Phoenix for the Arize-native path. Langfuse for an alternative open-core dashboard.
Reference architecture: one assembled stack
A working 2026 OSS agent stack:
```yaml
# docker-compose.yml (sketch)
# Real example for an OSS agent stack with FAGI trace + eval
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-4-Scout-17B-16E-Instruct
    ports: ["8000:8000"]
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
  agent:
    build: ./agent  # LangGraph + LlamaIndex + traceAI instrumented
    environment:
      FI_API_KEY: ${FI_API_KEY}
      FI_SECRET_KEY: ${FI_SECRET_KEY}
      VLLM_BASE: http://vllm:8000/v1
      QDRANT_HOST: qdrant
# Optional: self-hosted FutureAGI backend, or point traceAI at the cloud.
# See docs.futureagi.com for self-host instructions.
```
Inside agent/:
```python
# agent/main.py: LangGraph + traceAI + fi.evals (abbreviated wiring; assumes you define
# AgentState, retrieve_node, generate_node, and user_query for your own corpus and model)
# pip install "traceAI-langchain[langgraph]" ai-evaluation llama-index qdrant-client
from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor  # covers LangGraph via [langgraph] extra
from fi.evals import evaluate

# 1. Wire OTel tracing for LangChain/LangGraph
register(project_name="oss-agent-stack")
LangChainInstrumentor().instrument()

# 2. Define the agent (LangGraph). AgentState/retrieve_node/generate_node are user-defined
from langgraph.graph import StateGraph

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.add_edge("retrieve", "generate")
graph.set_entry_point("retrieve")  # LangGraph needs an explicit entry point
app = graph.compile()              # a StateGraph must be compiled before it can run

# 3. Run the agent
result = app.invoke({"query": user_query})

# 4. Score the result with fi.evals
score = evaluate(
    "faithfulness",
    output=result["answer"],
    context="\n".join(result["chunks"]),
)
print(f"Faithfulness: {score.score:.3f}")
```
That gives you the wiring for an OSS agent: framework, retrieval, model serving, and tracing/eval hooks, in a few dozen lines of Python plus a Docker Compose sketch. The compose file above is partial: traceAI by default ships spans to the Future AGI cloud (via FI_API_KEY / FI_SECRET_KEY), or to your own OTel backend if you set the standard OTel env vars. Add a guardrail layer (FAGI Protect, or NeMo Guardrails for the OSS-only path) and you have a production starter.
Why pick OSS over hosted in 2026
Three reasons.
First, control. The OSS stack runs in your VPC. No data leaves. Audits are straightforward. The compliance story for HIPAA, SOC 2, EU AI Act, and sectoral rules is materially simpler when the data plane is yours.
Second, cost predictability. Hosted platforms charge per seat or per event; OSS charges per CPU/GPU. For high-traffic workloads, the unit economics flip in OSS’s favor past a certain scale.
Third, portability. Frameworks come and go. Two years ago everyone was on raw LangChain; today LangGraph and Agents SDK are the production runtimes. The OTel + OpenInference convention and the eval-attached-span pattern are the stable layer; the framework is replaceable.
The risk is operational. Someone has to run vLLM. Someone has to keep the trace store on disk. Someone has to update the framework versions and re-run the eval suite. The hosted platforms (including FutureAGI’s managed service) exist for teams that want the same Apache 2.0 trace + eval back-end without the ops burden.
The truthful 2026 framing: the OSS components are production-ready. The integration is the cost. FutureAGI’s value proposition for OSS-first teams is the Apache 2.0 cores; for teams that want one-click integration, the managed platform.
Failure modes to avoid
Skipping the trace + eval layer
The most common 2025-era mistake, and still common in 2026. A stack with framework + model + retrieval but no traces and no evals is a black box. The first production incident is undebuggable: no trace to replay, no eval baseline to compare against.
The fix: wire traceAI on day one. Even before the agent works, the spans flow. When the agent breaks, you have the trace.
Coupling to one framework
A stack that uses LangChain primitives everywhere is hard to migrate when LangGraph or Agents SDK is a better fit. Keep the trace + eval layer framework-agnostic (it is, when you use traceAI’s instrumentors) and treat the framework as replaceable.
Self-hosting too much too fast
Running vLLM at scale is not trivial. Running a vector DB at scale is not trivial. Running an observability backend at scale is not trivial. Start by self-hosting the trace + eval (highest control value) and using managed endpoints for model serving and vector DB. Move to self-host as scale and confidence grow.
Mixing OTel conventions
traceAI emits OpenInference-shaped spans; some vendor SDKs emit their own. Mixing produces dashboards that report different metrics from the same agent run. Standardize on one convention (OpenInference) for the lifetime of the project.
For depth on observability picks, see Best Open Source LLM Observability 2026.
Where this is going in 2027
Three trends visible in mid-2026.
First, MCP becomes the universal tool layer. Most production agents will speak MCP by default; framework-specific tool registries fade.
Second, the model layer continues to commoditize. Open-weight MoE models reach frontier capability on most tasks; the differentiation shifts to the agent runtime and the trace + eval layer.
Third, the trace + eval back-end becomes the platform. Frameworks change every 12 to 18 months; the OTel + OpenInference + eval-attached-span pattern is the constant. Investment in this layer pays compounding returns.
How to start
If you are assembling an OSS agent stack in 2026:
- Pick the framework: LangGraph for stateful, Agents SDK for OpenAI-native, CrewAI for role-based.
- Pick the retriever: LlamaIndex with Qdrant or pgvector for most cases.
- Pick the model serving: vLLM for scale, Ollama for local.
- Wire traceAI (Apache 2.0). One line per framework via the matching instrumentor.
- Attach fi.evals templates (Apache 2.0). Score every retrieve, generate, and judge span.
- Add a guardrail. FAGI Protect for the integrated path, NeMo Guardrails for OSS-only.
- Ship behind a gateway. FAGI Agent Command Center for BYOK across 100+ providers, MCP for tool routing.
The full stack runs in a Docker Compose for prototypes and a Helm chart for production. The Apache 2.0 trace + eval libraries make the observability layer portable: change framework, change model, change retriever, the spans look the same.
Sources
- LangGraph: https://github.com/langchain-ai/langgraph
- CrewAI: https://github.com/crewAIInc/crewAI
- AutoGen: https://github.com/microsoft/autogen
- Microsoft Agent Framework: https://github.com/microsoft/agent-framework
- OpenAI Agents SDK: https://github.com/openai/openai-agents-python
- Mastra: https://github.com/mastra-ai/mastra
- Pydantic AI: https://github.com/pydantic/pydantic-ai
- LlamaIndex: https://github.com/run-llama/llama_index
- Haystack: https://github.com/deepset-ai/haystack
- vLLM: https://github.com/vllm-project/vllm
- SGLang: https://github.com/sgl-project/sglang
- Ollama: https://github.com/ollama/ollama
- Model Context Protocol: https://github.com/modelcontextprotocol
- FutureAGI traceAI (Apache 2.0): https://github.com/future-agi/traceAI
- FutureAGI ai-evaluation (Apache 2.0): https://github.com/future-agi/ai-evaluation
- OpenInference (Apache 2.0): https://github.com/Arize-ai/openinference
- Phoenix: https://github.com/Arize-ai/phoenix
- Langfuse: https://github.com/langfuse/langfuse