The Open-Source AI Agent Stack in 2026: LangChain, LangGraph, CrewAI, AutoGen, Agents SDK, MS Agent Framework, Mastra Plus FAGI traceAI Apache 2.0
An engineering team builds an AI agent for internal document Q&A. They wire LangGraph, LlamaIndex with Qdrant, Ollama running Llama 3.1, and ship to a small pilot. Six weeks in, an executive asks two questions: which tool calls are failing, and how do we know the answers are not hallucinated. The team has no answer for either. The framework runs; the model serves; the retriever retrieves; there is no trace layer and no eval layer. This is the 2025-vintage open-source AI agent stack: four of the six layers, missing the two that matter most for production. This post is the 2026 picture: the six-layer reference architecture, the OSS picks for each layer, and how FutureAGI’s Apache 2.0 trace and eval libraries close the gap.
TL;DR: The six-layer OSS agent stack in 2026
| Layer | OSS picks (production-grade) | Pick when |
|---|---|---|
| Trace and evaluation | FutureAGI traceAI (Apache 2.0), FutureAGI ai-evaluation (Apache 2.0), OpenInference, Phoenix, Langfuse OSS | Always. Non-optional in 2026. |
| Tooling and integration | MCP servers, LangChain tools, custom function calling | Whenever the agent talks to anything external |
| Retrieval and memory | LlamaIndex, Haystack, Qdrant, Milvus, Weaviate, Chroma, Mem0, Letta | RAG, multi-turn agents, persistent memory |
| Agent framework | LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Microsoft Agent Framework, Mastra, Pydantic AI | Always |
| Model layer | Llama 4.x, Mistral, Qwen 3, DeepSeek-V3, Mixtral | Self-hosted or weight-portable |
| Model serving | vLLM, SGLang, Ollama, TensorRT-LLM, OpenLLM | When you serve the model yourself |
If you only read one row: the trace + eval layer is the one most 2025-era stacks miss. FutureAGI traceAI and ai-evaluation are Apache 2.0 and the recommended OSS pick for that layer regardless of which framework you build on top of.
Layer 1: Model serving
What it does: turns a model checkpoint into a callable HTTP endpoint with batching, KV cache, and latency control.
vLLM
The PagedAttention server. State of the art for throughput; production-grade. Continuous batching, KV-cache management, quantization (AWQ, GPTQ, FP8), tensor parallelism. The default when you serve open models at scale.
Repo: https://github.com/vllm-project/vllm
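Because vLLM speaks the OpenAI API, any OpenAI client can hit it directly. A minimal sketch, assuming a local server running the Llama 4 Scout model from the compose file later in this post:

```python
# Query a self-hosted vLLM server through its OpenAI-compatible endpoint.
# Base URL and model name are assumptions; match them to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```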
SGLang
The fast structured-output server. Strong on JSON-mode and tool-call workloads, often faster than vLLM on tool-using agents thanks to RadixAttention.
Repo: https://github.com/sgl-project/sglang
Ollama
Local-first, friction-free. The pick for dev laptops, edge boxes, and small deployments. Not production-throughput-grade but excellent UX.
Repo: https://github.com/ollama/ollama
TensorRT-LLM
NVIDIA’s heavily optimized stack. The fit when you have NVIDIA hardware and need every last token per second.
Docs: https://github.com/NVIDIA/TensorRT-LLM
OpenLLM
BentoML’s serving framework. Unified deployment surface across cloud providers; weaker on raw throughput but stronger on ops integration.
Repo: https://github.com/bentoml/OpenLLM
Pick: vLLM for scale, Ollama for local, SGLang when tool-call latency matters most.
Layer 2: The model
Open-weight models in 2026 are competitive with frontier closed models on most tasks. The picks:
Llama 4.x (Meta)
Meta’s 2025-2026 release. Scout (17B active / 109B total MoE) and Maverick (17B active / 400B total MoE) variants. Strong on multimodal and long context (Meta claims up to 10M tokens for Scout, 1M for Maverick). The default for general-purpose open agents.
Mistral and Mixtral (Mistral AI)
Mistral Large 2, Mistral Small 3, Mixtral 8x22B. Strong on European-language workloads and tool-calling. Apache 2.0 on Mixtral and Mistral Small; Mistral Large ships under the more restrictive Mistral Research License.
Qwen 3 (Alibaba)
The Qwen 3 family, rolled out through 2025, spans 0.6B dense models up to the 235B-A22B MoE. Strong on coding, multilingual, and Asian-language workloads. Apache 2.0 across the family.
DeepSeek-V3 and DeepSeek-R1 (DeepSeek)
Strong on reasoning and math. DeepSeek-R1 popularized open reasoning-trace models in early 2025. MoE architecture; cost-effective per token.
Gemma 3 (Google)
Google’s open-weight family, 1B to 27B. Strong on instruction following at small scale; the pick for resource-constrained deployments.
Pick: Llama 4.x for general agents, Qwen 3 for coding, DeepSeek-V3 for reasoning-heavy workloads, Mistral for European compliance.
Layer 3: The agent framework
This is where the most movement happened in 2025-2026.
LangGraph (LangChain)
Stateful, graph-based agent runtime. Built on LangChain, but the runtime contract is graph-of-nodes with explicit state. The production default for stateful agents in 2026.
Strengths: native persistence, step-by-step debugging, OTel tracing via traceAI, mature human-in-the-loop primitives.
Repo: https://github.com/langchain-ai/langgraph
CrewAI
Role-based multi-agent. Define a researcher agent, a writer agent, a reviewer agent; CrewAI handles delegation, sequencing, and state. Low ceremony; strong for content-pipeline and analyst-style workflows.
Repo: https://github.com/crewAIInc/crewAI
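A minimal role-based sketch, assuming the stock CrewAI API (the roles, goals, and tasks are illustrative; CrewAI calls your configured LLM, OpenAI by default):

```python
# Two-role crew: a researcher hands its output to a writer.
from crewai import Agent, Task, Crew

researcher = Agent(role="Researcher", goal="Collect facts on the topic",
                   backstory="A meticulous analyst.")
writer = Agent(role="Writer", goal="Turn research notes into a short brief",
               backstory="A concise technical writer.")

research = Task(description="Research open-weight MoE models.",
                expected_output="Five bullet points.", agent=researcher)
write = Task(description="Write a 100-word brief from the research notes.",
             expected_output="A 100-word brief.", agent=writer)

crew = Crew(agents=[researcher, writer], tasks=[research, write])
print(crew.kickoff())  # runs the tasks sequentially by default
```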
AutoGen (Microsoft)
Conversational multi-agent. AutoGen’s pattern is group chat between agents with roles. Still widely deployed, but the Microsoft Agent Framework (2025) is positioned as its modern successor.
Repo: https://github.com/microsoft/autogen
Microsoft Agent Framework
Microsoft’s 2025 unification of AutoGen and Semantic Kernel agent primitives. Stable runtime for multi-agent dispatch with .NET and Python bindings. The fit for Microsoft-stack enterprises.
Repo: https://github.com/microsoft/agent-framework
OpenAI Agents SDK
The OpenAI-native option. Tool-call loop built on the OpenAI Responses API. Lightest-weight; fastest path from notebook to production for OpenAI-only stacks.
Repo: https://github.com/openai/openai-agents-python
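The whole loop is a few lines, assuming the stock SDK (pip install openai-agents; the agent name and instructions are illustrative):

```python
# Minimal Agents SDK loop: one agent, one synchronous run.
from agents import Agent, Runner

agent = Agent(
    name="doc-qa",
    instructions="Answer questions about internal documents concisely.",
)
result = Runner.run_sync(agent, "What is our refund policy?")
print(result.final_output)
```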
Mastra
TypeScript-first agent framework. The pick for Node.js codebases that do not want to bolt onto Python. Strong on workflow primitives and integrations with TypeScript observability tooling.
Repo: https://github.com/mastra-ai/mastra
Pydantic AI
Typed Python agent framework on top of Pydantic. The fit for Python codebases that already use Pydantic for everything; native type safety on agent inputs and outputs.
Repo: https://github.com/pydantic/pydantic-ai
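A minimal typed sketch; the output model is illustrative, and note that the output-type parameter has been renamed across releases:

```python
# Typed agent: the model's answer is validated into a Pydantic model.
from pydantic import BaseModel
from pydantic_ai import Agent

class Answer(BaseModel):
    answer: str
    confidence: float

# output_type is the current parameter name; older releases called it result_type.
agent = Agent("openai:gpt-4o", output_type=Answer,
              system_prompt="Answer internal doc questions.")
result = agent.run_sync("What is our refund policy?")
print(result.output.answer, result.output.confidence)
```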
Pick: LangGraph for stateful production, CrewAI for role-based, Agents SDK for OpenAI-native, Mastra for TS, Pydantic AI for typed Python. The framework is replaceable; the trace + eval layer behind it is the constant.
For a deeper comparison, see Best Multi-Agent Frameworks 2026 and OSS Agent Frameworks 2026.
Layer 4: Retrieval and memory
LlamaIndex
The retrieval-first framework. Loaders for hundreds of data sources, index types beyond plain vector search (knowledge graphs, hierarchical, hybrid), AgentWorkflow for retrieval-led agents.
Repo: https://github.com/run-llama/llama_index
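The core loop in a few lines, assuming a local ./docs folder and the default OpenAI embedder (both are assumptions; swap in your own corpus and embedding model):

```python
# Load a folder, build a vector index, ask one question.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)  # default OpenAI embeddings
query_engine = index.as_query_engine()
print(query_engine.query("What is our refund policy?"))
```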
Haystack
deepset’s framework. Strong on enterprise RAG pipelines, document processing, evaluation primitives baked in.
Repo: https://github.com/deepset-ai/haystack
Vector databases
The picks: Qdrant (fast, Rust-native, strong on filters), Milvus (Zilliz, scale-out indexing), Weaviate (hybrid search + GraphQL), Chroma (lightweight, dev-first), pgvector (Postgres-native, ops-friendly).
For deep comparison see Best Vector Databases for RAG 2026.
Memory frameworks
Mem0, Letta (formerly MemGPT), and Zep are the picks for persistent agent memory. Each exposes memory retrieval through a different API; pair with FutureAGI traceAI so every memory.retrieve span is traced and can be scored for retrieval quality.
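For a flavor of the API surface, a minimal Mem0 sketch (assumes the mem0ai package with its default OpenAI-backed config; the stored fact is illustrative):

```python
# Persistent memory: store a fact for a user, retrieve it in a later turn.
from mem0 import Memory

m = Memory()
m.add("The user prefers answers with citations.", user_id="alice")
hits = m.search("How should answers be formatted?", user_id="alice")
print(hits)  # return shape varies across mem0 versions; inspect before indexing
```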
Pick: LlamaIndex when retrieval is the primary primitive; Haystack for enterprise document pipelines; Qdrant or pgvector for the vector layer; Mem0 or Letta when persistent memory matters.
Layer 5: Tooling and integration
Model Context Protocol (MCP)
The 2026 standard for connecting agents to tools and data sources. Anthropic-originated, now supported by OpenAI, Google, and the major frameworks. Python and TypeScript SDKs at https://github.com/modelcontextprotocol.
The pattern: MCP servers expose tools, resources, and prompts; MCP clients (any LLM agent) call them. The benefit is tool portability: one MCP server works with any framework that speaks MCP.
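A minimal MCP server using the Python SDK's FastMCP helper; the tool itself is an illustrative stub:

```python
# Minimal MCP server: exposes one tool any MCP-speaking agent can call.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-tools")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search internal documents for a query string."""
    # Stub; a real server would call your retriever here.
    return f"Top result for: {query}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```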
LangChain tools
The pre-MCP standard. Hundreds of pre-built integrations. Still widely used because the LangChain ecosystem is the largest; new tools tend to be exposed both as LangChain tools and as MCP servers.
Custom function calling
The lowest-level option. Define a function schema, register it with the framework, the model emits structured calls. Every major framework supports this directly.
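The raw shape, in the OpenAI tools format that most frameworks wrap (tool name and parameters are illustrative):

```python
# A bare function schema; the model emits structured calls against it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",
        "description": "Look up the status of a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticket_id": {"type": "string", "description": "Ticket ID, e.g. TK-1042"},
            },
            "required": ["ticket_id"],
        },
    },
}]
# Pass tools=tools to the chat completion call, then dispatch on the
# returned tool_calls and feed results back as "tool" role messages.
```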
Pick: MCP for tools you want to be portable across frameworks; LangChain tools when the integration already exists; custom function calling for proprietary tools.
Layer 6: Trace and evaluation (the most-skipped, highest-impact layer)
FutureAGI traceAI (Apache 2.0)
The OTel-based instrumentation layer for AI agents. Ships instrumentors for every major framework: LangChain, LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, LlamaIndex, Haystack, OpenAI, Anthropic, Google, Mistral, and more. One line of register() enables tracing; spans flow into FutureAGI’s backend or any OTel-compatible store.
License: Apache 2.0, verified at https://github.com/future-agi/traceAI/blob/main/LICENSE.
FutureAGI ai-evaluation (Apache 2.0)
The eval library: 50+ metric templates (72 local metrics as of the 1.1 release), including faithfulness, context_relevance, hallucination, instruction_following, brand_tone, and custom LLM-as-judge. Use it as a Python import (local metrics, sub-second) or via cloud evaluators (turing_flash ~1-2 s, turing_small ~2-3 s, turing_large ~3-5 s) with different latency-quality tradeoffs.
Repo: https://github.com/future-agi/ai-evaluation
OpenInference (Apache 2.0)
Arize’s instrumentation convention. Compatible with traceAI; the two libraries emit similar span shapes. Pick traceAI for the broader framework coverage and FutureAGI back-end integration; pick OpenInference when you are Phoenix-native.
Repo: https://github.com/Arize-ai/openinference
Phoenix (Arize)
Open-source LLM observability backend. Hosts spans, runs evaluators, serves dashboards. Apache 2.0; self-hostable. The Phoenix + OpenInference stack is the alternative if you do not use the FutureAGI back-end.
Repo: https://github.com/Arize-ai/phoenix
Langfuse (open core)
Open-source LLM observability with a paid cloud tier. Trace store, prompt management, evaluation harness. The fit if you want an OSS-with-managed-option stack comparable to FutureAGI’s surface.
Repo: https://github.com/langfuse/langfuse
Pick: traceAI + ai-evaluation (FutureAGI’s Apache 2.0 stack) for the broadest framework coverage and the most complete eval template library. OpenInference + Phoenix for the Arize-native path. Langfuse for an alternative open-core dashboard.
Reference architecture: one assembled stack
A working 2026 OSS agent stack:
```yaml
# docker-compose.yml (sketch)
# Real example for an OSS agent stack with FAGI trace + eval
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-4-Scout-17B-16E-Instruct
    ports: ["8000:8000"]
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
  agent:
    build: ./agent  # LangGraph + LlamaIndex + traceAI instrumented
    environment:
      FI_API_KEY: ${FI_API_KEY}
      FI_SECRET_KEY: ${FI_SECRET_KEY}
      VLLM_BASE: http://vllm:8000/v1
      QDRANT_HOST: qdrant
# Optional: self-hosted FutureAGI backend, or point traceAI at the cloud.
# See docs.futureagi.com for self-host instructions.
```
Inside agent/:
```python
# agent/main.py: LangGraph + traceAI + fi.evals (abbreviated wiring; assumes you define
# AgentState, retrieve_node, generate_node, and user_query for your own corpus and model)
# pip install "traceAI-langchain[langgraph]" ai-evaluation llama-index qdrant-client
from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor  # covers LangGraph via [langgraph] extra
from fi.evals import evaluate

# 1. Wire OTel tracing for LangChain/LangGraph
register(project_name="oss-agent-stack")
LangChainInstrumentor().instrument()

# 2. Define the agent (LangGraph). AgentState/retrieve_node/generate_node are user-defined
from langgraph.graph import StateGraph

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.add_edge("retrieve", "generate")
graph.set_entry_point("retrieve")  # LangGraph needs an explicit entry point
app = graph.compile()              # a StateGraph must be compiled before it can run

# 3. Run the agent
result = app.invoke({"query": user_query})

# 4. Score the result with fi.evals
score = evaluate(
    "faithfulness",
    output=result["answer"],
    context="\n".join(result["chunks"]),
)
print(f"Faithfulness: {score.score:.3f}")
```
That gives you the wiring for an OSS agent: framework, retrieval, model serving, and tracing/eval hooks, in a few dozen lines of Python plus a Docker Compose sketch. The compose file above is partial: traceAI by default ships spans to the Future AGI cloud (via FI_API_KEY / FI_SECRET_KEY), or to your own OTel backend if you set the standard OTel env vars. Add a guardrail layer (FAGI Protect, or NeMo Guardrails for the OSS-only path) and you have a production starter.
Why pick OSS over hosted in 2026
Three reasons.
First, control. The OSS stack runs in your VPC. No data leaves. Audits are straightforward. The compliance story for HIPAA, SOC 2, EU AI Act, and sectoral rules is materially simpler when the data plane is yours.
Second, cost predictability. Hosted platforms charge per seat or per event; OSS charges per CPU/GPU. For high-traffic workloads, the unit economics flip in OSS’s favor past a certain scale.
Third, portability. Frameworks come and go. Two years ago everyone was on raw LangChain; today LangGraph and Agents SDK are the production runtimes. The OTel + OpenInference convention and the eval-attached-span pattern are the stable layer; the framework is replaceable.
The risk is operational. Someone has to run vLLM. Someone has to keep the trace store on disk. Someone has to update the framework versions and re-run the eval suite. The hosted platforms (including FutureAGI’s managed service) exist for teams that want the same Apache 2.0 trace + eval back-end without the ops burden.
The truthful 2026 framing: the OSS components are production-ready. The integration is the cost. FutureAGI’s value proposition for OSS-first teams is the Apache 2.0 cores; for teams that want one-click integration, the managed platform.
Failure modes to avoid
Skipping the trace + eval layer
The most common 2025-era mistake, and still common in 2026. A stack with framework + model + retrieval but no traces and no evals is a black box. The first production incident is undebuggable: no trace to replay, no eval baseline to compare against.
The fix: wire traceAI on day one. Even before the agent works, the spans flow. When the agent breaks, you have the trace.
Coupling to one framework
A stack that uses LangChain primitives everywhere is hard to migrate when LangGraph or Agents SDK is a better fit. Keep the trace + eval layer framework-agnostic (it is, when you use traceAI’s instrumentors) and treat the framework as replaceable.
Self-hosting too much too fast
Running vLLM at scale is not trivial. Running a vector DB at scale is not trivial. Running an observability backend at scale is not trivial. Start by self-hosting the trace + eval (highest control value) and using managed endpoints for model serving and vector DB. Move to self-host as scale and confidence grow.
Mixing OTel conventions
traceAI emits OpenInference-shaped spans; some vendor SDKs emit their own. Mixing produces dashboards that report different metrics from the same agent run. Standardize on one convention (OpenInference) for the lifetime of the project.
For depth on observability picks, see Best Open Source LLM Observability 2026.
Where this is going in 2027
Three trends visible in mid-2026.
First, MCP becomes the universal tool layer. Most production agents will speak MCP by default; framework-specific tool registries fade.
Second, the model layer continues to commoditize. Open-weight MoE models reach frontier capability on most tasks; the differentiation shifts to the agent runtime and the trace + eval layer.
Third, the trace + eval back-end becomes the platform. Frameworks change every 12 to 18 months; the OTel + OpenInference + eval-attached-span pattern is the constant. Investment in this layer pays compounding returns.
How to start
If you are assembling an OSS agent stack in 2026:
- Pick the framework: LangGraph for stateful, Agents SDK for OpenAI-native, CrewAI for role-based.
- Pick the retriever: LlamaIndex with Qdrant or pgvector for most cases.
- Pick the model serving: vLLM for scale, Ollama for local.
- Wire traceAI (Apache 2.0). One line per framework via the matching instrumentor.
- Attach fi.evals templates (Apache 2.0). Score every retrieve, generate, and judge span.
- Add a guardrail. FAGI Protect for the integrated path, NeMo Guardrails for OSS-only.
- Ship behind a gateway. FAGI Agent Command Center for BYOK across 100+ providers, MCP for tool routing.
The full stack runs in a Docker Compose for prototypes and a Helm chart for production. The Apache 2.0 trace + eval libraries make the observability layer portable: change framework, change model, change retriever, the spans look the same.
Sources
- LangGraph: https://github.com/langchain-ai/langgraph
- CrewAI: https://github.com/crewAIInc/crewAI
- AutoGen: https://github.com/microsoft/autogen
- Microsoft Agent Framework: https://github.com/microsoft/agent-framework
- OpenAI Agents SDK: https://github.com/openai/openai-agents-python
- Mastra: https://github.com/mastra-ai/mastra
- Pydantic AI: https://github.com/pydantic/pydantic-ai
- LlamaIndex: https://github.com/run-llama/llama_index
- Haystack: https://github.com/deepset-ai/haystack
- vLLM: https://github.com/vllm-project/vllm
- SGLang: https://github.com/sgl-project/sglang
- Ollama: https://github.com/ollama/ollama
- Model Context Protocol: https://github.com/modelcontextprotocol
- FutureAGI traceAI (Apache 2.0): https://github.com/future-agi/traceAI
- FutureAGI ai-evaluation (Apache 2.0): https://github.com/future-agi/ai-evaluation
- OpenInference (Apache 2.0): https://github.com/Arize-ai/openinference
- Phoenix: https://github.com/Arize-ai/phoenix
- Langfuse: https://github.com/langfuse/langfuse