What Is the ML Stack?
The chosen set of tools that implement an ML or LLM architecture, covering training, serving, retrieval, gateway, evaluation, and observability.
What Is the ML Stack?
The ML stack is the set of tools a team picks to implement an ML or LLM architecture. It covers training frameworks, inference engines, retrieval systems, orchestration, gateways, evaluation, observability, and data pipelines. The stack is distinct from architecture (the design of components and edges) and from infrastructure (the compute, network, and storage layer). For LLM and agent systems, a 2026-era stack typically includes Hugging Face, vLLM, LangChain or LlamaIndex, a vector store, an LLM gateway, and FutureAGI for evaluation and tracing.
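One way to make the stack explicit is to pin it as a manifest that names the chosen tool per layer. The sketch below is hypothetical; the layer names and example tools follow the definition above.

```python
# Hypothetical stack manifest: one chosen tool per layer of the
# architecture. Tool names follow the 2026-era examples above.
ml_stack = {
    "training": "huggingface-transformers",
    "serving": "vllm",
    "orchestration": "langchain",
    "retrieval": "vector-store",   # e.g., the team's vector DB of choice
    "gateway": "llm-gateway",
    "evaluation": "futureagi",
    "observability": "traceai",
}

# Keeping the manifest in version control turns stack drift into a
# reviewable diff instead of a silent change.
```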
Why It Matters in Production LLM/Agent Systems
Stack drift is one of the most common silent reliability risks. A team adopts a new vector store for cost reasons, but ContextRelevance is never re-baselined. A new orchestration library changes retry behavior, and tool-timeout incidents creep upward. A model-serving runtime swap changes tokenization, so prompts drift in unexpected ways. None of these are bad choices in isolation; they become incidents when the stack changes without a measured comparison.
Developers see the pain when a familiar bug reappears after a stack swap. SREs see new failure modes such as cold-start latency, queue saturation, or unexpected retries. Product teams see inconsistent behavior across routes that share an architecture but use different stack components. Security teams see new dependency risk after each addition; the LiteLLM compromise from earlier in 2026 showed how stack pinning, audit logs, and gateway choices matter beyond performance.
Agentic systems multiply stack surface area. A planner may use one framework, tool execution another, retrieval a third, and generation a fourth. Model Context Protocol (MCP) and Agent2Agent (A2A) adoption in 2026 added still more stack edges: MCP servers, A2A endpoints, identity propagation, and tool registries all become parts of the stack to track. A useful ML stack is one in which every component emits traces and supports evaluation.
How FutureAGI Handles the ML Stack
This glossary term has no single anchor: the ML stack is a portfolio of tools, not one FutureAGI evaluator. FutureAGI’s approach is to plug into the stack at the layers that decide reliability. traceAI integrations cover LangChain, LlamaIndex, the OpenAI Agent SDK, CrewAI, AutoGen, Pydantic AI, and others, so the stack emits OTel-compatible spans with `agent.trajectory.step`, `llm.token_count.prompt`, and tool-call attributes. Agent Command Center sits in front of the stack as the gateway, where `routing-policy: cost-optimized`, `semantic-cache`, `pre-guardrail`, `post-guardrail`, `model fallback`, and `traffic-mirroring` are first-class primitives.
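As an illustration of what those spans carry, here is a minimal sketch using the standard OpenTelemetry Python API. The attribute names are the ones listed above; the values are placeholders that a real traceAI integration would fill automatically.

```python
from opentelemetry import trace

tracer = trace.get_tracer("traceai-example")

# Placeholder values; an instrumented framework sets these per step.
with tracer.start_as_current_span("agent.step") as span:
    span.set_attribute("agent.trajectory.step", 3)
    span.set_attribute("llm.token_count.prompt", 412)
    span.set_attribute("gen_ai.server.time_to_first_token", 0.18)
```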
A real workflow begins when an LLM team adopts a new vector store. Before flipping traffic, they run traffic-mirroring from the gateway and grade both the old and new paths with ContextRelevance and Groundedness against the same dataset rows; a sketch of that comparison follows below. Cost is tracked via `llm.token_count.prompt` and route-level cost dashboards. If the new vector store improves latency but ContextRelevance drops on policy-rewrite rows, the stack change is held until the retriever is tuned. Unlike Weights and Biases, which centers on experiment tracking, FutureAGI grades the live stack against row-level eval evidence.
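A sketch of that hold/ship decision, under two assumptions: ContextRelevance exposes the same `evaluate()` shape as the Groundedness snippet later on this page, and the mirrored rows arrive as dicts pairing each dataset input's context with the old-path or new-path answer.

```python
from fi.evals import ContextRelevance

# Hypothetical mirrored rows: same dataset inputs, answered by the old
# and new retrieval paths. Real rows would come from traffic-mirroring.
rows_old = [{"answer": "old-path answer", "context": "retrieved context"}]
rows_new = [{"answer": "new-path answer", "context": "retrieved context"}]

# Assumption: evaluate(response=..., context=...) mirrors the
# Groundedness usage shown below.
metric = ContextRelevance()

def mean_score(rows):
    scores = [metric.evaluate(response=r["answer"], context=r["context"]).score
              for r in rows]
    return sum(scores) / len(scores)

baseline = mean_score(rows_old)
candidate = mean_score(rows_new)

# Hold the stack change when the candidate path regresses on relevance.
print("ship" if candidate >= baseline else "hold")
```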
How to Measure or Detect It
Measure ML stack health as a layered set of signals tied to specific tools:
- Eval-fail-rate per stack component: split failures across retrieval, generation, gateway, and tool layers (a sketch follows this list).
- Span coverage per stack tool: percent of components that emit `traceAI` spans with required fields.
- `gen_ai.server.time_to_first_token`: serving-layer responsiveness; alert by route and by underlying engine.
- Cost per trace: aggregate token cost mapped back to gateway routes and stack components.
- Cache and fallback events: rate of `semantic-cache` hits, `model fallback`, and retry events per stack path.
- Dependency posture: pinned versions of orchestration, gateway, and serving components, plus audit-log coverage.
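One hypothetical way to compute the first signal, eval-fail-rate per stack component, assuming spans have already been exported as records labeled with a stack layer and a pass/fail eval outcome:

```python
from collections import Counter

# Hypothetical span records: one dict per graded span, labeled with the
# stack layer that produced it and whether its eval passed.
spans = [
    {"component": "retrieval", "eval_passed": False},
    {"component": "generation", "eval_passed": True},
    {"component": "gateway", "eval_passed": True},
    {"component": "retrieval", "eval_passed": True},
]

totals, fails = Counter(), Counter()
for s in spans:
    totals[s["component"]] += 1
    if not s["eval_passed"]:
        fails[s["component"]] += 1

# Eval-fail-rate split by stack layer, as in the first bullet above.
for component in totals:
    print(component, fails[component] / totals[component])
```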
To grade a single row, a Groundedness evaluator from the `fi.evals` SDK can run in place; `answer`, `context`, and the trace metadata are assumed to be bound from the row being graded:

```python
from fi.evals import Groundedness

# answer, context, stack_path, trace_id, and ttft_ms are assumed to
# come from the trace row under inspection.
metric = Groundedness()
result = metric.evaluate(response=answer, context=context)
print(stack_path, trace_id, ttft_ms, result.score)
```
Common Mistakes
- Equating stack with architecture. Listing “PyTorch, vLLM, LangChain, Pinecone” is not an architecture; without component edges and contracts, you have a tool list.
- Swapping stack tools without baselines. A new vector store, gateway, or framework needs ContextRelevance and Groundedness baselines before traffic shift.
- Ignoring orchestration retry semantics. Different agent frameworks treat retries, timeouts, and tool failures differently; this is where most agent-loop bugs hide.
- Coupling stack pieces tightly. Hard-coding a specific provider into application logic blocks gateway-level fallbacks and traffic mirroring; see the sketch after this list.
- Skipping audit logs at the gateway layer. Without `gateway:audit-logs`, stack-level incidents lose their forensic timeline.
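To illustrate the coupling point, a sketch of the decoupled pattern: the application talks to a gateway base URL through a standard client, so fallback and mirroring stay gateway-level concerns. The endpoint and route name here are hypothetical.

```python
from openai import OpenAI

# Hypothetical gateway endpoint; the app never names a provider
# directly, so model fallback and traffic-mirroring remain
# gateway-level policies rather than application code.
client = OpenAI(
    base_url="https://gateway.internal.example/v1",  # hypothetical URL
    api_key="GATEWAY_TOKEN",
)

resp = client.chat.completions.create(
    model="default-route",  # a gateway route, not a provider model id
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```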
Frequently Asked Questions
What is the ML stack?
The ML stack is the set of tools that implement a machine learning or LLM architecture. It covers training, serving, retrieval, orchestration, gateways, evaluation, and observability, and it is distinct from the architecture design and the underlying infrastructure.
How is the ML stack different from ML architecture?
ML architecture is the component and data-flow design. The ML stack is the toolchain that implements that design. Two teams can share one architecture and pick different stacks; one team can swap stack choices without changing the architecture.
How do you evaluate an ML stack choice?
FutureAGI evaluates a stack choice by attaching evaluators such as Groundedness, ContextRelevance, and TaskCompletion to traces produced by the stack and tracking p99 latency, fallback rate, and eval-fail-rate per route across `traceAI` integrations.