How is Spring AI different from LangChain?

LangChain is a cross-language agent and LLM application framework. Spring AI is built for Java teams already using Spring services, dependency injection, configuration, and production deployment patterns.

How do you measure Spring AI reliability?

FutureAGI measures Spring AI through traceAI-spring-ai spans, agent.trajectory.step fields, and evaluators such as ToolSelectionAccuracy, TaskCompletion, Groundedness, and ContextRelevance.

What Is Spring AI? FutureAGI Guide (2026)

What Is Spring AI?

Spring AI is a Java and Spring framework for building LLM applications and agent workflows with ChatClient, advisors, tool calling, structured outputs, vector stores, and provider abstractions. It is an agent framework term because it controls how Spring services ask models for work, retrieve context, call tools, and return typed results. In production, it shows up as Java application traces, model spans, advisor chains, and tool-call decisions. FutureAGI evaluates those traces through traceAI-spring-ai for task success, grounding, cost, and unsafe actions.

By 2026, Spring AI is the default LLM framework for Java teams already running Spring Boot in production. It supports the major frontier providers (OpenAI, Anthropic, Google, Amazon Bedrock), the major vector stores, and, increasingly, MCP as a tool-discovery protocol. For Spring teams the main alternative is LangChain4j; outside Java, the usual alternatives are LangGraph or Strands Agents.

Why Spring AI matters in production LLM and agent systems

Spring AI hides LLM complexity inside normal Spring services, which is useful for adoption and risky when an AI call starts taking actions. A support service can attach a default tool to every ChatClient, let a QuestionAnswerAdvisor retrieve stale policy context, or return a Java entity that deserializes while violating a business rule. The downstream failure looks like an ordinary application bug: wrong refund status, wrong email, high model bill, or repeated tool calls.

Developers feel the pain when local prompts pass but production runs diverge after a new advisor, tool, model, or system prompt ships. SREs see p99 latency climb because a ChatClient path calls a slow tool twice. Product teams see completion rates drop in one cohort. Compliance teams ask why a tool was exposed to a prompt that should have been read-only.

The symptoms usually appear as rising token-cost-per-trace, repeated agent.trajectory.step values, tool-timeout spikes, higher retry counts, advisor-chain changes, or low eval scores after a Java release. This matters more in 2026 multi-step systems because Spring AI often sits inside transaction-heavy Java services with vector stores, MCP servers, task queues, and gateway policies. Unlike a LangSmith-style manual trace review, a Spring AI production workflow needs repeatable scoring on the trace itself.
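One of the symptoms above, repeated agent.trajectory.step values, can be checked mechanically once spans are exported. The following is a minimal sketch, not FutureAGI's implementation: it assumes each span is a plain dict and that agent.trajectory.step holds a step label string. Both the span shape and the example step names are hypothetical.

```python
from collections import Counter


def repeated_steps(spans, threshold=2):
    """Return step labels that occur at least `threshold` times in one trace."""
    counts = Counter(
        span["agent.trajectory.step"]
        for span in spans
        if "agent.trajectory.step" in span
    )
    return {step: n for step, n in counts.items() if n >= threshold}


# Hypothetical exported spans for a single trace (assumed shape).
example_trace = [
    {"agent.trajectory.step": "retrieve_policy"},
    {"agent.trajectory.step": "call_tool:refundStatus"},
    {"agent.trajectory.step": "call_tool:refundStatus"},
    {"agent.trajectory.step": "final_answer"},
]

print(repeated_steps(example_trace))  # {'call_tool:refundStatus': 2}
```

A non-empty result is only a hint that the agent looped or retried; whether the repetition is a bug still depends on the workflow.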

How FutureAGI handles Spring AI

FutureAGI’s approach is to treat Spring AI as an execution surface for Java agents, not just a wrapper around model APIs. With traceAI-spring-ai, a ChatClient call can be captured as a trace containing model spans, advisor activity, vector-store retrieval, tool calls, status, latency, and token fields such as llm.token_count.prompt. The agent path is tracked with fields like agent.trajectory.step, so engineers can see whether a Spring service chose the right tool, looped, or answered from weak context.

Spring AI primitive	What can go wrong	Evaluator
ChatClient	Wrong system prompt, wrong model	`TaskCompletion`
Advisor chain	Memory runs before retrieval when it should run after, or vice versa	`ContextRelevance`
Tool/Function	Wrong tool, wrong arguments	`ToolSelectionAccuracy`
VectorStore retrieval	Stale or off-topic chunks	`ContextRelevance`, `Groundedness`
Structured output	Schema-valid but semantically wrong	`CustomEvaluation`
MCP tool import	Untrusted server, schema drift	`PromptInjection`, `ProtectFlash`

Example: a banking service uses Spring AI to answer “Where is my refund?” The ChatClient applies a memory advisor, retrieves account-policy context, calls refundStatus, and returns a typed response object. FutureAGI scores the trace with ToolSelectionAccuracy for the selected tool, TaskCompletion for the final outcome, ContextRelevance for retrieved policy text, and Groundedness for whether the response stays supported by that context. If the wrong tool fires, the trace becomes a regression case tied to the Java build, prompt version, and advisor order.
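The regression step in this example can be sketched as a small function that compares the tools actually called with the expected one and, on a mismatch, records the build, prompt version, and advisor order. This is an illustrative exact-match check, not the ToolSelectionAccuracy evaluator; the span fields (kind, tool_name), the cancelOrder tool, and the build and prompt identifiers are all hypothetical.

```python
def regression_case(spans, expected_tool, build, prompt_version, advisor_order):
    """Return a regression record if the trace did not call exactly the expected tool."""
    called = [s["tool_name"] for s in spans if s.get("kind") == "tool"]
    if called == [expected_tool]:
        return None  # the expected tool fired once; nothing to record
    return {
        "expected_tool": expected_tool,
        "called_tools": called,
        "java_build": build,
        "prompt_version": prompt_version,
        "advisor_order": list(advisor_order),
    }


# Hypothetical spans for the "Where is my refund?" trace (assumed shape).
example_spans = [
    {"kind": "advisor", "name": "memory"},
    {"kind": "retrieval", "name": "account-policy"},
    {"kind": "tool", "tool_name": "cancelOrder"},  # wrong tool fired
]

case = regression_case(
    example_spans, "refundStatus", "build-123", "prompt-v7", ["memory", "retrieval"]
)
print(case)
```

Storing the advisor order alongside the build makes it possible to tell a prompt regression from a wiring regression when the case is replayed later.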

In our 2026 evals, the hardest Spring AI failures are rarely model-only failures; they are orchestration mismatches between advisor order, tool exposure, and typed output contracts. An engineer can alert on eval-fail-rate-by-cohort, route risky prompts through an Agent Command Center pre-guardrail, trigger model fallback when latency or failures breach policy, or block a deployment when task completion falls below threshold. The public benchmarks most Java teams compare against are BFCL v3 (Berkeley Function Calling Leaderboard, frontier 88-94% in May 2026; useful for grading the Spring @Tool selection path) and τ-bench (Sierra’s multi-turn customer-support benchmark, frontier 55-70%; mirrors the ChatClient + advisor + tool pattern most Spring AI services run).
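The alerting and deployment-gate ideas above can be approximated with a few lines of aggregation. The sketch below assumes each evaluated trace has already been reduced to a cohort label and a task-completion score between 0 and 1; the field names, the 0.7 pass threshold, and the 0.25 maximum fail rate are illustrative assumptions, not FutureAGI defaults.

```python
from collections import defaultdict


def fail_rate_by_cohort(results, pass_threshold=0.7):
    """Fraction of traces per cohort whose task-completion score is below the threshold."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for r in results:
        totals[r["cohort"]] += 1
        if r["task_completion"] < pass_threshold:
            fails[r["cohort"]] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}


def should_block_deploy(results, max_fail_rate=0.25, pass_threshold=0.7):
    """Block the release if any single cohort exceeds the allowed fail rate."""
    rates = fail_rate_by_cohort(results, pass_threshold)
    return any(rate > max_fail_rate for rate in rates.values())


# Hypothetical evaluation results (assumed shape and scores).
example_results = [
    {"cohort": "refunds", "task_completion": 0.9},
    {"cohort": "refunds", "task_completion": 0.4},
    {"cohort": "billing", "task_completion": 0.8},
    {"cohort": "billing", "task_completion": 0.95},
]

print(fail_rate_by_cohort(example_results))  # {'refunds': 0.5, 'billing': 0.0}
print(should_block_deploy(example_results))  # True
```

Gating on the worst cohort rather than the overall average keeps a regression in one customer segment from being hidden by healthy traffic elsewhere.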

How to measure Spring AI reliability

Measure Spring AI by scoring both the final Java-facing result and the intermediate path that produced it.

traceAI-spring-ai spans: show ChatClient calls, tool decisions, advisor effects, latency, status, and model usage for Java services.
ToolSelectionAccuracy: evaluates whether the selected Spring tool matched the user intent and expected action.
TaskCompletion: scores whether the workflow completed the requested task, not only whether the final text sounded plausible.
Groundedness and ContextRelevance: catch RAG failures from weak vector-store retrieval or unsupported final claims.
Dashboard signals: p99 latency, token-cost-per-trace, eval-fail-rate-by-cohort, tool-timeout rate, repeated agent.trajectory.step, and escalation rate.

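```python
from fi.evals import ToolSelectionAccuracy, TaskCompletion, Groundedness

# trace_spans, user_request, final_response, and retrieved_chunks are
# placeholders for data pulled from a captured traceAI-spring-ai trace.
tool_eval = ToolSelectionAccuracy()
task_eval = TaskCompletion()
ground_eval = Groundedness()

# Score the tool decision, the final outcome, and grounding in the retrieved context.
tool_score = tool_eval.evaluate(trajectory=trace_spans, expected_tool="refundStatus")
task_score = task_eval.evaluate(input=user_request, output=final_response)
ground_score = ground_eval.evaluate(output=final_response, context=retrieved_chunks)
print(tool_score.score, task_score.score, ground_score.score)
```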
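Several of the dashboard signals listed in this section are plain aggregations over trace records. The sketch below is one way to compute p99 latency, token-cost-per-trace, and tool-timeout rate; it assumes a hypothetical per-trace record with latency_ms, prompt_tokens, completion_tokens, and tool_timed_out fields, uses a nearest-rank percentile, and applies a single made-up price to all tokens, which is a simplification because providers usually price prompt and completion tokens differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def dashboard_signals(traces, usd_per_1k_tokens):
    """Aggregate per-trace records into three dashboard numbers."""
    latencies = [t["latency_ms"] for t in traces]
    total_tokens = sum(t["prompt_tokens"] + t["completion_tokens"] for t in traces)
    timeouts = sum(1 for t in traces if t["tool_timed_out"])
    return {
        "p99_latency_ms": percentile(latencies, 99),
        "token_cost_per_trace_usd": total_tokens / 1000 * usd_per_1k_tokens / len(traces),
        "tool_timeout_rate": timeouts / len(traces),
    }


# Hypothetical trace records (assumed shape and values).
example_traces = [
    {"latency_ms": 120, "prompt_tokens": 500, "completion_tokens": 100, "tool_timed_out": False},
    {"latency_ms": 340, "prompt_tokens": 800, "completion_tokens": 200, "tool_timed_out": False},
    {"latency_ms": 95, "prompt_tokens": 300, "completion_tokens": 100, "tool_timed_out": False},
    {"latency_ms": 2100, "prompt_tokens": 1500, "completion_tokens": 500, "tool_timed_out": True},
]

signals = dashboard_signals(example_traces, usd_per_1k_tokens=0.002)
print(signals)
```

With only a handful of traces the p99 is simply the slowest one; the number becomes meaningful once it is computed over a production-sized window.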

Common mistakes

The common Spring AI mistakes are production wiring mistakes, not syntax errors:

Registering default tools too broadly. A shared ChatClient builder can expose write tools to read-only flows unless request-level tools override them.
Treating ChatResponse metadata as observability. Token counts and model metadata do not explain why an advisor retrieved bad context.
Testing only controller outputs. Spring MVC tests can pass while advisor ordering, vector-store recall, or tool parameters fail under real traces.
Letting Java type conversion hide semantic errors. An entity() result can deserialize while violating the policy, amount, or action contract.
Ignoring advisor order. Memory before retrieval versus retrieval before memory changes the context the model sees and the score it earns.
Pulling tools from untrusted MCP servers. A third-party MCP server is supply-chain code; treat it the same way as a third-party Maven dependency.
Skipping VectorStore re-indexing after schema changes. A Spring service that adds a field to its source documents needs the index rebuilt, not just the chunker re-run.
Forgetting to scope @Tool annotations. A Spring @Tool left at request scope can be exposed more broadly than intended; pin scope and visibility explicitly.
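The type-conversion mistake in the list above is the easiest to illustrate: a result can deserialize cleanly and still break a business rule. The sketch below shows a semantic check that runs after deserialization. It is written in Python for consistency with the evaluator snippet rather than as Spring AI code, and the RefundDecision type, its fields, and the allowed actions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RefundDecision:
    """Hypothetical typed result, standing in for an entity returned by the model."""
    action: str
    amount: float
    order_total: float


# Assumed business rule: only these actions are valid for this flow.
ALLOWED_ACTIONS = {"approve", "deny", "escalate"}


def semantic_violations(decision):
    """Return rule violations that type conversion alone would not catch."""
    problems = []
    if decision.action not in ALLOWED_ACTIONS:
        problems.append(f"unknown action: {decision.action}")
    if decision.amount < 0:
        problems.append("refund amount is negative")
    if decision.amount > decision.order_total:
        problems.append("refund exceeds order total")
    return problems


# Well-typed but semantically wrong: refunds more than the order was worth.
bad = RefundDecision(action="approve", amount=250.0, order_total=100.0)
print(semantic_violations(bad))  # ['refund exceeds order total']
```

The same idea applies on the Java side: validate the deserialized object against the policy, amount, and action contract before acting on it.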