What Is Mastra?
Mastra is an open-source TypeScript framework for building agents, workflows, tools, memory, and production AI applications.
Mastra is an open-source TypeScript agent framework for building AI agents, workflows, tools, memory, and evaluation-aware applications. As an agent framework, it shows up in production when a Node or web stack coordinates model calls, tool calls, MCP servers, workflow steps, and human approvals. FutureAGI treats Mastra runs as traceable agent trajectories through traceAI:mastra, so engineers can inspect each step, score task completion, and catch tool-selection or loop regressions before release.
Why Mastra matters in production LLM and agent systems
Mastra matters when a prototype agent becomes a production service with user state, tool side effects, and cost controls. A TypeScript team can ship an agent next to a Next.js app, API route, or Node service, but the risk moves from “can it answer?” to “can it act correctly, repeatedly, and within budget?” The common failures are wrong tool selection, unbounded workflow loops, stale memory retrieval, hidden model fallback, and silent prompt drift across releases.
Developers feel the first pain because framework primitives become incident surfaces. A tool approval step may block a refund, an MCP server may return a partial schema, or a workflow branch may call the model four extra times. SRE sees this as p99 latency growth, retry bursts, token-cost-per-trace spikes, and error logs split across app code and agent runtime. Product teams see users receive confident responses for tasks that were not completed.
Agentic systems make this harder than single-turn LLM calls because one Mastra run can include planning, retrieval, memory, tool execution, human approval, and final response generation. Compared with LangGraph or CrewAI, Mastra is especially relevant for JavaScript and TypeScript teams that need agent code to live inside the same deployment, auth, streaming, and frontend workflow as the rest of the product.
How FutureAGI handles Mastra
FutureAGI’s approach is to treat Mastra as a production agent runtime, not just a framework choice. In a FutureAGI workflow, traceAI:mastra instruments a Mastra application so model calls, workflow steps, tool invocations, MCP interactions, and final responses can be analyzed as one trace tree. The key field is agent.trajectory.step: it lets an engineer filter runs by planner, tool, approval gate, memory lookup, or responder step instead of reading raw logs.
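The filtering described above can be sketched in plain Python. This is an illustrative sketch only: the span dictionaries below are a simplified assumption about what an exported trace might look like, not the exact export format of traceAI:mastra.

```python
# Illustrative span records keyed by the agent.trajectory.step attribute.
# The record shape is an assumption for the sketch, not a real export schema.
spans = [
    {"name": "planner", "attributes": {"agent.trajectory.step": "planner"}},
    {"name": "lookup_invoice", "attributes": {"agent.trajectory.step": "tool"}},
    {"name": "approval_gate", "attributes": {"agent.trajectory.step": "approval"}},
    {"name": "responder", "attributes": {"agent.trajectory.step": "responder"}},
]

def steps_of_kind(spans, step_kind):
    """Return the spans whose agent.trajectory.step matches step_kind."""
    return [s for s in spans if s["attributes"].get("agent.trajectory.step") == step_kind]

tool_spans = steps_of_kind(spans, "tool")
print([s["name"] for s in tool_spans])  # ['lookup_invoice']
```

Filtering on a stable step attribute, rather than free-form span names, is what makes planner-only or tool-only views possible across releases.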
Consider a TypeScript support agent built in Mastra. The agent reads a user request, retrieves order context, chooses a billing tool, asks for approval on a refund action, and writes a final response. FutureAGI links the traceAI:mastra span tree to ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, and StepEfficiency. If a release makes the agent choose lookup_invoice instead of create_refund, ToolSelectionAccuracy drops before customer complaints arrive. If the agent finishes with a fluent apology but never completes the refund, TaskCompletion fails the run.
The next action is concrete. The engineer opens the failed trace, checks gen_ai.request.model, llm.token_count.prompt, tool arguments, and the repeated agent.trajectory.step values. They tighten the tool description, add a threshold for repeated steps, and run a regression eval on the affected dataset before promoting the Mastra prompt or workflow change. Agent Command Center can add model fallback or a pre-guardrail where the trace shows model or safety risk, but the primary evidence remains the traced Mastra trajectory.
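The repeated-step threshold mentioned above can be sketched as a small check over the ordered trajectory. This assumes the trajectory is available as an ordered list of step names; the threshold value and step names are illustrative, not defaults from any SDK.

```python
from collections import Counter

def repeated_steps(trajectory, threshold=3):
    """Return step names that occur at least `threshold` times in one run.

    A step repeating past the threshold is a cheap signal for a missing
    stop condition or a loop around a failing tool call.
    """
    counts = Counter(trajectory)
    return {step: n for step, n in counts.items() if n >= threshold}

# Hypothetical run: the agent retries lookup_invoice instead of moving on.
run = ["planner", "lookup_invoice", "lookup_invoice", "lookup_invoice", "responder"]
print(repeated_steps(run))  # {'lookup_invoice': 3}
```

A check like this can gate a release: if a regression eval produces runs with repeated steps above the threshold, the prompt or workflow change is not promoted.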
How to measure or detect Mastra behavior
Measure Mastra at the framework, trace, and evaluator layers:
- ToolSelectionAccuracy evaluates whether the Mastra agent chose the right tool for the user goal and step context.
- TaskCompletion checks whether the full Mastra run achieved the requested outcome, not only whether the final text sounded plausible.
- TrajectoryScore summarizes the ordered agent path across planning, tools, observations, and final response.
- StepEfficiency flags runs that take unnecessary hops, repeat workflow steps, or loop around a missing stop condition.
- agent.trajectory.step identifies skipped, repeated, or reordered Mastra workflow steps across releases.
- token-cost-per-trace and p99 latency by step expose cost and latency regressions in multi-step agent runs.
- escalation rate after tool use catches cases where users must ask a human to fix an agent action.
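The cost and latency metrics above can be sketched with stdlib Python. The span records and field names below mirror the attributes discussed in this section but are an assumed shape for illustration; a real pipeline would read them from exported traces.

```python
from collections import defaultdict
from statistics import quantiles

# Illustrative span records; the shape is an assumption for the sketch.
spans = [
    {"trace_id": "t1", "step": "planner", "latency_ms": 300, "prompt_tokens": 900, "completion_tokens": 150},
    {"trace_id": "t1", "step": "tool", "latency_ms": 120, "prompt_tokens": 0, "completion_tokens": 0},
    {"trace_id": "t1", "step": "responder", "latency_ms": 800, "prompt_tokens": 1200, "completion_tokens": 400},
    {"trace_id": "t2", "step": "planner", "latency_ms": 310, "prompt_tokens": 880, "completion_tokens": 140},
    {"trace_id": "t2", "step": "responder", "latency_ms": 2400, "prompt_tokens": 1300, "completion_tokens": 450},
]

def tokens_per_trace(spans):
    """Sum prompt and completion tokens across all spans of each trace."""
    totals = defaultdict(int)
    for s in spans:
        totals[s["trace_id"]] += s["prompt_tokens"] + s["completion_tokens"]
    return dict(totals)

def latency_percentile_by_step(spans, pct=99):
    """Approximate a latency percentile per step (needs many samples in practice)."""
    by_step = defaultdict(list)
    for s in spans:
        by_step[s["step"]].append(s["latency_ms"])
    out = {}
    for step, vals in by_step.items():
        if len(vals) >= 2:
            out[step] = quantiles(vals, n=100)[pct - 1]
        else:
            out[step] = vals[0]  # too few samples for a real percentile
    return out

print(tokens_per_trace(spans))  # {'t1': 2650, 't2': 2770}
print(latency_percentile_by_step(spans))
```

Tracking these two numbers per step and per release is what turns "the agent feels slower" into an alertable regression.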
Minimal evaluator sketch:
```python
from fi.evals import ToolSelectionAccuracy, TaskCompletion

# Illustrative placeholder inputs for a single run
user_goal = "Refund order #1234"
chosen_tool = "create_refund"
final_answer = "Your refund has been issued."

tool_score = ToolSelectionAccuracy().evaluate(
    input=user_goal,
    response=chosen_tool,
)
task_score = TaskCompletion().evaluate(input=user_goal, response=final_answer)
print(tool_score.score, task_score.score)
```
Common mistakes
- Treating Mastra traces as plain logs. Logs show events; traces preserve parent-child order, span timing, model fields, and workflow structure.
- Scoring only the final response. A polished answer after the wrong tool call is still a failed agent run.
- Changing step names every release. Unstable agent.trajectory.step values break cohort comparisons, alerts, and regression dashboards.
- Ignoring tool approval failures. A safe approval gate can still create a product failure if it blocks necessary actions without fallback behavior.
- Mixing framework and business errors. Separate Mastra runtime failures from domain failures so eval thresholds point to the right owner.
Frequently Asked Questions
What is Mastra?
Mastra is an open-source TypeScript framework for building AI agents, workflows, tools, memory, and production AI applications. FutureAGI traces Mastra runs through traceAI:mastra and evaluates their agent trajectories.
How is Mastra different from LangGraph?
Mastra is TypeScript-first and fits Node, web, and server-side application stacks. LangGraph is usually used in Python-heavy LangChain systems for graph-based agent workflows.
How do you measure Mastra?
Use traceAI:mastra spans such as agent.trajectory.step, then score runs with FutureAGI evaluators including ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, and StepEfficiency.