What Is Multi-Agent Reinforcement Fine-Tuning (MARFT)?
A training method that fine-tunes the LLMs in a multi-agent system using reinforcement learning over the agents' joint trajectory and a shared reward.
Multi-Agent Reinforcement Fine-Tuning (MARFT) is a training technique that fine-tunes the LLMs inside a multi-agent system using reinforcement learning over the joint trajectory the agents produce together. Instead of training each agent on its own reward, MARFT defines a team-level reward (task completion, goal progress, cost) and propagates it to every agent, so coordination behaviour (handoffs, role respect, tool sharing) gets optimized, not just per-agent quality. It is the multi-agent generalization of RLHF for LLM agents that act in a shared environment.
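As a rough illustration of what a shared reward over a joint trajectory looks like as data, the sketch below attaches one team-level reward to every step and groups the steps by agent. The `Step` type and `share_team_reward` function are hypothetical names, and uniform credit is the simplest possible scheme, assumed here only for clarity; practical training loops usually refine it with per-step advantage estimates.

```python
from collections import defaultdict
from typing import NamedTuple


class Step(NamedTuple):
    agent: str   # which agent acted
    action: str  # what it emitted (message, tool call, handoff)


def share_team_reward(trajectory, team_reward):
    """Group steps by agent and attach the same team-level reward to each one."""
    per_agent = defaultdict(list)
    for step in trajectory:
        per_agent[step.agent].append((step.action, team_reward))
    return dict(per_agent)


# One joint trajectory produced by three cooperating agents (illustrative data).
joint_trajectory = [
    Step("planner", "route question to researcher"),
    Step("researcher", "call search tool"),
    Step("planner", "hand findings to writer"),
    Step("writer", "draft answer"),
]

samples = share_team_reward(joint_trajectory, team_reward=0.8)
for agent, steps in samples.items():
    print(agent, steps)
```

Every agent's training samples carry the same reward, which is what distinguishes this setup from scoring each agent separately.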
Why It Matters in Production LLM and Agent Systems
Multi-agent systems fail in ways single agents cannot. A planner agent that is individually 95% accurate can still drive a team-level success rate of 60% if it hands off to the wrong specialist or duplicates a tool call. These failure modes do not show up in single-agent SFT or RLHF runs because there is no other agent to coordinate with: the data simply does not contain joint behaviour.
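A back-of-the-envelope calculation shows how per-step accuracy erodes across a team. The sketch below assumes, purely for illustration, that a task is a chain of independent steps that must all succeed; the independence assumption and the chain lengths are ours, not measurements from any system.

```python
def team_success_rate(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step_accuracy ** num_steps


# A 95%-accurate step repeated across handoffs and tool calls.
for steps in (1, 5, 10):
    print(steps, round(team_success_rate(0.95, steps), 3))
```

Under these assumptions, ten dependent steps at 95% each already land near 60% end to end, which is why per-agent quality scores can look healthy while the team-level number does not.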
The pain lands on the team that owns the agent product. SREs see runaway cost when agents argue in a loop. Backend engineers see retries because one agent emits a payload another agent cannot parse. Product leads see escalations from end users when a “research” agent and an “executor” agent disagree about whether to act. Logs show the same pathological patterns: high agent.trajectory.step counts, low TaskCompletion despite high per-agent quality scores, and agent-handoff spans where the receiver immediately re-asks for context the sender already had.
In 2026 stacks built on LangGraph, CrewAI, AutoGen, and the OpenAI Agents SDK, multi-agent topologies are now common: researcher + writer + critic, or planner + N specialists. MARFT is the answer to the question “how do I train these to actually cooperate?” without hand-crafting per-agent rewards. It is especially relevant when teams move from SFT-only fine-tuning to closed-loop optimization based on production traces.
How FutureAGI Handles MARFT
FutureAGI does not run the MARFT training loop itself; that lives in your training stack. Where FutureAGI plugs in is the reward signal and the regression check via /platform/evaluate. A typical workflow: instrument the multi-agent system with traceAI-langgraph or traceAI-openai-agents so every agent step emits OpenTelemetry spans tagged with agent.trajectory.step, the agent name, and the tool name. Feed those traces into a Dataset via Dataset.add_evaluation, and attach TrajectoryScore, TaskCompletion, and ToolSelectionAccuracy. The combined score becomes the team-level reward your MARFT loop optimizes against. Instead of writing a brittle hand-coded reward, you reuse the same evaluators you already trust in production.
After the MARFT run, you do a regression eval: replay a frozen Scenario set through the pre- and post-MARFT agents using the simulate-sdk’s CloudEngine (or LiveKitEngine for voice agents), score both with the same evaluators, and diff the per-step metrics. FutureAGI’s per-cohort breakdown will tell you which agent’s policy actually moved: sometimes the planner improves but the executor regresses, and only step-level evals expose that. This is also where Agent Command Center’s traffic-mirroring is useful: shadow-deploy the MARFT-trained team alongside the production team, mirror real traffic, and only promote when team-level TaskCompletion clears the threshold. Unlike DSPy’s compile-and-replay loop, MARFT preserves the multi-agent reward shape end-to-end.
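The diff step of that regression eval can be sketched in plain Python once the scores are exported. The dictionaries, function names, and the 0.02 tolerance below are hypothetical stand-ins for whatever your evaluation export looks like; none of this is a FutureAGI API.

```python
from statistics import mean


def per_agent_delta(before, after):
    """Mean step-score change per agent between the pre- and post-training runs.

    Both arguments map an agent name to a list of step-level scores in [0, 1].
    """
    return {
        agent: mean(after[agent]) - mean(before[agent])
        for agent in before
        if agent in after
    }


def regressed_agents(deltas, tolerance=0.02):
    """Agents whose mean score dropped by more than the tolerance."""
    return sorted(agent for agent, delta in deltas.items() if delta < -tolerance)


# Illustrative scores on the same frozen scenario set.
pre = {"planner": [0.70, 0.80], "executor": [0.90, 0.90]}
post = {"planner": [0.85, 0.95], "executor": [0.80, 0.80]}

deltas = per_agent_delta(pre, post)
print(deltas)
print(regressed_agents(deltas))
```

In this made-up data the planner improves while the executor regresses, the pattern that a single team-level number would hide.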
How to Measure or Detect It
Pick signals that capture joint behaviour, not just per-agent output:
- TrajectoryScore: aggregates per-step scores into one trajectory rating; your primary MARFT reward proxy.
- TaskCompletion: 0–1 score on whether the team finished the user’s actual goal.
- ToolSelectionAccuracy: catches the most common MARFT regression, one agent learning to hoard or skip tools.
- StepEfficiency: penalizes runs where the team eventually succeeds but burned 20 steps doing it.
- agent.trajectory.step (OTel): canonical span attribute; filter dashboards by it to compare pre- and post-MARFT step counts.
- handoff-loop rate (dashboard signal): percentage of traces where two agents pass control back and forth more than twice.
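The handoff-loop rate is simple enough to compute yourself from exported traces. The sketch below is one possible reading of the definition, under stated assumptions: each trace is reduced to an ordered list of agent names, and a "bounce" is counted whenever control returns to the agent that held it two turns earlier (A, B, A). The function names and the trace representation are hypothetical.

```python
from itertools import groupby


def max_consecutive_bounces(agents):
    """Longest run of A -> B -> A returns in one trace's agent sequence."""
    # Collapse consecutive steps by the same agent into a single turn.
    turns = [name for name, _ in groupby(agents)]
    best = run = 0
    for i in range(2, len(turns)):
        if turns[i] == turns[i - 2]:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def handoff_loop_rate(traces, max_bounces=2):
    """Fraction of traces where two agents bounce control more than max_bounces times."""
    if not traces:
        return 0.0
    looping = sum(1 for t in traces if max_consecutive_bounces(t) > max_bounces)
    return looping / len(traces)


traces = [
    ["planner", "executor", "planner", "executor", "planner", "executor"],
    ["planner", "researcher", "writer"],
    ["planner", "planner", "executor"],
]
print(handoff_loop_rate(traces))
```

Only the first trace counts as a loop here; the other two hand off without returning control.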
Minimal Python:
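from fi.evals import TrajectoryScore, TaskCompletion
traj = TrajectoryScore()
task = TaskCompletion()
traj_result = traj.evaluate(input=user_goal, trajectory=trace_spans)  # user_goal and trace_spans come from your own traces
task_result = task.evaluate(input=user_goal, trajectory=trace_spans)
team_reward = (traj_result.score + task_result.score) / 2  # unweighted mean used as the team-level reward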
| Tuning method | Signal granularity | Who it optimizes | Coordination target |
|---|---|---|---|
| SFT | Per-response labels | Single agent | None |
| RLHF | Per-response preference | Single agent | None |
| DPO | Pairwise preference | Single agent | Indirect |
| Single-agent RFT | Per-trajectory reward | Single agent | Local |
| MARFT | Joint-trajectory shared reward | Whole team | Explicit |
Public anchors most cited in 2026 MARFT papers: Sierra’s τ-bench (≈220 tool-use tasks, frontier pass^1 ~50-60% with the multi-turn airline split below 40%) is the standard stress test for team-level reward. GAIA Level 3 (frontier <25%) and MLE-Bench (OpenAI, 75 Kaggle competitions, frontier ~10% medal rate) measure long-horizon planning and execution decomposition, where the multi-agent split typically out-scores a single-agent baseline by 8-15 points.
MARFT reward design playbook
The single most-asked question we get on MARFT is “what should the reward actually be?” The 2026 answer that works in production: a weighted composite, anchored on outcome, shaped by step quality, and penalized by cost. A concrete starting point we recommend:
reward = 1.0 * TaskCompletion
+ 0.5 * TrajectoryScore
- 0.1 * StepCount_normalized
- 0.2 * ToolErrorRate
+ 0.3 * GoalProgress # partial credit for not-quite-done runs
The composite shape matters more than the weights. Pure outcome rewards (TaskCompletion only) are sparse and slow to converge for multi-agent teams. Pure step rewards (TrajectoryScore only) reward verbose trajectories that look fine step-by-step but never finish the task. The composite is what teaches the team to do the right thing efficiently. Cost terms keep the team from discovering that “more steps” sometimes wins; without them, MARFT can ship a 4x-more-expensive team that scores marginally higher.
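The composite above translates directly into a small function. This is a minimal sketch using the weights from the formula; the `EpisodeMetrics` container, the `composite_reward` name, and the `max_steps` normalization cap are assumptions introduced for illustration, and the metric values are expected to be precomputed by your evaluators.

```python
from dataclasses import dataclass


@dataclass
class EpisodeMetrics:
    task_completion: float   # 0-1, did the team finish the goal
    trajectory_score: float  # 0-1, aggregated step quality
    goal_progress: float     # 0-1, partial credit for unfinished runs
    tool_error_rate: float   # 0-1, share of tool calls that failed
    step_count: int          # total steps taken by the whole team


def composite_reward(m: EpisodeMetrics, max_steps: int = 20) -> float:
    """Weighted composite: outcome anchor, step-quality shaping, cost penalties."""
    # Assumed normalization: cap step count at max_steps and scale to [0, 1].
    step_count_normalized = min(m.step_count / max_steps, 1.0)
    return (
        1.0 * m.task_completion
        + 0.5 * m.trajectory_score
        - 0.1 * step_count_normalized
        - 0.2 * m.tool_error_rate
        + 0.3 * m.goal_progress
    )


efficient = EpisodeMetrics(1.0, 0.8, 1.0, 0.0, step_count=10)
wasteful = EpisodeMetrics(1.0, 0.8, 1.0, 0.0, step_count=40)
print(composite_reward(efficient), composite_reward(wasteful))
```

Two runs with identical outcomes get different rewards once step count enters the formula, which is the behaviour the cost term is meant to produce.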
The second design question is what to freeze. We usually freeze the executor’s tool definitions during MARFT (the policy can change how tools are called but not which tools exist) so the post-training regression eval is interpretable. Unlike a DSPy compile pass that rewrites prompts and tool wrappers together, MARFT preserves the tool boundary, which makes it easier to attribute regressions back to the agent policy that changed.
Common Mistakes
- Optimizing per-agent rewards and calling it MARFT. If the reward is not a function of the joint trajectory, you have parallel RLHF, not MARFT, and coordination failures will not be fixed.
- Using only end-to-end success as the reward. Sparse rewards make MARFT slow to converge; combine TaskCompletion with TrajectoryScore and StepEfficiency for shaped feedback.
- Skipping the regression eval after training. A team-level reward improvement can hide a per-agent regression; re-score every agent on a frozen Scenario set.
- Training on synthetic trajectories only. Production traces capture failure modes synthetic data misses; use traceAI to mine real handoff failures into the training set.
- Forgetting cost in the reward. Without a cost penalty, MARFT will discover that “more steps” sometimes increases the success rate and ship a 4x-more-expensive team.
- Mixing the policy and the tool surface. If MARFT can change tool definitions, every regression becomes ambiguous. Freeze the tool surface during training.
- Skipping safety eval after MARFT. A team optimized for TaskCompletion can learn unsafe shortcuts; rerun ContentSafety and PromptInjection on the post-MARFT checkpoints before promotion.
Frequently Asked Questions
What is MARFT?
MARFT is a reinforcement-learning fine-tuning method that trains the LLMs inside a multi-agent system using a shared reward over the team's joint trajectory, rather than per-agent rewards.
How is MARFT different from RLHF?
RLHF tunes a single model on per-response feedback. MARFT tunes multiple agents on joint trajectories, so the credit assignment runs across handoffs, tool calls, and inter-agent communication, not just one prompt-response pair.
How do you measure if MARFT improved an agent team?
Run FutureAGI's TaskCompletion and TrajectoryScore evaluators on a held-out scenario set before and after MARFT, then compare per-step ToolSelectionAccuracy to find which agent's policy actually moved.