MARFT is a reinforcement-learning fine-tuning method that trains the LLMs inside a multi-agent system using a shared reward over the team's joint trajectory, rather than per-agent rewards.

How is MARFT different from RLHF?

RLHF tunes a single model on per-response feedback. MARFT tunes multiple agents on joint trajectories, so the credit assignment runs across handoffs, tool calls, and inter-agent communication, not just one prompt-response pair.

How do you measure if MARFT improved an agent team?

Run FutureAGI's TaskCompletion and TrajectoryScore evaluators on a held-out scenario set before and after MARFT, then compare per-step ToolSelectionAccuracy to find which agent's policy actually moved.

What Is MARFT? Multi-Agent RL Fine-Tuning Explained (2026)

What Is Multi-Agent Reinforcement Fine-Tuning (MARFT)?

Multi-Agent Reinforcement Fine-Tuning (MARFT) is a training technique that fine-tunes the LLMs inside a multi-agent system using reinforcement learning over the joint trajectory the agents produce together. Instead of training each agent on its own reward, MARFT defines a team-level reward (task completion, goal progress, cost) and assigns credit for it across agents, so that coordination behaviour (handoffs, role respect, tool sharing) gets optimized, not just per-agent quality. It is the multi-agent generalization of RLHF for LLM agents that act in a shared environment.

Why It Matters in Production LLM and Agent Systems

Multi-agent systems fail in ways single agents cannot. A planner agent that is individually 95% accurate can still drive a team-level success rate of 60% if it hands off to the wrong specialist or duplicates a tool call. These failure modes do not show up in single-agent SFT or RLHF runs because there is no other agent to coordinate with, so the training data simply does not contain joint behaviour.
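
One way to see how high per-agent accuracy can coexist with a much lower team success rate is a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that a task succeeds only if every coordination step succeeds and that steps fail independently. Neither assumption comes from the text above (real handoff errors are usually correlated), and `team_success_rate` is a hypothetical helper, not part of any library.

```python
def team_success_rate(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Illustrative only: how a 95%-accurate step compounds over longer chains.
for steps in (1, 5, 10):
    print(steps, round(team_success_rate(0.95, steps), 3))
```

Under these assumptions, ten dependent steps at 0.95 each leave the team at roughly 0.60, which is why per-agent scores alone say little about joint behaviour.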

The pain lands on the team that owns the agent product. SREs see runaway cost when agents argue in a loop. Backend engineers see retries because one agent emits a payload another agent cannot parse. Product leads see escalations from end users when a “research” agent and an “executor” agent disagree about whether to act. Logs show the same pathological patterns: high agent.trajectory.step counts, low TaskCompletion despite high per-agent quality scores, and agent-handoff spans where the receiver immediately re-asks for context the sender already had.

In 2026 stacks built on LangGraph, CrewAI, AutoGen, and the OpenAI Agents SDK, multi-agent topologies are now common: researcher + writer + critic, or planner + N specialists. MARFT answers the question “how do I train these agents to actually cooperate?” without requiring hand-crafted per-agent rewards. It is especially relevant when teams move from SFT-only fine-tuning to closed-loop optimization based on production traces.

How FutureAGI Handles MARFT

FutureAGI does not run the MARFT training loop itself — that lives in your training stack. Where FutureAGI plugs in is the reward signal and the regression check. A typical workflow: instrument the multi-agent system with traceAI-langgraph or traceAI-openai-agents so every agent step emits OpenTelemetry spans tagged with agent.trajectory.step, the agent name, and the tool name. Feed those traces into a Dataset via Dataset.add_evaluation, and attach TrajectoryScore, TaskCompletion, and ToolSelectionAccuracy. The combined score becomes the team-level reward your MARFT loop optimizes against — instead of writing a brittle hand-coded reward, you reuse the same evaluators you already trust in production.

After the MARFT run, you do a regression eval: replay a frozen Scenario set through the pre- and post-MARFT agents using the simulate-sdk’s CloudEngine (or LiveKitEngine for voice agents), score both with the same evaluators, and diff the per-step metrics. FutureAGI’s per-cohort breakdown will tell you which agent’s policy actually moved — sometimes the planner improves but the executor regresses, and only step-level evals expose that. This is also where Agent Command Center’s traffic-mirroring is useful: shadow-deploy the MARFT-trained team alongside the production team, mirror real traffic, and only promote when team-level TaskCompletion clears the threshold.
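
The per-agent diff described above can be sketched in plain Python. This is a minimal illustration, not FutureAGI's implementation: the input shape (`{agent: {metric: mean_score}}`), the `find_regressions` helper, the `tolerance` parameter, and the sample numbers are all assumptions made up for the example. In practice the scores would come from running the same evaluators on the frozen Scenario set before and after training.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: {agent_name: {metric_name: mean_score}}.
Scores = Dict[str, Dict[str, float]]


def find_regressions(
    before: Scores, after: Scores, tolerance: float = 0.01
) -> List[Tuple[str, str, float]]:
    """Return (agent, metric, delta) for each metric that dropped by more than tolerance."""
    regressions = []
    for agent, metrics in before.items():
        for metric, old in metrics.items():
            new = after.get(agent, {}).get(metric)
            if new is None:
                continue  # metric missing after training; handle separately
            delta = new - old
            if delta < -tolerance:
                regressions.append((agent, metric, round(delta, 4)))
    return regressions


# Made-up scores: the planner improves while the executor regresses.
before = {
    "planner": {"ToolSelectionAccuracy": 0.80},
    "executor": {"ToolSelectionAccuracy": 0.90},
}
after = {
    "planner": {"ToolSelectionAccuracy": 0.88},
    "executor": {"ToolSelectionAccuracy": 0.84},
}
print(find_regressions(before, after))
```

With these sample numbers only the executor is flagged, even though the planner's gain could lift the averaged team-level score.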

How to Measure or Detect It

Pick signals that capture joint behaviour, not just per-agent output:

TrajectoryScore: aggregates per-step scores into one trajectory rating — your primary MARFT reward proxy.
TaskCompletion: 0–1 score on whether the team finished the user’s actual goal.
ToolSelectionAccuracy: catches the most common MARFT regression — one agent learning to hoard or skip tools.
StepEfficiency: penalizes runs where the team eventually succeeds but burned 20 steps doing it.
agent.trajectory.step (OTel): canonical span attribute; filter dashboards by it to compare pre- and post-MARFT step counts.
handoff-loop rate (dashboard signal): percentage of traces where two agents pass control back and forth more than twice.
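
The handoff-loop rate in the last item can be approximated from exported traces. The sketch below assumes each trace has already been reduced to an ordered list of agent names, one per step (a hypothetical shape, not a traceAI format). It reads "back and forth more than twice" as more than two consecutive handoffs between the same pair of agents, which is one possible interpretation. `has_handoff_loop`, `handoff_loop_rate`, and `max_bounces` are illustrative names.

```python
from typing import List, Sequence


def has_handoff_loop(agents: Sequence[str], max_bounces: int = 2) -> bool:
    """True if the same two agents trade control more than max_bounces times in a row."""
    # Collapse consecutive steps by the same agent so only handoffs remain.
    owners = [a for i, a in enumerate(agents) if i == 0 or a != agents[i - 1]]
    run = 0  # consecutive handoffs between the current pair of agents
    for i in range(1, len(owners)):
        if i >= 2 and owners[i] == owners[i - 2]:
            run += 1  # control went back to the agent that held it two hops ago
        else:
            run = 1  # first handoff of a new pair
        if run > max_bounces:
            return True
    return False


def handoff_loop_rate(traces: List[Sequence[str]], max_bounces: int = 2) -> float:
    """Percentage of traces that contain a handoff loop."""
    if not traces:
        return 0.0
    looping = sum(has_handoff_loop(t, max_bounces) for t in traces)
    return 100.0 * looping / len(traces)


# Made-up traces: the second and fourth bounce between planner and executor.
traces = [
    ["planner", "executor"],
    ["planner", "executor", "planner", "executor"],
    ["planner", "planner", "executor"],
    ["planner", "executor", "planner", "executor", "planner"],
]
print(handoff_loop_rate(traces))
```

Comparing this number on the same scenario set before and after training shows whether the team learned to stop bouncing control.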

Minimal Python:

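from fi.evals import TrajectoryScore, TaskCompletion

# user_goal (the user's request) and trace_spans (the spans recorded for one
# team run) are assumed to be defined by the caller.
traj = TrajectoryScore()
task = TaskCompletion()

# Score the same joint trajectory with both evaluators.
traj_result = traj.evaluate(input=user_goal, trajectory=trace_spans)
task_result = task.evaluate(input=user_goal, trajectory=trace_spans)

# Team-level reward: unweighted mean of the two scores.
team_reward = (traj_result.score + task_result.score) / 2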

Common Mistakes

Optimizing per-agent rewards and calling it MARFT. If the reward is not a function of the joint trajectory, you have parallel RLHF, not MARFT — and coordination failures will not be fixed.
Using only end-to-end success as the reward. Sparse rewards make MARFT slow to converge; combine TaskCompletion with TrajectoryScore and StepEfficiency for shaped feedback.
Skipping the regression eval after training. A team-level reward improvement can hide a per-agent regression; re-score every agent on a frozen Scenario set.
Training on synthetic trajectories only. Production traces capture failure modes synthetic data misses; use traceAI to mine real handoff failures into the training set.
Forgetting cost in the reward. Without a cost penalty, MARFT will discover that “more steps” sometimes increases the success rate, and you will ship a team that is 4x more expensive.
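
The second and last mistakes point toward a shaped reward that mixes several signals and penalizes overspending. The sketch below is one hedged way to write that down. The weights, the budget, the penalty form, and the function name `shaped_team_reward` are all assumptions chosen for illustration. The three score arguments stand in for values in [0, 1] such as those produced by TaskCompletion, TrajectoryScore, and StepEfficiency.

```python
def shaped_team_reward(
    task_completion: float,
    trajectory_score: float,
    step_efficiency: float,
    cost_usd: float,
    *,
    weights: tuple = (0.5, 0.3, 0.2),  # hypothetical weights, tune per product
    cost_budget_usd: float = 1.0,      # hypothetical per-run budget
    cost_penalty: float = 0.5,         # reward lost per 1x of budget overspent
) -> float:
    """Weighted quality score minus a linear penalty for exceeding the cost budget."""
    if cost_budget_usd <= 0:
        raise ValueError("cost_budget_usd must be positive")
    w_task, w_traj, w_eff = weights
    quality = (
        w_task * task_completion
        + w_traj * trajectory_score
        + w_eff * step_efficiency
    )
    # No penalty while under budget; linear in the overspend ratio beyond it.
    overspend = max(0.0, cost_usd / cost_budget_usd - 1.0)
    return quality - cost_penalty * overspend


# Made-up runs: both finish the task, but the second spends 4x the budget.
cheap_run = shaped_team_reward(1.0, 0.8, 0.9, cost_usd=0.5)
pricey_run = shaped_team_reward(1.0, 0.8, 0.3, cost_usd=4.0)
print(cheap_run > pricey_run)
```

Because both runs complete the task, a reward based on TaskCompletion alone would rank them equally. The penalty term is what separates them.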