What Is MARFT?
MARFT is multi-agent reinforcement fine-tuning: training multiple LLM agents with reinforcement feedback from complete coordinated trajectories.
MARFT, or multi-agent reinforcement fine-tuning, is a training method that tunes several LLM agents using reinforcement feedback from full multi-step trajectories. It belongs to the agent reliability family because the unit being improved is not a single answer but a coordinated run across planners, tools, handoffs, and final outcomes. In production, FutureAGI teams encounter MARFT in training workflows, regression eval pipelines, and trace comparisons of pre-training versus post-training agent behavior.
Why MARFT matters in production LLM and agent systems
MARFT matters because multi-agent failures often hide between agents. A planner may set the wrong subgoal, a worker agent may choose a weak tool, a verifier may accept a partial answer, and the final coordinator may still return a fluent success message. If training rewards only the final answer, the system can learn coordination shortcuts: over-trusting one agent, skipping verification, or optimizing for short trajectories that fail on ambiguous tasks.
Developers feel this as brittle behavior after a training run that looked strong in aggregate. SREs see longer traces, retry bursts, elevated p99 latency, and token-cost-per-trace spikes when agents negotiate or repeat work. Product teams see lower task completion on edge cohorts even while average benchmark scores move up. Compliance teams care because multi-agent systems can distribute responsibility across steps, making it harder to explain why a write action, escalation, or refusal happened.
The common symptoms are disagreement between agents, repeated handoffs, rising tool-error rate, evaluator pass rates that diverge by scenario, and traces where the final answer hides a failed intermediate step. This is especially relevant for 2026-era pipelines because agents now call MCP tools, delegate to subagents, retrieve context, and route through gateways. A MARFT-trained system needs per-step evaluation, not only a reward curve from training.
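As an illustration of per-step evaluation, a check like the sketch below walks a logged run and surfaces intermediate failures masked by a successful final step. The trace schema and field names here are hypothetical; adapt them to your own logging format.

# Hypothetical per-step records from one logged run (agent.trajectory.step).
trace = [
    {"step": "plan", "agent": "planner", "status": "ok"},
    {"step": "tool_call", "agent": "worker", "tool": "search", "status": "error"},
    {"step": "verify", "agent": "verifier", "status": "ok"},
    {"step": "respond", "agent": "coordinator", "status": "ok"},
]

def hidden_failures(steps):
    # Flag runs whose final step succeeded despite a failed intermediate step.
    final_ok = steps[-1]["status"] == "ok"
    failed = [s for s in steps[:-1] if s["status"] != "ok"]
    return failed if final_ok else []

for step in hidden_failures(trace):
    print(f"{step['agent']} failed at step {step['step']}")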
How FutureAGI handles MARFT
MARFT is not a named FutureAGI product surface; it is a training method. FutureAGI's approach is to evaluate what the training changed, then decide whether the new multi-agent policy is safer, cheaper, and more effective than the baseline. The workflow starts by logging the baseline and MARFT candidate runs as separate dataset cohorts, then comparing their traces step by step via agent.trajectory.step so each plan, tool call, observation, handoff, and final response can be scored.
A real example: a research agent team has a planner, a retriever, a calculator, and a verifier. After MARFT, the team completes more tasks but sometimes lets the verifier skip source checks when the calculator returns a plausible number. FutureAGI evaluates the same held-out scenarios before and after training. TaskCompletion checks whether the user goal was met. ToolSelectionAccuracy checks whether each agent chose the right tool at the right step. ReasoningQuality checks whether the intermediate rationale supports the action sequence. TrajectoryScore aggregates the run-level behavior into a score that can be tracked by cohort.
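A batch-scoring sketch for those held-out scenarios follows; it assumes each evaluator class exposes the same evaluate(input=..., trajectory=..., output=...) call shown in the minimal check later on this page, and score_scenarios and the scenario tuples are illustrative names, not a fixed FutureAGI API.

from fi.evals import (
    ReasoningQuality, TaskCompletion, ToolSelectionAccuracy, TrajectoryScore,
)

def score_scenarios(scenarios):
    # scenarios: (task, logged_steps, final_answer) tuples from one cohort.
    evaluators = [TaskCompletion(), ToolSelectionAccuracy(),
                  ReasoningQuality(), TrajectoryScore()]
    return [
        {type(ev).__name__: ev.evaluate(input=task, trajectory=steps, output=answer).score
         for ev in evaluators}
        for task, steps, answer in scenarios
    ]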
Unlike single-agent RLHF or prompt optimizers such as ProTeGi and GEPA, MARFT changes the policy of a coordinated agent team. The engineer should set a release threshold such as “candidate TrajectoryScore improves by at least 5% while tool-error rate and p99 latency do not regress.” If the MARFT cohort improves task completion but fails tool selection, the fix is a regression eval or training-data adjustment, not a blind rollout.
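That release rule can be encoded as a small gate. The function and aggregate field names below are hypothetical, and the 5% margin simply mirrors the example threshold above:

def release_gate(baseline, candidate, min_gain=0.05):
    # baseline/candidate are cohort-level aggregates, e.g.
    # {"trajectory_score": 0.71, "tool_error_rate": 0.04, "p99_latency_s": 9.2}.
    gain = (candidate["trajectory_score"] - baseline["trajectory_score"]) \
        / baseline["trajectory_score"]
    no_regression = (candidate["tool_error_rate"] <= baseline["tool_error_rate"]
                     and candidate["p99_latency_s"] <= baseline["p99_latency_s"])
    return gain >= min_gain and no_regression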
How to measure or detect MARFT effects
MARFT is measured by its effect on agent behavior, not by a single standalone score. Track pre-training versus post-training cohorts and slice results by task type, tool, handoff, and customer segment; a minimal slicing sketch follows the list below.
- TrajectoryScore - returns a run-level score for the quality of the agent trajectory across steps, actions, and outcome.
- TaskCompletion - checks whether the agent team completed the requested goal, not only whether the final answer sounded correct.
- ToolSelectionAccuracy - detects whether MARFT improved or harmed tool choice at each agent.trajectory.step.
- Trace fields - inspect agent.trajectory.step, tool status, retry count, handoff count, latency, and token usage for each run.
- Dashboard signals - compare eval-fail-rate-by-cohort, p99 latency, token-cost-per-trace, handoff count, and tool-error rate.
- User proxies - monitor escalation rate, thumbs-down rate, manual override frequency, and “agent said done but failed” annotations.
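The slicing sketch, assuming each logged run carries cohort, task_type, and eval_failed fields (the names are illustrative):

from collections import defaultdict

def fail_rate_by_slice(runs):
    # Eval-fail-rate per (cohort, task_type) slice, e.g. ("marft_candidate", "long_horizon").
    totals, fails = defaultdict(int), defaultdict(int)
    for run in runs:
        key = (run["cohort"], run["task_type"])
        totals[key] += 1
        fails[key] += run["eval_failed"]  # 0 or 1 per run
    return {key: fails[key] / totals[key] for key in totals}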
Minimal evaluator check:
from fi.evals import TrajectoryScore

# Placeholder inputs (illustrative): the task prompt, the logged trajectory
# steps, and the candidate run's final response.
task = "Summarize Q3 revenue drivers with cited sources."
marft_agent_steps = [...]  # agent.trajectory.step records from one run
final_answer = "..."

result = TrajectoryScore().evaluate(  # scores the whole run, not only the answer
    input=task,
    trajectory=marft_agent_steps,
    output=final_answer,
)
print(result.score, result.reason)
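Run the same check on the baseline cohort and compare score distributions by slice, not single runs, before deciding whether the MARFT candidate ships.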
Common mistakes
The main mistake is treating MARFT as a training win before checking production behavior.
- Rewarding only the final answer. Multi-agent systems can learn to hide bad intermediate steps behind a polished response.
- Comparing averages without cohorts. MARFT may improve simple tasks while harming long-horizon, tool-heavy, or compliance-sensitive scenarios.
- Ignoring cost and latency. Better task completion is not enough if handoff count, retries, or token-cost-per-trace move sharply.
- Training on evaluator leakage. If reward signals mirror the test rubric too closely, post-training evals can overstate real reliability.
- Skipping trace review. A reward curve cannot show whether the planner, worker, retriever, or verifier caused the regression.
Frequently Asked Questions
What is MARFT?
MARFT, or multi-agent reinforcement fine-tuning, trains several LLM agents with reinforcement feedback from complete trajectories so the system can improve coordination, tool choice, handoffs, and task outcomes.
How is MARFT different from MAPoRL?
MARFT is the broader idea of reinforcement fine-tuning a multi-agent system. MAPoRL is an adjacent post-training reinforcement learning pattern focused on improving multi-agent behavior after an initial training stage.
How do you measure MARFT?
In FutureAGI, measure MARFT results with TrajectoryScore, TaskCompletion, ToolSelectionAccuracy, and trace fields such as agent.trajectory.step. Compare pre-training and post-training cohorts before release.