RAGEN is an open-source reinforcement-learning framework for training LLM agents in multi-turn stochastic environments, built around the StarPO algorithm for full-trajectory policy optimization.

How is RAGEN different from RLHF?

RLHF optimises single-turn responses against a preference reward. RAGEN optimises full multi-step trajectories — state, thinking, action, reward — for agents that interact with environments over many turns, where credit assignment is the hard problem.

How does FutureAGI fit into a RAGEN training loop?

RAGEN trains the agent; FutureAGI evaluates it. Run TaskCompletion, TrajectoryScore, and ReasoningQuality on the trained checkpoint against your golden dataset, and use the eval delta to gate every training run.

What Is RAGEN? LLM Agent Training System Definition (2026)

What Is RAGEN?

RAGEN is an open-source framework for training LLM-based agents using reinforcement learning in multi-turn, stochastic environments. Maintained as a research artifact, RAGEN formalises agent-environment interaction as a Markov Decision Process and introduces the StarPO algorithm — State-Thinking-Action-Reward Policy Optimization — which optimises over entire trajectories rather than single turns. It also ships stability mechanisms (RAGEN-D, RAGEN-S) that prevent the “echo trap” failure where self-play agents collapse into degenerate policies. RAGEN sits in the agent-training corner of the LLM stack, adjacent to but distinct from production RAG inference systems.
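To make the trajectory-level framing concrete, here is a minimal sketch of a multi-turn episode and its return. It is not taken from the RAGEN codebase: the `Turn` class, its field names, the `trajectory_return` helper, and the toy episode are all assumptions chosen for illustration. The point is only that the quantity being optimised is summed over the whole episode, so a turn with zero immediate reward can still matter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One step of a multi-turn episode (hypothetical field names)."""
    state: str      # observation the agent saw
    thinking: str   # reasoning text the agent produced
    action: str     # action sent to the environment
    reward: float   # scalar reward returned for this turn


def trajectory_return(turns: List[Turn], gamma: float = 1.0) -> float:
    """Discounted sum of rewards over the whole episode."""
    return sum((gamma ** t) * turn.reward for t, turn in enumerate(turns))


# Toy episode with a sparse reward that only arrives on the final turn.
episode = [
    Turn("door locked", "I need a key first", "pick_up_key", 0.0),
    Turn("holding key", "now unlock the door", "unlock_door", 0.0),
    Turn("door open", "walk through", "move_forward", 1.0),
]

undiscounted = trajectory_return(episode)
discounted = trajectory_return(episode, gamma=0.5)
print(undiscounted, discounted)
```

A single-turn objective would see zero reward for the first two actions. Scoring the full trajectory is what lets the early key pickup receive credit.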

Why It Matters in Production LLM and Agent Systems

Most teams shipping agents in 2026 do not train their own. They prompt-engineer or lightly fine-tune a base model and call it good. But frontier teams building voice AI, coding agents, and computer-use agents increasingly need RL post-training to bake task-specific behaviour into the policy. The challenge is that agent environments are multi-turn and stochastic: tool outputs change, environments react, and credit assignment is hard. Naive RLHF, designed for single-turn preference data, fails in these settings.

The pain is concrete. A coding-agent team trains a reward model on single-turn code-fix preferences and finds the resulting agent loops on the same broken plan. A voice-agent team applies DPO to multi-turn conversation traces and discovers the agent has memorised a single high-reward trajectory and lost diversity. A research team running self-play on a tool-using agent watches the policy collapse into “always call this one tool” because the reward landscape was sparse and the agent found a local optimum that was an echo of itself.

RAGEN exists to solve those failure modes specifically — multi-turn credit assignment, trajectory-level optimization, and stability under self-play. For teams with the GPU budget, it is one of the few open-source pipelines designed for the agent-training problem instead of being borrowed from single-turn RLHF.

How FutureAGI Handles RAGEN-Trained Agents

FutureAGI doesn’t train agents — we evaluate them. The contract with a RAGEN training loop is clean: RAGEN produces a checkpoint, FutureAGI scores it. The team registers the trained checkpoint, points a Dataset of held-out tasks at it, and attaches TaskCompletion, TrajectoryScore, ReasoningQuality, StepEfficiency, and ToolSelectionAccuracy via Dataset.add_evaluation(). The output is a per-checkpoint scorecard. Compare against the prior checkpoint and against the SFT-only baseline; gate the release on regression deltas.

Concretely: a team training a multi-turn coding agent with RAGEN runs each StarPO checkpoint through FutureAGI’s golden dataset of 500 coding tasks. TaskCompletion rises from 64% on SFT-only to 81% on the StarPO checkpoint. TrajectoryScore rises but StepEfficiency falls: the new policy is more capable but uses more steps per task. The eval-fail-rate-by-cohort breakdown shows that the gain is concentrated on debugging tasks (which involve multi-turn exploration) and not on greenfield code generation. The team ships StarPO for the debugging route via Agent Command Center conditional routing, keeps SFT-only on the simpler route, and uses the eval signal to drive the next training cycle. FutureAGI is the eval scaffolding around RAGEN, not a replacement for it.
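The cohort breakdown in that example can be approximated in a few lines. This is a minimal sketch, not FutureAGI's implementation: the `fail_rate_by_cohort` helper, the cohort labels, and the pass/fail records below are all hypothetical and do not reproduce the numbers quoted above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fail_rate_by_cohort(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each cohort label to the fraction of its tasks that failed."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for cohort, passed in results:
        totals[cohort] += 1
        if not passed:
            failures[cohort] += 1
    return {cohort: failures[cohort] / totals[cohort] for cohort in totals}


# Hypothetical per-task outcomes as (cohort, passed) pairs.
sft_results = [
    ("debugging", False), ("debugging", False),
    ("debugging", True), ("debugging", True),
    ("greenfield", True), ("greenfield", False),
]
starpo_results = [
    ("debugging", False), ("debugging", True),
    ("debugging", True), ("debugging", True),
    ("greenfield", True), ("greenfield", False),
]

sft_rates = fail_rate_by_cohort(sft_results)
starpo_rates = fail_rate_by_cohort(starpo_results)

# Negative delta means the new checkpoint fails less often in that cohort.
delta = {cohort: starpo_rates[cohort] - sft_rates[cohort] for cohort in sft_rates}
print(delta)
```

Splitting the delta by cohort is what supports a routing decision: an aggregate improvement can hide a cohort where the new checkpoint gains nothing.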

How to Measure or Detect It

RAGEN-trained agent checkpoints need trajectory-level measurement, not single-turn:

TaskCompletion: 0–1 score for whether the agent reached the user’s goal across the full trajectory.
TrajectoryScore: composite trajectory rating; the canonical RAGEN-paired metric.
ReasoningQuality: scores whether the agent’s chain-of-thought is logically valid given observations.
StepEfficiency: catches the “more capable but wasteful” regression that RL can introduce.
ToolSelectionAccuracy: scores tool choices step-by-step.
Per-checkpoint regression delta: alert on >2% drop on any of the above vs. the prior checkpoint.

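from fi.evals import TaskCompletion, TrajectoryScore, StepEfficiency

# One evaluator instance per metric; each is meant to score a full trajectory.
task = TaskCompletion()
trajectory = TrajectoryScore()
efficiency = StepEfficiency()

# trace_spans: the recorded spans for one agent run.
# user_goal: the task the agent was asked to complete.
# Both are placeholders here and must come from your own tracing and dataset code.
result = trajectory.evaluate(trajectory=trace_spans, goal=user_goal)
print(result.score, result.reason)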
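The per-checkpoint regression delta from the list above can be turned into a simple gate. The sketch below is an assumption-laden illustration, not part of the FutureAGI SDK: the `regressions` helper and the scorecard values are hypothetical, and it reads the 2% threshold as an absolute drop of 0.02 on a 0–1 score.

```python
from typing import Dict, List


def regressions(
    prior: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.02,
) -> List[str]:
    """Return the metrics whose score fell by more than max_drop.

    Only metrics present in both scorecards are compared.
    """
    return sorted(
        name
        for name, old_score in prior.items()
        if name in candidate and old_score - candidate[name] > max_drop
    )


# Hypothetical scorecards on a 0-1 scale, one entry per evaluator.
prior_checkpoint = {"TaskCompletion": 0.70, "StepEfficiency": 0.80}
new_checkpoint = {"TaskCompletion": 0.75, "StepEfficiency": 0.70}

flagged = regressions(prior_checkpoint, new_checkpoint)
if flagged:
    print("Blocked by regressions on:", ", ".join(flagged))
else:
    print("No regressions beyond threshold")
```

Checking each metric separately matters here: the hypothetical new checkpoint improves on TaskCompletion yet is still flagged for StepEfficiency, which an averaged score would hide.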

Common Mistakes

Optimising RAGEN reward without out-of-distribution eval. Self-play converges; held-out eval is what reveals whether the policy generalises.
Skipping StepEfficiency. RL policies can be capable and wasteful; track step count alongside completion.
Reward gaming. If your reward model has gaps, RAGEN finds them — eval against a separate evaluator suite, not the reward model.
Ignoring the echo trap. Without RAGEN-D / RAGEN-S stabilization, self-play collapses; track diversity metrics on the trajectory pool.
Comparing RAGEN against single-turn RLHF on single-turn benchmarks. That is not the regime RAGEN was designed for; use multi-turn trajectory benchmarks.
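One simple way to track diversity on the trajectory pool, as suggested for the echo trap above, is the entropy of the actions the agent takes. This is a hedged sketch rather than a metric defined by RAGEN or FutureAGI: the `action_entropy` helper and the tool names are hypothetical, and action entropy is only one of several possible diversity signals.

```python
import math
from collections import Counter
from typing import Sequence


def action_entropy(actions: Sequence[str]) -> float:
    """Shannon entropy, in bits, of the empirical action distribution."""
    if not actions:
        return 0.0
    total = len(actions)
    counts = Counter(actions)
    # p * log2(1/p) summed over distinct actions.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical action pools sampled from two checkpoints.
diverse_pool = ["search", "read_file", "run_tests", "edit_file"] * 2
collapsed_pool = ["search"] * 8

diverse_bits = action_entropy(diverse_pool)
collapsed_bits = action_entropy(collapsed_pool)
print(diverse_bits, collapsed_bits)
```

An entropy that trends toward zero across checkpoints is a cheap early warning that the policy is repeating one action, and it can be logged alongside the evaluator scores.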