What Is RAGEN?

An open-source reinforcement-learning framework for training LLM agents in multi-turn stochastic environments using the StarPO trajectory-optimization algorithm.

RAGEN is an open-source framework for training LLM-based agents using reinforcement learning in multi-turn, stochastic environments. Maintained as a research artifact, RAGEN formalises agent-environment interaction as a Markov Decision Process and introduces the StarPO algorithm (State-Thinking-Actions-Reward Policy Optimization), which optimises over entire trajectories rather than single turns. It also ships stability mechanisms (RAGEN-D, RAGEN-S) that prevent the “echo trap” failure where self-play agents collapse into degenerate policies. RAGEN sits in the agent-training corner of the LLM stack, adjacent to but distinct from production RAG inference systems.
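
To make the trajectory-level framing concrete, here is a minimal Python sketch that represents a multi-turn rollout as state, thinking, action, reward tuples and computes a discounted return for every turn. The `Turn` class, the `discounted_returns` function, the discount factor `gamma`, and the toy rollout are illustrative assumptions; none of them come from RAGEN's codebase, and this is not the StarPO update itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One step of a multi-turn rollout (hypothetical structure)."""
    state: str
    thinking: str
    action: str
    reward: float


def discounted_returns(trajectory: List[Turn], gamma: float = 1.0) -> List[float]:
    """Return-to-go for each turn, accumulated backwards from the final turn."""
    returns = [0.0] * len(trajectory)
    running = 0.0
    for i in range(len(trajectory) - 1, -1, -1):
        running = trajectory[i].reward + gamma * running
        returns[i] = running
    return returns


# Toy rollout with a sparse reward that only arrives on the last turn.
trajectory = [
    Turn("bug report", "need to see the failing test", "run_tests", 0.0),
    Turn("test output", "the off-by-one is in the loop", "edit_file", 0.0),
    Turn("patched file", "verify the fix", "run_tests", 1.0),
]
returns = discounted_returns(trajectory, gamma=1.0)
```

With the reward arriving only on the final turn, the trajectory-level return still credits the two earlier actions, whereas a single-turn view would assign them a reward of zero.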

Why It Matters in Production LLM and Agent Systems

Most teams shipping agents in 2026 do not train their own. They prompt-engineer or fine-tune lightly on a base model and call it good. But frontier teams (voice AI, coding agents, computer-use agents) increasingly need RL post-training to bake task-specific behaviour into the policy. The challenge is that agent environments are multi-turn and stochastic: tool outputs change, environments react, and credit assignment is hard. Naive RLHF, designed for single-turn preference data, fails in these settings.

The pain is concrete. A coding-agent team trains a reward model on single-turn code-fix preferences and finds the resulting agent loops on the same broken plan. A voice-agent team applies DPO to multi-turn conversation traces and discovers the agent has memorised a single high-reward trajectory and lost diversity. A research team running self-play on a tool-using agent watches the policy collapse into “always call this one tool” because the reward landscape was sparse and the agent found a local optimum that was an echo of itself.

RAGEN exists to address those failure modes specifically: multi-turn credit assignment, trajectory-level optimization, and stability under self-play. For teams with the GPU budget, it is one of the few open-source pipelines designed for the agent-training problem instead of being borrowed from single-turn RLHF.

How FutureAGI Handles RAGEN-Trained Agents

FutureAGI doesn’t train agents; we evaluate them. The contract with a RAGEN training loop is clean: RAGEN produces a checkpoint, FutureAGI scores it. The team registers the trained checkpoint, points a Dataset of held-out tasks at it, and attaches TaskCompletion, TrajectoryScore, ReasoningQuality, StepEfficiency, and ToolSelectionAccuracy via Dataset.add_evaluation(). The output is a per-checkpoint scorecard. Compare it against the prior checkpoint and against the SFT-only baseline, and gate the release on regression deltas.

Concretely: a team training a multi-turn coding agent with RAGEN runs each StarPO checkpoint through FutureAGI’s golden dataset of 500 coding tasks. TaskCompletion rises from 64% on SFT-only to 81% on the StarPO checkpoint. TrajectoryScore rises but StepEfficiency falls: the new policy is more capable but uses more steps per task. The eval-fail-rate-by-cohort breakdown reveals that the gain is concentrated on debugging tasks (which involve multi-turn exploration) and not on greenfield code generation. The team ships StarPO for the debugging route via Agent Command Center conditional routing, keeps SFT-only on the simpler route, and uses the eval signal to drive the next training cycle. FutureAGI is the eval scaffolding around RAGEN, not a replacement for it.
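
The cohort breakdown in that example can be sketched in a few lines of Python. This is an illustration of the idea only: `fail_rate_by_cohort` is a hypothetical helper, not a FutureAGI function, and the per-task results are toy data invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fail_rate_by_cohort(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of failed tasks per cohort, given (cohort, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for cohort, passed in results:
        totals[cohort] += 1
        if not passed:
            failures[cohort] += 1
    return {cohort: failures[cohort] / totals[cohort] for cohort in totals}


# Toy per-task outcomes for one checkpoint (not real eval data).
results = [
    ("debugging", True), ("debugging", True),
    ("debugging", False), ("debugging", True),
    ("greenfield", True), ("greenfield", False),
]
rates = fail_rate_by_cohort(results)
```

Running the same breakdown on two checkpoints and comparing the per-cohort rates is what shows where a gain is concentrated, which an aggregate completion number would hide.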

How to Measure or Detect It

RAGEN-trained agent checkpoints need trajectory-level measurement, not single-turn:

  • TaskCompletion: 0–1 score for whether the agent reached the user’s goal across the full trajectory.
  • TrajectoryScore: composite trajectory rating; the canonical RAGEN-paired metric.
  • ReasoningQuality: scores whether the agent’s chain-of-thought is logically valid given observations.
  • StepEfficiency: catches the “more capable but wasteful” regression that RL can introduce.
  • ToolSelectionAccuracy: scores tool choices step-by-step.
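  • Per-checkpoint regression delta: alert on >2% drop on any of the above vs. the prior checkpoint.

# Trajectory-level evaluators from the fi.evals package.
from fi.evals import TaskCompletion, TrajectoryScore, StepEfficiency

task = TaskCompletion()
trajectory = TrajectoryScore()
efficiency = StepEfficiency()

# trace_spans and user_goal are supplied by the caller: the recorded spans
# for one agent run and the goal that run was meant to achieve.
result = trajectory.evaluate(trajectory=trace_spans, goal=user_goal)
print(result.score, result.reason)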
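
The per-checkpoint regression delta from the list above can be sketched as a plain comparison of two scorecards. This is a minimal sketch under stated assumptions: the `regressions` helper is hypothetical, the 2% threshold is read as an absolute drop of 0.02 on a 0–1 scale, and the StepEfficiency values are toy numbers (the TaskCompletion values mirror the 64% and 81% example earlier).

```python
from typing import Dict, List


def regressions(
    prior: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.02,
) -> List[str]:
    """Names of metrics that dropped by more than max_drop versus the prior checkpoint."""
    flagged = []
    for metric, old_score in prior.items():
        new_score = current.get(metric)
        # A metric missing from the new scorecard is treated as a regression.
        if new_score is None or old_score - new_score > max_drop:
            flagged.append(metric)
    return flagged


# Scores on a 0-1 scale; StepEfficiency values are invented for illustration.
prior = {"TaskCompletion": 0.64, "StepEfficiency": 0.80}
current = {"TaskCompletion": 0.81, "StepEfficiency": 0.70}
flagged = regressions(prior, current)
```

Here the checkpoint improves TaskCompletion but is flagged on StepEfficiency, which is the "more capable but wasteful" pattern the list describes.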

RAGEN, StarPO, and the 2026 agent-training stack

In our 2026 evals, RAGEN-trained checkpoints behave very differently from SFT-only or single-turn RLHF policies. The table below maps each layer of the stack to the failure mode that shows up when teams ship RAGEN-trained agents against frontier benchmarks, and to the eval signal that surfaces it:

| Layer | What can go wrong | Eval signal that surfaces it |
|---|---|---|
| Trajectory rollout | Initial states too narrow | Held-out task completion cohort drop |
| Reward shaping | Shallow proxy reward | TrajectoryScore vs reward gap |
| StarPO update | Echo trap collapse | ReasoningQuality template collapse |
| Stabilization (RAGEN-D, RAGEN-S) | Diversity loss | Action-distribution entropy |
| Tool integration | Wrong tool selected | ToolSelectionAccuracy |
| Stop conditions | Infinite loops | StepEfficiency |
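
The action-distribution entropy signal in the table can be approximated with Shannon entropy over the actions observed in a pool of trajectories. The sketch below is an assumption about how such a metric might be computed, not a FutureAGI or RAGEN implementation; `action_entropy` and the tool names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def action_entropy(actions: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical action distribution."""
    total = len(actions)
    if total == 0:
        return 0.0
    counts = Counter(actions)
    # Sum of p * log2(1/p) over every distinct action.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# A policy that spreads over four tools versus one that always calls the same tool.
diverse = ["search", "read_file", "run_tests", "edit_file"]
collapsed = ["search"] * 4

diverse_bits = action_entropy(diverse)
collapsed_bits = action_entropy(collapsed)
```

A steady fall in this value across checkpoints is one way to notice a policy drifting toward the "always call this one tool" collapse before completion metrics move.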

The 2026 reference numbers are clarifying: frontier models like Claude Opus 4.7 and GPT-5.1 score in the 55-70% range on τ-bench and 70-78% on SWE-Bench Verified, but a domain RAGEN checkpoint trained on a narrow customer-support environment can beat them on that environment while losing badly on GAIA Level 3 and OSWorld. Unlike a RAGEN-paper-only setup that focuses on reward curves, FutureAGI’s golden-dataset workflow asks the harder question: does the policy generalize across agent trajectories the training loop never saw? That keeps RAGEN-trained agents honest after they leave the lab.

Common Mistakes

  • Optimising RAGEN reward without out-of-distribution eval. Self-play converges; held-out eval is what reveals whether the policy generalises.
  • Skipping StepEfficiency. RL policies can be capable and wasteful; track step count alongside completion.
  • Reward gaming. If your reward model has gaps, RAGEN finds them; evaluate against a separate evaluator suite, not the reward model.
  • Ignoring the echo trap. Without RAGEN-D / RAGEN-S stabilization, self-play collapses; track diversity metrics on the trajectory pool.
  • Comparing RAGEN against single-turn RLHF on single-turn benchmarks. That is not the regime RAGEN was designed for; use multi-turn trajectory benchmarks.
  • Skipping the frontier baseline. Every RAGEN checkpoint should be compared with Claude Opus 4.7, GPT-5.1, and Gemini 3 Pro on the same golden dataset so the trained policy’s lift is real, not artifactual.
  • Using saturated benchmarks as a release filter. MMLU, HumanEval, and GSM8K are saturated in 2026; trajectory benchmarks like τ-bench, SWE-Bench Verified, and BFCL v3 are the meaningful gates.

Production handoff from a RAGEN training loop

In our 2026 evals, the handoff between a RAGEN training run and a production deployment is the riskiest moment in agent shipping. The team has a checkpoint that scored well on its training environment; the question is whether it survives τ-bench, SWE-Bench Verified, BFCL v3, and the team’s golden dataset against frontier baselines like Claude Opus 4.7, GPT-5.1, and Gemini 3 Pro. FutureAGI scores the checkpoint on the same evaluator cohort as the frontier baselines (TaskCompletion, TrajectoryScore, ReasoningQuality, StepEfficiency, ToolSelectionAccuracy) and applies a release gate that requires the trained policy to match or beat the frontier model on the routes it will serve. Unlike a single reward curve, that comparison is empirical and defensible.
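
A per-route release gate of that kind can be sketched as follows. This is a hedged illustration rather than FutureAGI's gate: the `release_gate` function, the `Scorecard` shape, and all scores are hypothetical, and "match or beat" is interpreted as greater than or equal on every evaluator the frontier baseline was scored on.

```python
from typing import Dict, List

# route name -> evaluator name -> score on a 0-1 scale (assumed layout)
Scorecard = Dict[str, Dict[str, float]]


def release_gate(trained: Scorecard, frontier: Scorecard, routes: List[str]) -> Dict[str, bool]:
    """For each route, True only if the trained policy matches or beats the frontier on every evaluator."""
    decisions = {}
    for route in routes:
        trained_scores = trained.get(route, {})
        frontier_scores = frontier[route]
        decisions[route] = all(
            # A missing trained score counts as a failure for that evaluator.
            trained_scores.get(metric, float("-inf")) >= score
            for metric, score in frontier_scores.items()
        )
    return decisions


# Toy scorecards invented for the example.
trained = {
    "debugging": {"TaskCompletion": 0.81, "TrajectoryScore": 0.75},
    "greenfield": {"TaskCompletion": 0.60, "TrajectoryScore": 0.70},
}
frontier = {
    "debugging": {"TaskCompletion": 0.78, "TrajectoryScore": 0.72},
    "greenfield": {"TaskCompletion": 0.74, "TrajectoryScore": 0.71},
}
decisions = release_gate(trained, frontier, ["debugging", "greenfield"])
```

Gating per route rather than on an aggregate lets a checkpoint ship where it is actually ahead while the baseline keeps serving the routes where it is not.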

Frequently Asked Questions

What is RAGEN?

RAGEN is an open-source reinforcement-learning framework for training LLM agents in multi-turn stochastic environments, built around the StarPO algorithm for full-trajectory policy optimization.

How is RAGEN different from RLHF?

RLHF optimises single-turn responses against a preference reward. RAGEN optimises full multi-step trajectories (state, thinking, action, reward) for agents that interact with environments over many turns, where credit assignment is the hard problem.

How does FutureAGI fit into a RAGEN training loop?

RAGEN trains the agent; FutureAGI evaluates it. Run TaskCompletion, TrajectoryScore, and ReasoningQuality on the trained checkpoint against your golden dataset, and use the eval delta to gate every training run.