What Is StarPO?

A trajectory-level reinforcement-learning framework that optimizes state, thought, action, and reward sequences for LLM agents.

StarPO (State-Thinking-Actions-Reward Policy Optimization) is an agent-training method that optimizes a full multi-turn trajectory, not just a single response or isolated action. In an agent training workflow, it treats states, reasoning traces, tool actions, observations, and rewards as one sequence, then updates the policy toward higher trajectory-level return. FutureAGI does not expose StarPO as a native optimizer; teams use trajectory traces and agent evaluators to study the same long-horizon reliability questions and validate trained checkpoints before rollout.

StarPO sits in a 2025-2026 research family alongside RAGEN, agent-specific GRPO variants, and process-reward training. Most production teams do not train with StarPO directly; they evaluate StarPO-derived or StarPO-inspired checkpoints to make sure trajectory-level wins do not regress on real workflows. By May 2026, every major frontier lab has published agent-training work in the trajectory-RL family, including DeepMind’s process-reward research, Anthropic’s agent-tuning notes, and OpenAI’s o-series reasoning posts. StarPO remains the cleanest published recipe for the open-weights community.

Why StarPO matters in production LLM and agent systems

StarPO matters because agent failures are usually path-dependent. A planner can take a weak first step, call a nearly correct tool, misread the observation, and still produce a confident final answer. If training only rewards the final message, the model can learn shallow shortcuts: repeat a high-reward action, invent reasoning that sounds useful, or stop early because the benchmark reward is easier than the real task. This is a textbook reward-hacking failure mode.

The pain lands on several teams. ML engineers see unstable reward curves and gradient spikes when long-horizon rollouts have sparse or noisy rewards. Platform engineers see agents that look good in offline single-turn evals but fail when a real environment responds with partial state. Product teams see user-visible symptoms: repeated tool calls, inconsistent plans, incorrect task-completion claims, and sudden behavior changes after a prompt or model update.

This is especially relevant for 2026-era agent stacks because a user request often becomes a planner call, retrieval call, tool call, verifier call, and finalizer call. Unlike PPO, GRPO, or DPO-style single-turn preference training, StarPO is about credit assignment across the full interaction path. The reliability question is not only “was the final answer preferred?” It is “which earlier reasoning or action caused the final outcome?”
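To make the credit-assignment question concrete, here is a minimal sketch. It is not StarPO's published update rule, only a generic discounted return-to-go computed over one recorded trajectory. The `Step` type, its field names, the sample data, and the discount value are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One turn of a hypothetical agent trajectory (illustrative fields only)."""
    state: str
    thought: str
    action: str
    reward: float


def returns_to_go(steps: List[Step], gamma: float = 0.9) -> List[float]:
    """Discounted sum of rewards from each step to the end of the trajectory."""
    out = [0.0] * len(steps)
    running = 0.0
    # Walk backwards so each step accumulates the rewards that follow it.
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        out[i] = running
    return out


# Sparse reward: only the final checkout step is rewarded.
trajectory = [
    Step("cart empty", "need stock info", "call_inventory", 0.0),
    Step("stock known", "item fits budget", "add_to_cart", 0.0),
    Step("cart ready", "confirm and pay", "checkout", 1.0),
]
credit = returns_to_go(trajectory)
```

Even with a single terminal reward, each earlier step receives a discounted share of it, which is the simplest way to express that earlier reasoning and actions contributed to the final outcome.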

How FutureAGI handles StarPO

FutureAGI’s approach is to treat StarPO as a conceptual training method, not a native FutureAGI optimizer surface. There is no StarPO endpoint in FutureAGI. The practical workflow is to map the same state, reasoning, action, and reward idea onto production traces and offline eval datasets: traceAI-openai-agents or traceAI-langchain records each step with agent.trajectory.step, while TrajectoryScore, TaskCompletion, and ToolSelectionAccuracy score whether the path actually reached the goal.

StarPO component | Production analog | FutureAGI evaluator
State | Span context, retrieval state | ContextRelevance
Thinking | Reasoning trace, planner output | CustomEvaluation rubric
Actions | Tool calls, function calls | ToolSelectionAccuracy
Reward | Task success on golden dataset | TaskCompletion
Trajectory | Full step sequence | TrajectoryScore
Credit assignment | Per-step span attribution | agent.trajectory.step

For example, a shopping agent trained with StarPO-style rollouts might visit a product page, compare constraints, call inventory, ask a clarification question, and complete checkout. In FutureAGI, the engineer stores those runs as trajectory rows, groups spans by agent.trajectory.step, and evaluates the full trace. TrajectoryScore catches a path that reaches checkout through an unsafe workaround. ToolSelectionAccuracy checks whether each tool call was the right move at the right step. TaskCompletion confirms the final goal was achieved.
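The grouping step described above might look like the following sketch. It assumes spans have already been exported as plain dictionaries with an `attributes` mapping and that `agent.trajectory.step` holds an integer step index. The span shape and the `group_by_step` helper are hypothetical and are not part of the traceAI API.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_step(spans: List[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Bucket exported spans by their agent.trajectory.step attribute."""
    grouped: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for span in spans:
        step = span.get("attributes", {}).get("agent.trajectory.step")
        if step is None:
            continue  # skip spans that are not part of a trajectory
        grouped[step].append(span)
    # Return a plain dict with steps in ascending order.
    return dict(sorted(grouped.items()))


# Hypothetical exported spans from one shopping-agent run.
spans = [
    {"name": "checkout", "attributes": {"agent.trajectory.step": 2}},
    {"name": "view_product", "attributes": {"agent.trajectory.step": 0}},
    {"name": "call_inventory", "attributes": {"agent.trajectory.step": 1}},
    {"name": "http.request", "attributes": {}},
]
steps = group_by_step(spans)
```

Once spans are bucketed per step, each bucket can be handed to whichever evaluator fits that step, for example tool-choice scoring for tool spans.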

The engineer’s next move is operational, not philosophical. If StarPO-trained candidates improve final reward but fail on checkout edge cases, the team does not ship the model. They add reward shaping around unsafe actions, replay the same dataset, and compare eval-fail-rate-by-cohort. If a candidate has higher TaskCompletion but doubles llm.token_count.prompt, the rollout goes through another cost-quality review before deployment.
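The decision logic in that paragraph can be written down as a small gate. This is a sketch under stated assumptions: the `EvalSummary` shape, the `rollout_decision` helper, the sample numbers, and the 2x token threshold are hypothetical, and the metrics are assumed to be pre-aggregated per checkpoint.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Aggregate metrics for one checkpoint (hypothetical shape)."""
    task_completion: float      # mean TaskCompletion score, 0..1
    edge_case_fail_rate: float  # eval fail rate on the edge-case cohort, 0..1
    prompt_tokens: float        # mean llm.token_count.prompt per trace


def rollout_decision(base: EvalSummary, cand: EvalSummary,
                     max_token_ratio: float = 2.0) -> str:
    """Return 'block', 'review', or 'promote' for a candidate checkpoint."""
    # A regression on the edge-case cohort blocks the rollout outright.
    if cand.edge_case_fail_rate > base.edge_case_fail_rate:
        return "block"
    # No gain in task completion means there is nothing to ship.
    if cand.task_completion <= base.task_completion:
        return "block"
    # Better quality at roughly double the prompt cost needs a human review.
    if cand.prompt_tokens >= max_token_ratio * base.prompt_tokens:
        return "review"
    return "promote"


baseline = EvalSummary(task_completion=0.80, edge_case_fail_rate=0.05, prompt_tokens=1000)
regressed = rollout_decision(baseline, EvalSummary(0.85, 0.10, 1000))
costly = rollout_decision(baseline, EvalSummary(0.85, 0.05, 2100))
clean = rollout_decision(baseline, EvalSummary(0.85, 0.04, 1100))
```

Keeping the gate as code makes the cost-quality trade-off explicit and repeatable across checkpoints, though real thresholds would need to be set per product surface.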

We’ve found StarPO-style training improves agent performance reliably on trajectories with clear environment feedback (shopping, support workflows with structured outcomes, SWE-Bench style code tasks) and improves it less reliably on open-ended creative tasks where the reward signal is fuzzier. The right rollout pattern is therefore narrow: ship StarPO checkpoints first on closed-loop agent surfaces, and hold them back on creative-writing or persona-driven applications until the reward signal there is honest.

How to measure or detect StarPO behavior

Measure StarPO through trajectory quality, reward stability, and deployment regressions:

  • TrajectoryScore: evaluates the full action path, so bad intermediate choices are visible even when the final answer looks acceptable.
  • TaskCompletion: checks whether the trajectory achieved the user’s goal, not merely produced a fluent response.
  • ToolSelectionAccuracy: scores per-step tool choice, the most common failure surface in trajectory training.
  • agent.trajectory.step: groups failures by planner, tool, retriever, verifier, or finalizer span.
  • Dashboard signals: trajectory reward variance, eval-fail-rate-by-cohort, p99 steps-per-task, token-cost-per-trace, and rollback rate after model promotion.
  • τ-bench / GAIA / OSWorld: the 2026 trajectory benchmarks worth pairing with StarPO-trained candidates before rollout.

Minimal Python:

from fi.evals import TrajectoryScore, TaskCompletion, ToolSelectionAccuracy

# user_goal, trajectory, and final_answer come from your own recorded agent run.
# Score the full action path, the final outcome, and the per-step tool choice.
path = TrajectoryScore().evaluate(input=user_goal, output=trajectory)
task = TaskCompletion().evaluate(input=user_goal, output=final_answer)
tools = ToolSelectionAccuracy().evaluate(trajectory=trajectory, expected_tool="inventory")
print(path.score, task.score, tools.score)
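Several of the dashboard signals listed above can be approximated from per-run records with the standard library alone. The sketch below assumes each run is a dictionary with `reward`, `steps`, and `tokens` keys; that record shape, the helper names, and the sample data are hypothetical, and mean tokens per trace stands in for token cost.

```python
import math
from statistics import mean, pvariance
from typing import Dict, List


def p99(values: List[float]) -> float:
    """Nearest-rank 99th percentile; expects a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def dashboard_signals(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Summarize reward spread, tail step count, and token usage."""
    return {
        "reward_variance": pvariance([r["reward"] for r in runs]),
        "p99_steps_per_task": p99([r["steps"] for r in runs]),
        "mean_tokens_per_trace": mean([r["tokens"] for r in runs]),
    }


# Hypothetical per-run records from an offline eval batch.
runs = [
    {"reward": 1.0, "steps": 4, "tokens": 1200},
    {"reward": 0.0, "steps": 9, "tokens": 3100},
    {"reward": 1.0, "steps": 5, "tokens": 1500},
    {"reward": 1.0, "steps": 6, "tokens": 1800},
]
signals = dashboard_signals(runs)
```

With only a handful of runs the nearest-rank p99 is simply the maximum, so this signal becomes meaningful only on batches of a hundred runs or more.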

Common mistakes

  • Optimizing only terminal reward. Sparse final rewards can hide bad reasoning, unsafe actions, or lucky environment states.
  • Confusing StarPO with StaRPO. StarPO is state-thinking-actions-reward trajectory optimization; similarly named stability methods solve a different reasoning-control problem.
  • Treating thoughts as ground truth. Reasoning text is model output, so score it against state transitions and actions before using it as reward evidence.
  • Ignoring rollout freshness. Agents overfit stale environments; refresh states and tasks before declaring trajectory learning stable.
  • Shipping reward gains without trace checks. A higher average reward is not enough if ToolSelectionAccuracy or TrajectoryScore regress.
  • Benchmarking only on saturated single-turn suites. Gains from trajectory training show up on τ-bench, SWE-Bench Verified, and GAIA, not on MMLU.
  • Letting the reward signal drift during training. A judge model used inside the reward loop must be pinned to a snapshot; vendor updates mid-training are an under-reported cause of reward-curve instability.
  • Mixing in-distribution and out-of-distribution rollouts when reporting. OOD rollouts often look worse because the reward shaping was tuned for in-distribution; separate the dashboards.
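For the last point in the list above, a minimal sketch of split reporting follows. It assumes each rollout is already tagged as in-distribution or out-of-distribution; the tag names, the tuple shape, and the `mean_reward_by_split` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def mean_reward_by_split(rollouts: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average reward per distribution tag instead of one blended number."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for split, reward in rollouts:
        buckets[split].append(reward)
    return {split: mean(rewards) for split, rewards in buckets.items()}


# Hypothetical tagged rollouts: (distribution tag, trajectory reward).
rollouts = [("in_dist", 1.0), ("in_dist", 0.8), ("ood", 0.4), ("ood", 0.2)]
report = mean_reward_by_split(rollouts)
blended = mean(reward for _, reward in rollouts)
```

In this toy data the blended average sits between the two splits and hides how much weaker the out-of-distribution rollouts are, which is why the two groups are reported separately.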

Frequently Asked Questions

What is StarPO?

StarPO is State-Thinking-Actions-Reward Policy Optimization, a trajectory-level reinforcement-learning framework for LLM agents. It optimizes the whole multi-turn run rather than one response or tool call.

How is StarPO different from PPO?

PPO is a general policy-optimization algorithm often adapted to single-turn LLM preference training. StarPO is framed around the full agent trajectory, including state, reasoning, action, and reward transitions.

How do you measure StarPO?

In FutureAGI, use trajectory-level signals such as agent.trajectory.step plus TrajectoryScore, TaskCompletion, and ToolSelectionAccuracy to inspect long-horizon agent behavior.