Researchers from Google Cloud and the University of California have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised reinforcement learning (SRL) reframes problem solving as a sequence of logical “actions,” providing rich learning signals during training.
This approach enables smaller models to learn difficult problems that were previously beyond the reach of other common training techniques. Experiments show that SRL not only excels on mathematical reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.
SRL is a versatile training framework that can elevate smaller, cheaper models to stronger reasoning abilities.
The limits of current LLM reasoning training
Recent progress in training large language models (LLMs) for reasoning is largely due to reinforcement learning with verifiable rewards (RLVR), a method in which the model is rewarded based on the correctness of its final answer. By repeatedly attempting problems and receiving feedback on the final result, the model gradually learns effective problem-solving strategies.
However, the success of this outcome-based approach depends on the model’s ability to find the correct solution within a limited number of attempts, or “rollouts.” Since each rollout is computationally expensive, models cannot keep trying indefinitely. The method falls short when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.
This creates a critical learning bottleneck. In many multi-step reasoning problems, the model may solve several steps correctly, yet a single error derails it and leads to a wrong final answer. Under RLVR, all of that effort earns no reward, and the model learns nothing from its partially correct work. It is an all-or-nothing approach that provides no granular feedback and only sparse rewards.
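The all-or-nothing problem can be made concrete with a minimal sketch. This is a hypothetical illustration of an outcome-only reward, not code from the paper; the function name and answer format are invented:

```python
# Hypothetical sketch of RLVR's all-or-nothing outcome reward.
# The function name and answer format are illustrative, not from the paper.

def rlvr_reward(final_answer: str, correct_answer: str) -> float:
    """Reward depends only on the final answer: 1.0 if correct, else 0.0.

    A trajectory that gets nine of ten steps right but slips on the
    last one earns exactly the same reward as pure guessing.
    """
    return 1.0 if final_answer.strip() == correct_answer.strip() else 0.0

print(rlvr_reward("42", "42"))    # 1.0
print(rlvr_reward("42.5", "42"))  # 0.0 — no credit for partial progress
```

Because the signal collapses to a single 0-or-1 number per attempt, partially correct reasoning contributes nothing to the gradient, which is exactly the gap SRL targets.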
An alternative method is supervised fine-tuning (SFT), in which the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting: the model simply learns to imitate the trajectories in the training data rather than generalizing to problems beyond the examples it has seen. The problem is exacerbated by the fact that high-quality, human-created training data is both scarce and expensive to produce.
As the paper notes, these limitations leave “a critical gap in training small open source models to effectively learn difficult problems.”
How supervised reinforcement learning works
SRL introduces a framework that reframes problem solving as a “sequential decision-making process,” striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to mimic the expert’s entire thought process, SRL trains the model to reproduce the sequence of key actions that form the backbone of the expert’s reasoning. This lets the model learn to act like an expert while developing its own internal reasoning style.
In SRL, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing one meaningful step. In a math problem, an action might be an algebraic manipulation. For a software engineering agent, it might be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.
According to I-Hung Hsu, a researcher at Google and co-author of the paper, this middle-ground approach is the key to its effectiveness in real-world scenarios. “SRL is in the middle: it captures the structured flexibility of solving real-world problems where there are many correct strategies, but also clear ideas about what ‘good reasoning’ looks like at every step,” Hsu told VentureBeat. “This makes SRL suitable for fields such as data analysis automation or perhaps supply chain optimization – tasks that reward sound intermediate reasoning rather than mere definitive answers.”
During training, the model first generates an “inner monologue” (its internal reasoning process) before committing to an action. At each step, SRL provides a reward based on how closely the model’s predicted action matches the expert’s, supplying dense feedback even when the overall solution is not perfect.
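A minimal sketch of that step-wise signal follows. The paper’s actual reward function is not specified here; `difflib.SequenceMatcher` stands in as an assumed string-similarity metric, and all names are illustrative:

```python
import difflib

def step_reward(predicted_action: str, expert_action: str) -> float:
    """Dense per-step reward: similarity between the model's action and
    the expert's action at the same step, in [0.0, 1.0]."""
    return difflib.SequenceMatcher(None, predicted_action, expert_action).ratio()

def trajectory_rewards(predicted: list, expert: list) -> list:
    """Score each step independently, so partially correct work still
    yields a learning signal instead of a single 0/1 outcome."""
    return [step_reward(p, e) for p, e in zip(predicted, expert)]

expert = ["expand (x+1)^2", "collect like terms", "solve for x"]
predicted = ["expand (x+1)^2", "collect terms", "guess x = 3"]
print(trajectory_rewards(predicted, expert))
```

Even with a wrong final step, the first two steps earn high rewards, which is the partial credit that outcome-only RLVR cannot provide.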
SRL in action
The researchers’ experiments show that SRL significantly outperforms strong baselines on both challenging math benchmarks and agentic software engineering tests. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without merely lengthening the output.
For business leaders, performance gains are only valuable if they don’t come with runaway costs. Hsu clarifies that models trained with SRL are more efficient in their reasoning. “The gains come from better reasoning quality and structure, not from verbosity,” he said. “In terms of efficiency, models trained with SRL are roughly comparable to the base model in token usage… although SRL is not designed to reduce inference cost, it achieves stronger reasoning performance without increasing it.”
For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models such as DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over the other methods.
The team then extended SRL to agentic software engineering, a domain critical to enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was compared against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a task resolution rate of 14.8%, a 74% relative improvement over the SFT-based model. This demonstrates SRL’s ability to train more competent AI agents for complex real-world programming tasks.
A new standard for high-stakes AI?
The best results in the paper came from combining the methods: first using SRL to teach fundamental reasoning, then using RLVR to refine that skill. In their experiments, when the researchers pre-trained with SRL and post-trained with RLVR, they observed a 3.7% average boost, demonstrating an effective curriculum-based training strategy.
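The two-stage curriculum can be sketched as a simple training schedule. This is a hypothetical outline, not the paper’s pipeline: the stage functions here are stand-ins that merely record what would run, and none of the names come from the paper.

```python
# Hypothetical sketch of the SRL-then-RLVR curriculum. The stage
# functions are stand-ins that only record the training schedule;
# none of these names come from the paper.

def srl_stage(model: dict, expert_trajectories: list) -> dict:
    """Stage 1: dense, step-wise supervision from expert action traces."""
    model["schedule"].append(("srl", len(expert_trajectories)))
    return model

def rlvr_stage(model: dict, problems: list) -> dict:
    """Stage 2: outcome-based refinement on problems with verifiable answers."""
    model["schedule"].append(("rlvr", len(problems)))
    return model

def train_curriculum(expert_trajectories: list, problems: list) -> dict:
    model = {"schedule": []}
    model = srl_stage(model, expert_trajectories)  # first learn to reason step by step
    model = rlvr_stage(model, problems)            # then sharpen final-answer accuracy
    return model

model = train_curriculum(["traj_a", "traj_b"], ["problem_1"])
print(model["schedule"])  # [('srl', 2), ('rlvr', 1)]
```

The ordering is the point: the dense SRL stage gives the model a reasoning scaffold, so the sparse-reward RLVR stage starts from trajectories that already reach correct answers often enough to learn from.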
This raises the question of whether the approach could become a new blueprint for building specialized AI.
“We see SRL as a strong foundation,” Hsu said. “In a sense, SRL provides a curriculum—teaching models to think and act step by step—before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage, but also makes reasoning more interpretable and generalizable, which is critical in high-stakes applications.”
Looking ahead, Hsu acknowledges that challenges remain in scaling this pipeline, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. Still, he is optimistic about the path forward. “While high-quality expert trajectories remain important,” he concluded, “we believe the next big leap will come from automating their generation and filtering – leveraging strong teacher models or even self-improving student models to bootstrap new data.”
