Although large language models (LLMs) are becoming increasingly capable at complex tasks, they often fail to produce the right answer on the first try. This has fueled growing interest in enabling LLMs to spot and correct their own errors, a capability known as “self-correction”. However, current attempts at self-correction are narrow and make assumptions that often cannot be met in real-world situations.
In a recent paper, researchers from Google DeepMind introduce Self-Correction via Reinforcement Learning (SCoRe), a novel technique that significantly improves the self-correction capabilities of LLMs using only self-generated data. SCoRe can be a valuable tool for raising the reliability and trustworthiness of LLMs, and it opens up new opportunities to improve their reasoning and problem-solving abilities.
Importance of self-correction in LLMs
“Self-correction is a skill that greatly improves human thinking,” Aviral Kumar, a research scientist at Google DeepMind, told VentureBeat. “People often spend more time thinking, trying out many ideas, correcting their mistakes, and finally solving a challenging question, rather than simply producing solutions in one go. We wish LLMs could do the same.”
Ideally, an LLM with strong self-correction capabilities should be able to review and refine its own answers until it arrives at the correct one. This is particularly important because LLMs often internally possess the knowledge needed to solve a problem but fail to apply it effectively when generating their initial response.
“From a fundamental machine learning perspective, no LLM is expected to solve difficult problems zero-shot from memory (certainly no human can), so we want LLMs to spend more time thinking, computing and refining their answers to succeed on difficult problems,” Kumar said.
Previous attempts to enable self-correction in LLMs have relied on prompt engineering or fine-tuning models specifically for self-correction. These methods typically assume that the model can receive external feedback on the quality of its results or has access to an “oracle” that can guide the self-correction process.
These techniques do not take advantage of the model’s intrinsic self-correction capabilities. Supervised fine-tuning (SFT) methods, which involve training a model to fix the errors of a base model, have also shown limitations. They often require oracle feedback from human annotators or stronger models, and they do not rely on the model’s own knowledge. Some SFT methods even require multiple models at inference time to verify and refine the answer, making them difficult to implement and deploy.
Additionally, DeepMind’s research shows that while SFT methods can improve a model’s initial responses, they do not perform well when the model must revise its responses over multiple steps, as is often the case with complex problems.
“It may be the case that at the end of training, the model knows how to fix the base model’s errors, but may not have enough capabilities to detect its own errors,” Kumar said.
Another challenge with SFT is that it can lead to unintended behavior, such as the model learning to produce its best answer on the first attempt and leaving it unchanged in subsequent steps, even when it is incorrect.
“We found that the behavior of SFT-trained models largely comes down to this ‘direct’ strategy rather than learning how to self-correct,” Kumar said.
Self-correction through reinforcement learning
To overcome the limitations of previous approaches, DeepMind researchers turned to reinforcement learning (RL).
“Today’s LLMs can’t do that [self-correction], as shown by previous studies examining self-correction. This is a fundamental issue,” Kumar said. “LLMs are not trained to look back and introspect on their mistakes; they are trained to produce the best answer to a question. So we started developing methods for self-correction.”
SCoRe trains a single model to both generate responses and correct its own errors without relying on external feedback. Importantly, SCoRe achieves this by training the model entirely on self-generated data, eliminating the need for external knowledge.
Previous attempts to apply RL for self-correction have relied primarily on single-turn interactions, which can lead to undesirable results such as a model that focuses solely on the final response and ignores the intermediate steps that drive self-correction.
“We see… ‘behavior breakdown’ in LLMs trained to self-correct with naive RL. The model learned to simply ignore the instruction to self-correct and produce its best answer from memory, zero-shot, without learning to self-correct,” Kumar said.
To prevent behavior breakdown, SCoRe uses a two-stage training process with regularization techniques. In the first stage, SFT is replaced by a process that optimizes correction performance while ensuring that the model’s first attempts remain close to the base model’s outputs.
The second stage uses multi-turn RL to optimize the reward on both the first and subsequent attempts, while including a reward bonus that encourages the model to improve its response from the first attempt to the second.
“Both the initialization and the reward bonus ensure that the model cannot simply learn to provide the best response on the first try and only slightly edit it,” the researchers write. “Overall, SCoRe is able to extract knowledge from the base model to enable positive self-correction.”
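As a rough illustration of that reward-bonus idea, the shaped reward for a two-attempt episode can be sketched as follows. The function name, the 0/1 correctness rewards, and the bonus coefficient `alpha` are illustrative assumptions, not the paper’s exact formulation:

```python
def shaped_reward(first_correct: bool, second_correct: bool,
                  alpha: float = 1.0) -> float:
    """Toy sketch of a SCoRe-style shaped reward.

    The second attempt earns the base reward, plus a bonus proportional
    to the improvement over the first attempt. This penalizes a policy
    that simply repeats a good first answer (no improvement, no bonus)
    or degrades a correct answer (negative bonus).
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    return r2 + alpha * (r2 - r1)

# A trajectory that flips a wrong first answer to a correct one
# earns more than one that was already right and stayed right:
print(shaped_reward(False, True))   # 2.0
print(shaped_reward(True, True))    # 1.0
print(shaped_reward(True, False))   # -1.0
```

Under this shaping, the highest-reward strategy is genuine correction rather than producing the best possible answer up front and leaving it untouched.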
SCoRe in action
DeepMind researchers evaluated SCoRe against existing methods that use self-generated data for self-correction training. They focused on math and coding tasks, using benchmarks such as MATH, MBPP, and HumanEval.

The results showed that SCoRe significantly improved the self-correction capabilities of Gemini 1.0 Pro and 1.5 Flash models. For example, SCoRe achieved an absolute self-correction gain of 15.6% on MATH and 9.1% on HumanEval over the base model, beating other self-correction methods by several percentage points.
The most noticeable improvement was the model’s ability to correct its errors from the first to the second attempt. SCoRe also significantly reduced the instances in which the model mistakenly changed the correct answer to an incorrect one, indicating that it had learned to apply corrections only when necessary.
Moreover, SCoRe proved highly effective when combined with inference-time scaling strategies such as self-consistency. By splitting the same inference budget across multiple rounds of correction, SCoRe enabled further performance gains.
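A minimal sketch of how a fixed inference budget might be split between sequential correction rounds and self-consistency voting. The `toy_model` stub, the prompt format, and the budget split are assumptions for illustration, not DeepMind’s implementation:

```python
from collections import Counter

def self_correct_then_vote(prompt, model, num_samples=5, num_rounds=2):
    """Spend a budget of num_samples * num_rounds model calls:
    each sampled answer is revised over (num_rounds - 1) correction
    rounds, then the final answers are majority-voted."""
    finals = []
    for _ in range(num_samples):
        answer = model(prompt)
        for _ in range(num_rounds - 1):
            answer = model(prompt + "\nPrevious answer: " + answer +
                           "\nRevise your answer if it contains errors.")
        finals.append(answer)
    # Self-consistency: return the most common final answer
    return Counter(finals).most_common(1)[0][0]

# Toy stand-in for a self-correcting model: answers "3" on a fresh
# prompt, and fixes itself to "4" when asked to revise.
def toy_model(prompt):
    return "4" if "Previous answer" in prompt else "3"

print(self_correct_then_vote("What is 2 + 2?", toy_model))  # 4
```

The design point is that, for a model trained to self-correct, spending part of the sampling budget on revision rounds can beat spending the whole budget on parallel samples alone.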

Although the paper focuses mainly on coding and reasoning tasks, the researchers believe SCoRe could be beneficial in other applications as well.
“You could imagine training models to look at outputs that could potentially be unsafe and improve them on their own before showing them to the user,” Kumar said.
The researchers believe their work has broader implications for training LLMs, and it highlights the importance of teaching models to reason and improve on their own rather than simply mapping inputs to outputs.
