Wednesday, March 11, 2026

Nvidia researchers boost LLM reasoning skills by making models “think” during pre-training


Nvidia researchers have developed a novel technique that flips the script on how large language models (LLMs) learn to reason.

The method, called reinforcement learning pre-training (RLP), incorporates RL into the initial phase of training rather than leaving it until the end.

This approach encourages the model to “think independently before predicting what comes next, thus teaching independent thinking earlier in pre-training,” the researchers write in their paper.

By learning to reason on plain text without the need for external verifiers, models trained with RLP show significant improvements on intricate downstream reasoning tasks, pointing to a future of more capable and adaptable AI for real-world tasks.

Typical LLM training cycle

Typically, large language models are first pre-trained on huge amounts of text with a “next-token prediction” objective: given a string of text, the model repeatedly predicts the next word (or token). At this stage it learns grammar, facts and basic associations.
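To make the objective concrete, here is a toy, illustrative sketch of next-token prediction. The probability table and tokens are invented for illustration; in a real LLM these probabilities come from a neural network, and training minimizes this loss over billions of tokens:

```python
import math

# Hypothetical conditional probabilities a model assigns to candidate
# next tokens, given a two-token context (invented numbers).
next_token_probs = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "the": 0.1},
}

def nll(context, actual_next):
    """Negative log-likelihood of the token that actually follows.

    Pre-training minimizes this quantity, summed over the corpus.
    """
    return -math.log(next_token_probs[context][actual_next])

# A likely continuation incurs a small loss; an unlikely one a large loss.
loss_good = nll(("the", "cat"), "sat")  # -log 0.6, about 0.51
loss_bad = nll(("the", "cat"), "the")   # -log 0.1, about 2.30
```

Minimizing this loss pushes probability mass toward continuations that actually occur in the training text, which is how the model absorbs grammar, facts and associations.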

Later, in the post-training phase, models typically learn intricate reasoning skills such as chain-of-thought (CoT), where the model lays out its reasoning step by step. This phase often involves supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), which require specialized, curated datasets.

The paper’s authors argue that this sequential process does not correspond to human understanding, which is not “a linear, token-by-token process, but rather a parallel integration of input with prior knowledge.” Existing pre-training methods lack this mechanism, hindering the model’s ability to develop deep reasoning from the beginning.

How reinforcement learning pre-training works

RLP changes this process by treating CoT generation as an action the model takes before predicting the next token. At each step, the model first generates an internal chain of “thoughts,” or reasoning. It then predicts the next word in the text using the original context augmented with that thought.

The model receives a reward based on how much its thought improved its prediction accuracy compared with a baseline that generated no thought and simply predicted the next token. This reward signal is computed automatically from the change in probability, eliminating the need for external verifiers or human-labeled data.

The reward is only positive when the generated thought helps the model better predict the next token. By rewarding thoughts based on their predictive benefits, RLP effectively teaches the model how to think usefully on the same massive, unstructured datasets used in standard pre-training.
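The reward described above can be sketched as a log-probability gain. This is a simplified, illustrative version: the probabilities passed in are hypothetical model outputs, and the paper's actual baseline is more elaborate than a single no-thought comparison.

```python
import math

def rlp_reward(p_with_thought: float, p_baseline: float) -> float:
    """Reward for a generated thought, as a log-probability gain.

    p_with_thought: probability the model assigns to the true next token
                    when conditioning on its generated thought.
    p_baseline:     probability assigned with no thought (plain
                    next-token prediction).

    Positive only if the thought made the correct token more likely.
    """
    return math.log(p_with_thought) - math.log(p_baseline)

# A helpful thought raises the next token's probability: positive reward.
helpful = rlp_reward(p_with_thought=0.40, p_baseline=0.25)
# An unhelpful thought lowers it: negative reward.
unhelpful = rlp_reward(p_with_thought=0.10, p_baseline=0.25)
```

Because both probabilities come from the model itself on ordinary text, the signal is available on any pre-training corpus, with no external verifier in the loop.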

This constant feedback loop allows the model to learn when plain predictions are enough and when deeper reasoning is needed. As the researchers put it, “RLP aims to shape thinking in foundation models by rewarding only those thoughts that measurably help predict the next token.”

However, this approach does not make later tuning steps obsolete. According to Bryan Catanzaro, vice president of applied deep learning research at Nvidia and co-author of the paper, RLP is intended to complement, not replace, those key steps. “RLP is not intended to replace later post-training steps such as supervised fine-tuning or reinforcement learning from human feedback,” Catanzaro told VentureBeat. “These stages remain crucial to improving the behavior of the model…The goal is really to increase the effectiveness of those later phases by giving the model a head start.”

RLP in action

In experiments with Qwen3-1.7B and Nemotron-Nano-12B, the Nvidia team tested RLP on a range of math and science benchmarks. The results show that models boosted with RLP consistently outperformed their conventionally trained counterparts, with particularly large gains on reasoning-intensive tasks.

For the enterprise, this improved reasoning can translate into more reliable results in multi-step workflows such as financial analysis or legal document summarization.

“RLP encourages the model to think before it makes predictions during pre-training, helping the model internalize a more consistent reasoning style,” Catanzaro said. “This can help reduce subtle logical errors, especially in longer workflows.”

Stressing that RLP-trained models will still need the usual safeguards such as verification layers, human supervision and consistency checks, Catanzaro stated that “RLP provides a stronger baseline.”

Importantly, the benefits of RLP compound rather than disappear during subsequent fine-tuning stages (catastrophic forgetting, where later stages of training cause the model to lose previously learned skills and knowledge, is a common problem in LLM training). The model trained with RLP achieved an overall score 7–8% higher than baselines after an identical post-training regimen. The researchers concluded that RLP “establishes a solid foundation of reasoning that is not erased by subsequent fine-tuning but is instead compounded by post-training.”

A key finding is the efficiency of the technique. On the Qwen3-1.7B model, RLP improved performance by 17% over standard continuous pre-training and also outperformed a similar technique, reinforcement pre-training via prefix-matching rewards (RPT). The advantage persisted even when the baseline was trained on 35 times more data to match the computational cost, confirming that the gains come from the method itself and not just from more processing.

Moreover, RLP demonstrates impressive scalability and versatility, effectively extracting signal from general internet data rather than only curated datasets. When applied to Nemotron-Nano-12B, a hybrid Mamba-Transformer model, RLP achieved a 35% relative improvement over a heavily trained baseline while using only a fraction of the data.

While these results point to a more efficient path to building capable models, Catanzaro sees the innovation as a fundamental change in the learning process itself rather than an immediate fix for high training costs.

“This study is exciting because it offers a change in the way models absorb information during pre-training, leading to smarter learning,” he explained. “It wouldn’t replace large-scale initial training, but it would provide another creative method for building the best possible models.”

A new foundation for AI training

Ultimately, RLP points to a future where pre-training is no longer a monolithic process of predicting the next token. Instead, the next generation of models could be built around a hybrid of objectives, creating AI that learns more robust thinking from day one. Catanzaro offers an analogy to illustrate the change:

“Predicting the next token teaches the model what the world looks like; reinforcement-style objectives like RLP can teach it how to think about what it sees,” he said. “Combining these two objectives can help models develop deeper, more structured thinking early in their training…Tools like RLP can build on this foundation, making learning more active, more interesting and even more effective.”

There is still a lot to learn about the dynamics of reinforcement learning in the pre-training phase, but it seems clear that “introducing exploration earlier in training opens up a new axis of scaling – not only in terms of size, but also in how models learn to reason,” Catanzaro said.
