Friday, March 6, 2026

New method can boost the efficiency of LLM training


Developing reasoning models, however, requires a huge amount of computation and energy because the learning process is inefficient: while some high-powered processors grind through intricate queries, others in the cluster sit idle.

Researchers at MIT and elsewhere have found a way to harness this computational downtime to accelerate the training of reasoning models.

Their new method automatically trains a smaller, faster model whose predictions of the reasoning LLM’s output are then verified by the larger model. This reduces the amount of generation work the reasoning model must do itself, speeding up the training process.

The key to this system is its ability to adaptively train and deploy the smaller model so that it runs only when processors would otherwise sit idle. By soaking up computational resources that would otherwise go to waste, it speeds up training without adding overhead.

In tests on multiple reasoning LLMs, the method roughly doubled training speed while maintaining each model’s accuracy. This could cut the costs and improve the energy efficiency of developing advanced LLMs for applications such as forecasting financial trends or detecting threats in power grids.

“People want models that can handle more complex tasks. But if that is where model development is headed, we need to prioritize efficiency. We found a lossless solution to this problem, and then developed a full-stack system that can deliver quite dramatic speedups in practice,” says Qinghao Hu, an MIT postdoc and co-author of a paper on this technique.

He is joined on the paper by co-author Shang Yang, an electrical engineering and computer science (EECS) graduate student; Junxian Guo, an EECS graduate student; senior author Song Han, an associate professor in EECS, a member of the Research Laboratory of Electronics, and a distinguished scientist at NVIDIA; as well as others at NVIDIA, ETH Zurich, the MIT-IBM Watson AI Lab, and the University of Massachusetts at Amherst. The research will be presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems.

Training bottleneck

Developers want reasoning LLMs to identify and correct their own errors through a critical-thinking process. This ability lets them complete intricate queries that would be hard for a standard LLM.

To teach them this skill, developers train reasoning LLMs using a technique called reinforcement learning (RL). The model generates multiple candidate answers to a query, receives a reward for the best candidate, and is updated based on that answer. These steps repeat thousands of times as the model learns.
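In miniature, that generate-reward-update loop can be sketched as follows. This toy version is purely illustrative (the “model” is a single number learning to hit a target answer, not an LLM), but the cycle of sampling candidates, rewarding the best one, and updating the model mirrors the procedure described:

```python
import random

random.seed(0)

TARGET_ANSWER = 10.0

def generate_candidates(model, n=4):
    # Stand-in for sampling: a real reasoning LLM would generate n full
    # token sequences here, not n noisy numbers around its current guess.
    return [model["mean"] + random.gauss(0, 1.0) for _ in range(n)]

def reward(answer):
    # Higher reward for answers closer to the correct one.
    return -abs(answer - TARGET_ANSWER)

def rl_step(model):
    candidates = generate_candidates(model)
    best = max(candidates, key=reward)             # reward the best candidate
    model["mean"] += 0.1 * (best - model["mean"])  # update toward it
    return model

model = {"mean": 0.0}
for _ in range(1000):   # in practice, these steps repeat thousands of times
    rl_step(model)
print(round(model["mean"]))   # prints 10 once the model has converged
```

The expensive part of each step is `generate_candidates`; in a real RL pipeline, that generation phase dominates the runtime, which is exactly the bottleneck discussed below.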

However, the researchers found that this response-generation process, called rollout, can consume as much as 85 percent of the total time needed for RL training.

“Updating the model—which is actually the ‘training’ part—takes very little time in comparison,” Hu says.

This bottleneck arises in standard RL algorithms because all processors in the training cluster must finish generating their responses before any can proceed to the next step. Some processors may be working on very long responses, so the others, which generated shorter responses, sit waiting for them to finish.
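A rough back-of-the-envelope calculation shows how costly that synchronization barrier can be. The rollout lengths below are invented for illustration; two long reasoning chains force every other processor to wait:

```python
# Hypothetical rollout lengths (in tokens) for one synchronous batch:
# two processors draw very long reasoning chains, the rest finish early.
lengths = [120, 150, 9800, 200, 180, 140, 160, 9500]

step_time = max(lengths)             # barrier: everyone waits for the slowest
busy = sum(lengths)                  # useful work actually performed
capacity = step_time * len(lengths)  # total processor-time reserved

print(f"utilization: {busy / capacity:.0%}")   # prints "utilization: 26%"
```

In this made-up batch, roughly three-quarters of the reserved processor time is spent idle, which is the downtime TLT aims to reclaim.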

“Our goal was to turn this downtime into acceleration without any loss,” adds Hu.

To speed things up, the researchers turned to an existing technique called speculative decoding, in which a smaller model, called a drafter, is trained to quickly guess the future output of a larger model.

The larger model verifies the drafter’s guesses, and the accepted answers are used for training.

Because the larger model can verify all of the drafter’s guesses at once, rather than generating each token one at a time, the process is faster.
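In toy form, the speculate-then-verify loop looks like the sketch below. Both “models” are hypothetical stand-ins (the target model deterministically continues a fixed sentence, and the drafter is only accurate for the first few tokens), so the example exercises both acceptance of correct guesses and the fallback when a guess is rejected:

```python
TARGET = "the quick brown fox jumps over".split()

def target_next(prefix):
    # Toy "large model": deterministically continues the fixed sentence.
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else None

def draft_next(prefix, k=3):
    # Toy "drafter": correct for the first 4 positions, then guesses wrong.
    guesses, cur = [], list(prefix)
    for _ in range(k):
        token = target_next(cur) if len(cur) < 4 else "???"
        if token is None:
            break
        guesses.append(token)
        cur.append(token)
    return guesses

def speculative_decode():
    out = []
    while len(out) < len(TARGET):
        proposals = draft_next(out)
        for token in proposals:
            # The target model scores all proposals in one batched pass;
            # this loop models which prefix of them it accepts.
            if token == target_next(out):
                out.append(token)
            else:
                # On the first mismatch, the target model supplies the
                # correct token itself, so progress is always made.
                out.append(target_next(out))
                break
    return " ".join(out)

print(speculative_decode())   # prints "the quick brown fox jumps over"
```

The output always matches what the large model would have produced on its own, which is why the approach is lossless; the speedup comes from accepting several drafted tokens per verification pass.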

Adaptive solution

In speculative decoding, however, the drafting model is usually trained once and then remains static. That makes the technique unworkable for reinforcement learning, because the reasoning model is updated thousands of times during training.

A static drafter would quickly fall out of date and become useless after just a few steps.

To overcome this problem, the researchers created a flexible system they call “taming the long tail” (TLT).

The first part of TLT is an adaptive drafter trainer that uses spare time on idle processors to train the drafter model on the fly, keeping it well aligned with the target model without consuming additional computational resources.

The second component, an adaptive deployment engine, manages speculative decoding by automatically selecting the best strategy for each new batch of inputs. It reconfigures speculative decoding based on characteristics of the training workload, such as how many tokens the drafter proposes and how many the target model accepts during verification.
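The article does not detail the engine’s actual selection policy, but a simple heuristic of this general flavor illustrates the idea; the function name and thresholds below are assumptions for illustration, not the authors’ design:

```python
def choose_draft_length(acceptance_rate, max_k=8):
    # Hypothetical policy: speculate deeply when the target model has been
    # accepting most guesses, and barely at all when the drafter is out of
    # sync, since every draft token past the first rejection is wasted work.
    if acceptance_rate < 0.3:
        return 1
    return min(max_k, 1 + int(acceptance_rate * max_k))

print(choose_draft_length(0.9))   # well-aligned drafter -> prints 8
print(choose_draft_length(0.2))   # stale drafter -> prints 1
```

The point of adapting per batch is that the profitable amount of speculation changes as the target model is updated and as workload characteristics shift.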

In addition, the researchers designed the drafter model to be lightweight so it can be trained quickly. TLT reuses parts of the reasoning model’s training pipeline for drafter training, yielding additional speedups.

“As soon as some processors finish their short queries and become idle, we immediately switch them to training the drafter model, using the same data from the rollout process. The key mechanism is our adaptive speculative decoding; without it, these gains would not be possible,” says Hu.

They tested TLT on multiple reasoning LLMs trained on real-world datasets. The system accelerated training by a factor of 1.7 to 2.1 while maintaining each model’s accuracy.

As an added bonus, the small drafter model comes out of the process as a free by-product and can be reused for efficient inference.

In the future, the researchers want to integrate TLT with more types of training and inference frameworks, and to find new applications of reinforcement learning that could be accelerated by this approach.

“Since reasoning continues to be the main workload driving compute demand, Qinghao’s TLT technique is excellent at addressing the computational bottleneck of training these reasoning models. I think this method will be very helpful for efficient AI computing,” says Han.

This work is funded by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT Amazon Science Hub, Hyundai Motor Company, and the National Science Foundation.
