Monday, April 21, 2025

Not every AI prompt deserves multiple seconds of thinking: how Meta is teaching models to prioritize

Reasoning models such as OpenAI o1 and DeepSeek-R1 have a problem: they overthink. Ask them a simple question such as "What is 1+1?" and they will think for several seconds before answering.

Ideally, like humans, AI models should be able to tell when to give a direct answer and when to spend extra time and resources reasoning before responding. A new technique presented by researchers at Meta AI and the University of Illinois Chicago trains models to allocate inference budgets based on the difficulty of the query. The result is faster responses, reduced costs, and better allocation of compute resources.

DeepSeek-R1 reasoning about "1+1" (screenshot)

The high cost of reasoning

Large language models (LLMs) can improve their performance on reasoning problems when they produce longer chains of reasoning, often referred to as "chain-of-thought" (CoT). The success of CoT has spawned an entire field of inference-time scaling techniques that prompt the model to "think longer" about a problem, producing and reviewing multiple answers and choosing the best one.

One of the main techniques used in reasoning models is to generate multiple answers and choose the one that recurs most often, also known as "majority voting" (MV). The problem with this approach is that the model adopts a uniform behavior, treating every prompt as a hard reasoning problem and spending unnecessary resources on generating multiple answers.
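For illustration only, the core of majority voting can be sketched in a few lines of Python; the `generate_answer` helper below is a placeholder for whatever model or API call is being sampled, not part of the paper's code.

```python
from collections import Counter

def generate_answer(prompt: str) -> str:
    """Placeholder for one sampled completion from a reasoning model."""
    raise NotImplementedError("Plug in a model or API call here.")

def majority_vote(prompt: str, num_samples: int = 8) -> str:
    """Classic MV: always draw num_samples answers and return the most common one."""
    answers = [generate_answer(prompt) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The inefficiency is visible in the sketch: all `num_samples` completions are generated even when the prompt is trivial.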

Smarter reasoning

To address this, the researchers propose "sequential voting" (SV), a variant in which the model stops the reasoning process as soon as an answer appears a certain number of times. Their experiments show that SV outperforms classic MV on math competition problems when it generates the same number of answers. However, SV requires extra instructions and token generation, which puts it roughly on par with MV in terms of the token-to-accuracy ratio.

SV outperforms MV at equal numbers of answers but is on par at equal numbers of tokens (source: arXiv)
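As a rough sketch of this early-stopping idea (under the assumption described above, not the paper's implementation), sequential voting simply adds an exit condition to the majority-voting loop:

```python
from collections import Counter

def sequential_vote(prompt: str, max_samples: int = 8, threshold: int = 3) -> str:
    """SV sketch: sample answers one at a time and stop as soon as any answer
    has appeared `threshold` times, instead of always generating max_samples."""
    counts = Counter()
    for _ in range(max_samples):
        answer = generate_answer(prompt)  # same placeholder as in the MV sketch
        counts[answer] += 1
        if counts[answer] >= threshold:
            return answer  # easy prompts converge after a few samples
    return counts.most_common(1)[0][0]  # otherwise fall back to a plain majority vote
```

For a question like "1+1", the first few answers will almost certainly agree, so the loop exits early and most of the sampling budget is never spent.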

The second technique, "adaptive sequential voting" (ASV), improves on SV by prompting the model to examine the problem and only generate multiple answers when the problem is difficult. For simple problems (such as the 1+1 prompt), the model simply produces a single answer without going through the voting process. This makes the model much more efficient at handling both simple and complex problems.
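A sketch of the adaptive step, in the same style: here `looks_complex` is a hypothetical stand-in for the model's own judgment of difficulty (in ASV the model itself decides whether to answer directly), so this illustrates the control flow rather than the authors' implementation.

```python
def adaptive_sequential_vote(prompt: str, max_samples: int = 8, threshold: int = 3) -> str:
    """ASV sketch: only spend the voting budget when the prompt looks hard."""
    if not looks_complex(prompt):       # hypothetical difficulty check
        return generate_answer(prompt)  # easy prompt: answer directly, no voting
    return sequential_vote(prompt, max_samples, threshold)

def looks_complex(prompt: str) -> bool:
    """Placeholder for the difficulty judgment; in ASV the reasoning model itself
    is prompted to make this call rather than a separate classifier."""
    raise NotImplementedError
```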

Reinforcement learning

While both SV and ASV improve the model's efficiency, they require a lot of hand-labeled data. To alleviate this problem, the researchers propose "inference budget-constrained policy optimization" (IBPO), a reinforcement learning algorithm that teaches the model to adjust the length of its reasoning based on the difficulty of the query.

IBPO is designed to allow LLMs to optimize their responses while remaining within an inference budget constraint. The RL algorithm enables the model to surpass the gains obtained through training on hand-labeled data by continuously generating ASV traces, evaluating the responses, and selecting outcomes that provide the correct answer within the optimal inference budget.
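At a high level, IBPO is a constrained reinforcement-learning problem: maximize the reward earned by the generated ASV traces while keeping inference cost within a budget. The snippet below shows one generic way to express that trade-off, a soft penalty on responses that exceed a token budget; it is an assumption-laden illustration of the general idea, not the authors' algorithm, and the function name, budget and weight are made up.

```python
def budget_penalized_reward(is_correct: bool, tokens_used: int,
                            token_budget: int = 2048,
                            penalty_weight: float = 0.001) -> float:
    """Illustrative constrained-RL reward: reward correct answers and subtract a
    penalty for tokens spent beyond the allowed inference budget. IBPO itself
    formulates this as a constrained optimization problem; the soft penalty here
    is just a common, simplified way to sketch the same trade-off."""
    reward = 1.0 if is_correct else 0.0
    overage = max(0, tokens_used - token_budget)
    return reward - penalty_weight * overage
```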

Their experiments show that IBPO improves the Pareto front, meaning that for a fixed inference budget, a model trained with IBPO outperforms other baselines.

IBPO (green circles) outperforms other baselines on the Pareto front (source: arXiv)

The findings come against the backdrop of researchers warning that current AI models are hitting a wall. Companies are struggling to find high-quality training data and are exploring alternative methods for improving their models.

One promising alternative is reinforcement learning, in which the model is given an objective and left to find its own solutions, as opposed to supervised fine-tuning (SFT), in which the model is trained on labeled examples.

Surprisingly, the model often finds solutions that humans haven't thought of. This is a formula that seems to have worked well for DeepSeek-R1, which has challenged the dominance of U.S.-based AI labs.

The researchers note that "prompting-based and SFT-based methods struggle with both absolute improvement and efficiency, supporting the conjecture that SFT alone does not enable self-correction capability. This observation is also partially supported by concurrent work, which suggests that such self-correction behavior emerges automatically during RL rather than being manually created by prompting or SFT."
