Thursday, April 23, 2026

Teaching AI models to say “I’m not sure”

Confidence is convincing. In artificial intelligence systems, it is also often misleading.

Today’s most capable reasoning models share a quality with the loudest voice in the room: they deliver every answer with the same unwavering certainty, whether they are right or merely guessing. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have linked this overconfidence to a specific flaw in how these models are trained, and have developed a method that fixes it without sacrificing accuracy.

The technique, called RLCR (Reinforcement Learning with Calibration Rewards), trains language models to produce calibrated confidence estimates alongside their responses. In addition to finding the answer, the model reasons about how uncertain that answer is and generates a confidence score. In experiments across multiple benchmarks, RLCR reduced calibration error by up to 90 percent while maintaining or improving accuracy, both on tasks the model was trained on and on entirely novel ones it had never seen. The work will be presented later this month at the International Conference on Learning Representations.

The problem has a surprisingly plain origin. The reinforcement learning (RL) methods behind recent breakthroughs in AI reasoning, including the training approaches used in systems like OpenAI’s o1, reward models for getting the right answer and punish them for getting it wrong. Nothing in between. A model that reaches the correct answer through careful reasoning receives the same reward as one that guesses correctly by chance. Over time, this trains models to answer every question confidently, whether their reasoning is sound or they simply got lucky.

Overconfidence has real consequences. When models are deployed in medicine, law, finance, or anywhere else users make decisions based on AI output, a system that expresses high confidence regardless of its actual reliability becomes unreliable in ways that are hard to detect from the outside. A model that says “I’m 95 percent sure” while being right only half the time is riskier than one that simply gives a wrong answer, because users get no signal telling them to seek a second opinion.
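The gap described above can be quantified. A minimal sketch (our illustration, not code from the paper) using binned expected calibration error (ECE), a standard metric that averages the gap between stated confidence and observed accuracy:

```python
# Illustrative ECE computation: average |confidence - accuracy| over
# equal-width confidence bins, weighted by the fraction of predictions
# landing in each bin.

def expected_calibration_error(confidences, correct, n_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Predictions whose confidence falls in this bin (last bin includes 1.0).
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# The failure mode from the text: always 95% confident, right half the time.
confs = [0.95] * 10
hits = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(expected_calibration_error(confs, hits))  # ~0.45
```

A perfectly calibrated model would score zero; the overconfident model above carries a calibration error of roughly 0.45.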

“The standard learning approach is simple and effective, but it doesn’t give the model any incentive to express uncertainty or say ‘I don’t know,’” says Mehul Damani, a doctoral student at MIT and co-author of the paper. “So the model naturally learns to guess when it is unsure.”

RLCR solves this problem by adding one term to the reward function: the Brier score, a well-established metric that penalizes the gap between a model’s claimed confidence and its actual accuracy. During training, models learn to reason about both the problem and their own uncertainty, producing an answer and a confidence estimate together. Confidently wrong answers are penalized, and so are correct answers delivered with unwarranted uncertainty.
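The shape of this reward can be sketched in a few lines. This is our hedged reconstruction of the idea described above (the function name and exact weighting are ours, not the paper’s code): a binary correctness term plus a Brier penalty on the stated confidence.

```python
# Sketch of a calibration-aware reward: correctness minus a Brier penalty.
# (Illustrative; the paper's actual reward implementation may differ.)

def rlcr_reward(is_correct: bool, confidence: float) -> float:
    y = 1.0 if is_correct else 0.0
    brier = (confidence - y) ** 2   # zero when confidence matches the outcome
    return y - brier                # correctness reward minus calibration penalty

# A confidently wrong answer is punished hardest; a hedged wrong one far less.
print(rlcr_reward(False, 0.75))  # -0.5625
print(rlcr_reward(False, 0.25))  # -0.0625
# A correct answer stated with matching high confidence earns nearly full reward.
print(rlcr_reward(True, 0.75))   # 0.9375
```

Under this reward, guessing with maximum confidence is no longer a free strategy: a wrong guess at confidence 1.0 costs the full Brier penalty, while an honest low-confidence answer loses almost nothing.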

The math backs it up: the team formally proved that this type of reward structure incentivizes models to be both accurate and well-calibrated. They then tested the approach on a 7-billion-parameter model across a range of question-answering and mathematical benchmarks, including six datasets the model had never been trained on.
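The key property behind that guarantee is that the Brier score is a proper scoring rule: if a model is actually right with probability p, its expected penalty is minimized exactly when it reports confidence q = p. A small numerical check of this fact (an illustration of the general property, not the paper’s proof):

```python
# Expected Brier penalty when the true probability of being correct is p
# and the model reports confidence q:
#   E[(q - Y)^2] = p * (q - 1)^2 + (1 - p) * q^2
# This is a convex quadratic in q, minimized at q = p, so truthful
# confidence reporting is the optimal strategy.

def expected_brier(q: float, p: float) -> float:
    return p * (q - 1.0) ** 2 + (1.0 - p) * q ** 2

p = 0.7  # suppose the model is actually right 70% of the time
grid = [i / 100 for i in range(101)]
best_q = min(grid, key=lambda q: expected_brier(q, p))
print(best_q)  # 0.7 -- honesty minimizes the expected penalty
```

Because bluffing (reporting q above p) strictly increases the expected penalty, the reward leaves the model no incentive to inflate its confidence.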

The results showed a consistent pattern. Standard RL training actively degraded calibration compared to the baseline model, making models worse at estimating their own uncertainty. RLCR reversed this effect, significantly improving calibration without sacrificing accuracy. The method also outperformed post-hoc approaches, in which a separate classifier is trained to assign confidence scores after the fact. “What’s striking is that regular RL training not only fails to help calibration, it actively harms it,” says Isha Puri, a doctoral student at MIT and co-author. “Models are becoming more capable and more overconfident at the same time.”

The team also demonstrated that the confidence estimates RLCR produces are practically useful at inference time. When a model generates multiple candidate answers, selecting the one with the highest self-reported confidence, or weighting votes by confidence in a majority-voting scheme, improves both accuracy and calibration as compute scales.
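The confidence-weighted voting scheme described above can be sketched simply (our illustration of the idea, not the paper’s exact procedure): each sampled answer casts a vote weighted by its self-reported confidence, and the answer with the highest total wins.

```python
# Confidence-weighted majority voting over repeated generations.
# Each sample is an (answer, confidence) pair from one generation.
from collections import defaultdict

def confidence_weighted_vote(samples):
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence   # vote weight = self-reported confidence
    return max(scores, key=scores.get)

# "42" wins (0.9 + 0.8 = 1.7) even though "41" appears just as often (0.4 + 0.3 = 0.7).
samples = [("42", 0.9), ("41", 0.4), ("41", 0.3), ("42", 0.8)]
print(confidence_weighted_vote(samples))  # 42
```

Unlike plain majority voting, this lets one high-confidence answer outweigh several hedged ones, which is only safe when the confidence scores are themselves calibrated.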

An additional finding suggests that the act of reasoning about uncertainty is valuable in itself. The researchers trained classifiers on model outputs and found that including the model’s explicit reasoning about its own uncertainty in the input improved classifier performance, especially for smaller models. The model’s self-reflective reasoning about what it does and does not know carries real information, not mere decoration.

In addition to Damani and Puri, other authors on the paper include Stewart Slocum, Idan Shenfeld, Leshem Choshen, and senior authors Jacob Andreas and Yoon Kim.
