People use large language models for a huge variety of tasks, from translating an article to identifying financial fraud. However, despite the incredible capabilities and versatility of these models, they sometimes produce incorrect answers.
Worse still, models can be overconfident about incorrect answers or underconfident about correct answers, making it tough for the user to judge whether the model can be trusted.
Researchers typically calibrate a machine learning model to ensure that its confidence level matches its accuracy. A well-calibrated model should be less confident about a prediction that is likely to be wrong, and vice versa. But because large language models (LLMs) can be applied to a seemingly endless collection of diverse tasks, conventional calibration methods are ineffective.
Now, researchers at MIT and the MIT-IBM Watson AI Lab have introduced a calibration method tailored to large language models. Their method, called Thermometer, involves building a smaller, auxiliary model that runs on top of the large language model in order to calibrate it.
Thermometer is more efficient than other approaches, requiring less power-intensive computation, while preserving the model’s accuracy and enabling it to generate better-calibrated responses to tasks it hasn’t seen before.
By enabling efficient calibration of an LLM for a variety of tasks, Thermometer can help users identify situations where the model is overconfident about false predictions, ultimately preventing it from being deployed in a situation where it may fail.
“With Thermometer, we want to provide the user with a clear signal that tells them whether the model’s response is accurate or inaccurate, in a way that reflects the model’s uncertainty, so they know whether the model is reliable,” says Maohao Shen, a graduate student in electrical engineering and computer science (EECS) and lead author of a paper on Thermometer.
Shen is joined on the paper by Gregory Wornell, the Sumitomo Professor of Engineering, who directs the Signals, Information, and Algorithms Laboratory in the Research Laboratory of Electronics and is a member of the MIT-IBM Watson AI Lab; senior author Soumya Ghosh, a research staff member at the MIT-IBM Watson AI Lab; and others at MIT and the MIT-IBM Watson AI Lab. The research was recently presented at the International Conference on Machine Learning.
Universal calibration
Since conventional machine learning models are typically designed to perform a single task, their calibration usually involves a single task-specific method. On the other hand, since LLMs have the flexibility to perform multiple tasks, using a conventional method to calibrate the model for one task may harm its performance on another task.
Calibrating an LLM often involves sampling the model repeatedly to obtain different predictions, and then aggregating those predictions to obtain a better-calibrated confidence. However, because these models have billions of parameters, the computational cost of such approaches quickly adds up.
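As a rough, illustrative sketch of that sampling-based approach (the names and numbers here are assumptions, not drawn from the paper), each extra sample means another full forward pass through the model, so the cost grows linearly with the number of samples:

```python
import numpy as np

def sampled_confidence(sample_fn, n_samples=20):
    """Ensemble-style baseline (illustrative): query the model n_samples times
    and average the predicted distributions. Each call is a full forward pass,
    so the compute cost scales linearly with n_samples."""
    probs = np.mean([sample_fn() for _ in range(n_samples)], axis=0)
    return int(probs.argmax()), float(probs.max())

# 'sample_fn' stands in for one stochastic forward pass of an LLM over answer choices
rng = np.random.default_rng(0)
answer, confidence = sampled_confidence(lambda: rng.dirichlet(np.ones(4)))
print(answer, round(confidence, 3))
```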
“In a sense, large language models are universal because they can handle a variety of tasks. So we need a universal calibration method that can also handle a variety of tasks,” Shen says.
With Thermometer, the researchers developed a versatile technique that leverages a classical calibration method called temperature scaling to efficiently calibrate an LLM for a new task.
In this context, “temperature” is a scaling parameter used to adjust a model’s confidence so that it aligns with its prediction accuracy. Traditionally, the right temperature is determined using a labeled validation dataset of task-specific examples.
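A minimal sketch of what temperature scaling does (with made-up logits, not the paper's setup): dividing a model's output logits by a temperature T greater than 1 before the softmax spreads the probability mass, lowering the reported confidence without changing which answer ranks first:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0])     # hypothetical per-answer scores from a model

for T in (1.0, 2.0):
    probs = softmax(logits / T)        # temperature scaling: divide logits by T
    print(f"T={T}: top answer {probs.argmax()}, confidence {probs.max():.2f}")

# T=1.0 gives a confidence of about 0.84; T=2.0 lowers it to about 0.63,
# while the top-ranked answer stays the same.
```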
Because LLMs are often applied to new tasks, labeled datasets can be nearly impossible to obtain. For example, a user who wants to deploy an LLM to answer customer questions about a new product is unlikely to have a dataset containing such questions and answers.
Instead of using a labeled dataset, the researchers train an auxiliary model that runs on top of the LLM and automatically predicts the temperature needed to calibrate it for this new task.
They use labeled datasets from several representative tasks to train the Thermometer model, and then, once training is complete, it can generalize to new tasks in a similar category without the need for additional labeled data.
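The sketch below is a hypothetical illustration of that idea, not the architecture from the paper: a small regression network maps features drawn from the LLM's hidden states to a single positive temperature, and would be trained on labeled data from several representative tasks so it could later be applied, without labels, to a new one:

```python
import torch
import torch.nn as nn

class TemperaturePredictor(nn.Module):
    """Toy auxiliary model: maps pooled LLM hidden features to one temperature.
    An illustrative stand-in, not the actual Thermometer architecture."""
    def __init__(self, hidden_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, features):
        # softplus keeps the predicted temperature strictly positive
        return nn.functional.softplus(self.net(features)) + 1e-3

# Hypothetical usage: 'features' stands in for pooled hidden states from the LLM
predictor = TemperaturePredictor()
features = torch.randn(8, 4096)            # a batch of stand-in LLM features
temperature = predictor(features).mean()   # one temperature for the whole task
logits = torch.randn(8, 5)                 # stand-in answer logits from the LLM
calibrated = torch.softmax(logits / temperature, dim=-1)
```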
A Thermometer model trained on a collection of multiple-choice question datasets, perhaps including one with algebra questions and one with medical questions, could be used to calibrate an LLM that answers, for example, geometry or biology questions.
“Our goal is to make it work for every task, but we’re not there yet,” Ghosh says.
The Thermometer model only needs access to a small portion of the LLM’s inner workings to predict the right temperature that will calibrate its predictions for data points of a specific task.
Efficient approach
Importantly, this technique does not require multiple training runs and only slightly slows the LLM. Furthermore, because temperature scaling does not change the model’s predictions, Thermometer preserves its accuracy.
When they compared Thermometer to several baselines across multiple tasks, they found that it consistently produced better-calibrated uncertainty measures while requiring significantly less computation.
“If we train the Thermometer model on a large enough number of tasks, it should be able to generalize well to each new task; just like a large language model, it is also a general-purpose model,” Shen adds.
The researchers also found that if they train a Thermometer model for a smaller LLM, it can be directly applied to calibrate a larger LLM within the same family.
In the future, they want to adapt Thermometer to more complex text-generation tasks and apply the technique to even larger LLMs. The researchers also hope to quantify the diversity and number of labeled datasets that would be needed to train a Thermometer model so that it can generalize to a new task.
This research was partially funded by the MIT-IBM Watson AI Lab.