Because machine-learning models can make false predictions, researchers often equip them with the ability to tell users how confident they are in a given decision. This is especially vital in high-stakes situations, such as when models are used to help identify disease in medical images or filter job applications.
But quantifications of model uncertainty are only useful if they are accurate. If a model says it is 49 percent certain that a medical image shows a pleural effusion, then the model should be right 49 percent of the time.
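One common way to check that property is the expected calibration error, which bins predictions by confidence and compares each bin's average confidence with its empirical accuracy. The sketch below is a generic illustration of that standard metric, not code from the MIT work; the inputs are assumed to be per-prediction confidences and correctness indicators.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare average confidence with
    empirical accuracy in each bin. A well-calibrated model that reports
    49 percent certainty should be right about 49 percent of the time."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of points in the bin
    return ece
```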
MIT researchers have introduced a new approach that can improve uncertainty estimates in machine-learning models. Their method not only generates more accurate uncertainty estimates than other techniques, but does so more efficiently.
Furthermore, because the technique is scalable, it can be applied to large-scale deep learning models, which are increasingly used in healthcare and other safety-critical situations.
The technique can provide end users, many of whom lack machine-learning expertise, with better information to help them determine whether a model’s predictions can be trusted or whether the model should be deployed for a particular task.
“It’s easy to see that these models perform really well in scenarios where they’re very good, and then assume that they’ll perform just as well in other scenarios. That makes it especially important to promote this kind of work that aims to better calibrate the uncertainty of these models to make sure that they’re consistent with human notions of uncertainty,” says lead author Nathan Ng, a graduate student at the University of Toronto who is a visiting student at MIT.
Ng wrote the paper with Roger Grosse, an assistant professor of computer science at the University of Toronto; and senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science and a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems. The research will be presented at the International Conference on Machine Learning.
Quantifying uncertainty
Uncertainty quantification methods often require sophisticated statistical computations that do not scale well to machine learning models with millions of parameters. These methods also require users to make assumptions about the model and the data used to train it.
The MIT researchers took a different approach. They use what’s known as the minimum description length (MDL) principle, which doesn’t require assumptions that can hamper the accuracy of other methods. MDL is used to better quantify and calibrate the uncertainty for the test points the model is asked to label.
The technique the researchers developed, known as IF-COMP, makes MDL fast enough to be used with the large deep-learning models deployed in many real-world settings.
MDL involves considering all possible labels that a model could give to a test point. If there are multiple alternative labels for that point that are good matches, its confidence in the chosen label should decrease accordingly.
“One way to understand how confident a model is would be to tell it some counterfactual information and see how likely it is to believe you,” Ng says.
For example, consider a model that says a medical image shows a pleural effusion. If researchers tell the model that the image shows edema and it is inclined to update its belief, the model should be less confident in its original decision.
In the case of MDL, if the model is confident in labeling a data point, it should use a very short code to describe that point. If it is unsure about its decision because the point could have many other labels, it uses a longer code to capture those possibilities.
The length of the code used to label a data point is known as the stochastic complexity of the data. If researchers ask a model how willing it is to update its belief about a data point given evidence to the contrary, the stochastic complexity should decrease if the model is confident.
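As a rough illustration of that coding intuition (not the paper's exact formulation, which builds on an approximate normalized maximum likelihood code), the Shannon code length of a label under a model's predictive distribution is short when the model concentrates its probability on one label and long when it spreads probability across alternatives:

```python
import numpy as np

def code_length_bits(predictive_probs, label):
    """Shannon code length, in bits, of a label under the model's
    predictive distribution: confident predictions yield short codes,
    uncertain ones yield long codes."""
    return -np.log2(predictive_probs[label])

# Illustrative distributions over three labels for a single test point.
confident = np.array([0.95, 0.03, 0.02])
uncertain = np.array([0.40, 0.35, 0.25])

print(code_length_bits(confident, 0))  # ~0.07 bits
print(code_length_bits(uncertain, 0))  # ~1.32 bits
```

Comparing such code lengths before and after the model is shown counterfactual evidence is the kind of probe that the MDL framing makes precise.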
However, testing every data point with MDL would require enormous computational effort.
Speeding up the process
With IF-COMP, the researchers developed an approximation technique that can accurately estimate the stochastic complexity of the data using a special function known as an influence function. They also used a statistical technique called temperature scaling that improves the calibration of the model results. This combination of influence functions and temperature scaling enables high-quality approximations of the stochastic complexity of the data.
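Temperature scaling on its own is a standard calibration step: a single scalar is learned on held-out data to rescale a model's logits. The minimal PyTorch sketch below shows only that standard step; the influence-function approximation that IF-COMP combines it with is specific to the paper and is not reproduced here. The variable names and shapes (a matrix of validation logits and a vector of labels) are assumptions for illustration.

```python
import torch

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Learn a single scalar T > 0 that rescales logits to minimize
    negative log-likelihood on held-out data (standard temperature scaling).
    val_logits: (N, C) detached float tensor; val_labels: (N,) long tensor."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Calibrated probabilities for new test logits would then be
# torch.softmax(test_logits / T, dim=-1).
```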
Ultimately, IF-COMP can efficiently produce well-calibrated uncertainty quantifications that reflect the true confidence of the model. The technique can also determine whether the model has mislabeled certain data points or reveal which data points are outliers.
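As an illustration of how those last two capabilities could be used in practice, once each data point has an approximate complexity score, flagging candidates for mislabeling or outliers can be as simple as ranking by that score. The helper below is hypothetical and assumes the scores have already been computed by a method such as IF-COMP.

```python
import numpy as np

def flag_suspect_points(complexity_scores, top_fraction=0.05):
    """Return indices of the highest-complexity points, which are the
    most likely candidates for mislabeled or out-of-distribution data."""
    scores = np.asarray(complexity_scores)
    k = max(1, int(len(scores) * top_fraction))
    return np.argsort(scores)[-k:][::-1]  # k largest scores, descending
```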
The researchers tested their system on these three tasks (calibrated uncertainty quantification, mislabel detection, and outlier detection) and found it to be faster and more accurate than other methods.
“It’s really important to make sure that the model is well-calibrated, and the need to detect when a particular prediction doesn’t look quite right is growing. Auditing tools are becoming increasingly necessary for machine learning problems because we’re using large amounts of unexamined data to build models that will be applied to problems that affect people,” Ghassemi says.
IF-COMP is model-agnostic, so it can provide accurate uncertainty quantifications for many types of machine-learning models. This could enable its implementation in a wider range of real-world settings, ultimately helping more practitioners make better decisions.
“People need to understand that these systems are very unreliable and can make things up on the fly. A model can seem very confident, but there are a lot of different things it is willing to believe given evidence to the contrary,” Ng says.
In the future, the researchers want to apply their approach to large language models and explore other potential use cases for the minimum description length principle.