Foundation models are massive deep-learning models that have been pre-trained on enormous amounts of general, unlabeled data. They can be applied to a variety of tasks, such as generating images or answering customer questions. But these models, which underpin powerful AI tools like ChatGPT and DALL-E, can offer incorrect or misleading information. In a safety-critical situation, such as a pedestrian approaching an autonomous car, these mistakes could have serious consequences.
To help prevent such errors, researchers from MIT and the MIT-IBM Watson AI Lab developed a technique to assess the reliability of foundation models before they are deployed for a specific task. They do this by considering a set of foundation models that are slightly different from one another. Then they use their algorithm to assess the consistency of the representations each model learns about the same test data point. If the representations are consistent, the model is reliable.
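In pseudocode terms, the approach looks roughly like the following Python sketch. Everything here is illustrative: pretrain_encoder, the random-seed variation, and the toy linear encoders are hypothetical stand-ins for the much larger pretrained networks the researchers actually compare.

```python
import numpy as np

def pretrain_encoder(seed: int):
    """Hypothetical stand-in for whatever procedure produces one foundation-model
    encoder. Varying the seed yields models that share most properties but differ
    slightly (initialization, data ordering, etc.). Here it is just a random
    linear projection so the sketch runs end to end."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((16, 64))  # 16-dim inputs -> 64-dim embeddings
    return lambda x: np.asarray(x, dtype=float) @ weights

def build_ensemble(n_models: int = 5):
    """A set of slightly different encoders, one per random seed."""
    return [pretrain_encoder(seed) for seed in range(n_models)]

def representations_of(x, encoders):
    """The same test point x, embedded by every model in the ensemble."""
    return [encode(x) for encode in encoders]

# Example: embed one (random) test point with each model in the ensemble.
ensemble = build_ensemble()
test_point = np.random.default_rng(0).standard_normal(16)
reps = representations_of(test_point, ensemble)  # five 64-dimensional vectors
```

The question of reliability then becomes a question of how much these representations agree with one another.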
By comparing their technique with state-of-the-art baseline methods, they found that it better captures the reliability of foundation models across a variety of downstream classification tasks.
One could use this technique to decide whether a model should be applied in a particular setting without the need to test it on a real-world dataset. This could be especially useful when datasets are inaccessible due to privacy concerns, such as in health-care settings. In addition, the technique could be used to rank models based on reliability scores, enabling a user to select the best one for their task.
Senior author Navid Azizan was joined on the paper by lead author Young-Jin Park, a LIDS graduate student; Hao Wang, a research scientist at the MIT-IBM Watson AI Lab; and Shervin Ardeshir, a senior research scientist at Netflix. The paper will be presented at the Conference on Uncertainty in Artificial Intelligence.
Measuring consensus
Conventional machine learning models are trained to perform a specific task. These models typically make specific predictions based on input data. For example, a model might tell you whether a given image contains a cat or a dog. In this case, assessing robustness might involve testing the final prediction to see if the model is correct.
But foundation models are different. These models are pre-trained using general data, in a setting where their creators don’t know all the downstream tasks they will be applied to. Users customize them for their specific tasks after the training is complete. Unlike traditional machine-learning models, foundation models don’t produce concrete outputs such as “cat” or “dog” labels. Instead, they generate an abstract representation of an input data point. To assess the robustness of a foundation model, the researchers used an ensemble approach, training several models that share many properties but differ slightly from one another.
But they encountered a problem: how could they compare abstract representations?
They solved this problem by using an idea called neighborhood consistency. In their approach, the researchers prepare a set of reliable reference points to test on an ensemble of models. Then, for each model, they examine reference points that are near the representation of the test point in that model. By analyzing the consistency of neighboring points, they can estimate the reliability of the models.
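Below is a minimal sketch of how such a neighborhood-consistency check could be computed, reusing the toy encoders from the earlier sketch. The Euclidean distance, the number of neighbors k, and the Jaccard overlap between neighbor sets are illustrative assumptions rather than the authors’ exact formulation.

```python
import numpy as np

def nearest_reference_neighbors(test_emb, ref_embs, k=10):
    """Indices of the k reference points closest to the test point
    in one model's representation space (Euclidean distance)."""
    dists = np.linalg.norm(ref_embs - test_emb, axis=1)
    return set(np.argsort(dists)[:k].tolist())

def neighborhood_consistency(test_x, reference_xs, encoders, k=10):
    """Average pairwise overlap of the neighbor sets across the ensemble.
    Raw embedding coordinates differ from model to model, but because the
    reference points are shared inputs, their indices are directly comparable."""
    neighbor_sets = []
    for encode in encoders:
        test_emb = encode(test_x)                               # shape (d,)
        ref_embs = np.stack([encode(r) for r in reference_xs])  # shape (n, d)
        neighbor_sets.append(nearest_reference_neighbors(test_emb, ref_embs, k))

    overlaps = []  # Jaccard overlap for every pair of models
    for i in range(len(neighbor_sets)):
        for j in range(i + 1, len(neighbor_sets)):
            a, b = neighbor_sets[i], neighbor_sets[j]
            overlaps.append(len(a & b) / len(a | b))
    return float(np.mean(overlaps))
```

A score near 1 means the models largely agree on which reference points sit next to the test point, which is the kind of consensus the researchers treat as a sign of reliability.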
Aligning representations
Foundation models map data points into what is known as a representation space. One way to think about this space is as a sphere. Each model maps similar data points to the same part of its sphere, so images of cats end up in one place and images of dogs in another. But each model maps animals differently in its own sphere: while cats might be grouped near the south pole of one sphere, another model might map cats somewhere in the northern hemisphere.
The researchers use the neighboring points as anchors to align these spheres so they can make the representations comparable. If a data point’s neighbors are consistent across multiple representations, then one can be confident about the reliability of the model’s output for that point.
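One way to make the anchor idea concrete, sketched below under the same toy setup, is to describe the test point by its similarities to the shared reference points and then check how well those similarity profiles agree across models. The cosine similarity and the correlation used here are illustrative choices, not necessarily the authors’ exact method.

```python
import numpy as np

def anchor_profile(test_emb, ref_embs):
    """Describe a test point by its cosine similarity to each shared anchor point.
    Because the anchors are the same inputs for every model, these profiles are
    comparable across models even though the spheres themselves differ."""
    ref_unit = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    test_unit = test_emb / np.linalg.norm(test_emb)
    return ref_unit @ test_unit  # shape (n_anchors,)

def cross_model_agreement(test_x, reference_xs, encoders):
    """Average pairwise correlation of anchor profiles across the ensemble;
    higher agreement suggests a more reliable representation for this input."""
    profiles = []
    for encode in encoders:
        test_emb = encode(test_x)
        ref_embs = np.stack([encode(r) for r in reference_xs])
        profiles.append(anchor_profile(test_emb, ref_embs))

    agreements = []
    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            agreements.append(np.corrcoef(profiles[i], profiles[j])[0, 1])
    return float(np.mean(agreements))
```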
When they tested this approach on a wide range of classification tasks, they found it was much more consistent than the baselines. Plus, it wasn’t tripped up by the challenging test points that caused other methods to fail. Moreover, their approach can be used to assess the reliability of any input, so one could evaluate how well a model works for a particular type of individual, such as a patient with certain characteristics.
However, one limitation is that they must train an ensemble of foundation models, which is computationally expensive. In the future, they plan to find more efficient ways to build multiple models, perhaps by using small perturbations of a single model.
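As a purely hypothetical illustration of what such perturbations might look like, the sketch below adds small Gaussian noise to the weights of one trained PyTorch model to create a cheap pseudo-ensemble; the noise scale and the approach itself are assumptions, not the researchers’ published method.

```python
import copy
import torch

def perturbed_copies(model: torch.nn.Module, n_copies: int = 5, noise_scale: float = 1e-3):
    """Build a pseudo-ensemble by adding small Gaussian noise to the weights of
    a single trained model, avoiding the cost of training several foundation
    models from scratch (an assumed shortcut, not the published approach)."""
    copies = []
    for _ in range(n_copies):
        clone = copy.deepcopy(model)
        with torch.no_grad():
            for param in clone.parameters():
                param.add_(noise_scale * torch.randn_like(param))
        copies.append(clone.eval())
    return copies
```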
This work is funded, in part, by the MIT-IBM Watson AI Lab, MathWorks, and Amazon.