Friday, December 27, 2024

How to Assess the Reliability of a General-Purpose AI Model Before Deploying It

Foundation models are massive deep-learning models that have been pretrained on an enormous amount of general-purpose, unlabeled data. They can be applied to a variety of tasks, such as generating images or answering customer questions.

But these models, which underpin powerful AI tools like ChatGPT and DALL-E, can provide incorrect or misleading information. In safety-critical situations, such as a pedestrian approaching an autonomous car, these errors can have grave consequences.

To help prevent such errors, researchers from MIT and the MIT-IBM Watson AI Lab developed a technique to assess the reliability of foundation models before they are deployed for a specific task.

They do this by considering a set of foundation models that are slightly different from one another. They then use their algorithm to assess the consistency of the representations each model learns about the same test data point. If the representations are consistent, the model is deemed reliable.

When they compared their technique to state-of-the-art baseline methods, they found that it better captures the reliability of foundation models across a variety of downstream classification tasks.

One could use this technique to decide whether a model should be applied in a particular setting without testing it on a real-world dataset. This could be especially useful when datasets are inaccessible due to privacy concerns, such as in health care settings. In addition, the technique could be used to rank models by their reliability scores, allowing a user to select the best model for their task.

“All models can be wrong, but models that know when they are wrong are more useful. The problem of quantifying uncertainty or reliability is more challenging for these foundation models because their abstract representations are difficult to compare. Our method allows us to quantify how reliable a model’s representation is for any given input,” says senior author Navid Azizan, the Esther and Harold E. Edgerton Assistant Professor in the MIT Department of Mechanical Engineering and the Institute for Data, Systems, and Society (IDSS), and a member of the Laboratory for Information and Decision Systems (LIDS).

Azizan is joined on a paper about the work by lead author Young-Jin Park, a LIDS graduate student; Hao Wang, a research scientist at the MIT-IBM Watson AI Lab; and Shervin Ardeshir, a senior research scientist at Netflix. The paper will be presented at the Conference on Uncertainty in Artificial Intelligence.

Consensus Measurement

Traditional machine-learning models are trained to perform a single task. These models typically make a concrete prediction based on an input; for example, a model might tell you whether a given image contains a cat or a dog. In that case, assessing reliability could simply mean checking the final prediction to see whether the model is correct.

But foundation models are different. The model is pretrained on general data, in a setting where its creators do not know all the downstream tasks it will be applied to. Users adapt it to their specific tasks after it has been trained.

Unlike traditional machine-learning models, foundation models do not produce concrete outputs such as “cat” or “dog” labels. Instead, they generate an abstract representation of each input data point.
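
As a toy illustration of that difference, the short sketch below uses an off-the-shelf text encoder (the sentence-transformers library and the "all-MiniLM-L6-v2" model, chosen here purely for illustration and unrelated to the researchers' work) to show that a foundation model returns a vector of numbers rather than a label.

    # A task-specific classifier would answer "cat" or "dog"; a foundation
    # model instead returns an abstract embedding vector.
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
    embedding = encoder.encode("a photo of a cat")
    print(embedding.shape)  # e.g., (384,) -- just a vector of numbers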

To assess the reliability of a foundation model, the researchers used an ensemble approach, training several models that share many properties but differ slightly from one another.

“Our idea is like measuring consensus. If all of those foundation models produce consistent representations for any data in our dataset, then we can say this model is reliable,” Park says.

But they encountered a problem: how could they compare abstract representations?

“These models simply generate a vector of some numbers, so we can’t easily compare them,” he adds.

They solved this problem using a concept called neighborhood consistency.

In their approach, the researchers prepare a set of reliable reference points to test on the ensemble of models. Then, for each model, they examine the reference points that fall near that model’s representation of the test point.

By looking at how consistent these neighboring points are across models, they can estimate the models’ reliability.
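
One way to make that idea concrete is sketched below, under the assumption that we already have an ensemble of embedding models (callables mapping an input to a vector) and a shared pool of reference inputs: collect each model's k nearest reference points for the test input, then measure how much those neighbor sets overlap across models. The names (models, reference_inputs, test_input) are placeholders, and the overlap score is an illustrative stand-in rather than the authors' exact metric.

    import numpy as np

    def neighbor_set(embed, reference_inputs, test_input, k=10):
        """Indices of the k reference points closest to the test point
        in this particular model's representation space."""
        refs = np.stack([embed(x) for x in reference_inputs])  # (N, d)
        test = embed(test_input)                                # (d,)
        dists = np.linalg.norm(refs - test, axis=1)
        return set(np.argsort(dists)[:k].tolist())

    def neighborhood_consistency(models, reference_inputs, test_input, k=10):
        """Average pairwise Jaccard overlap of the neighbor sets across the
        ensemble; values near 1 mean the models agree on this input."""
        sets = [neighbor_set(m, reference_inputs, test_input, k) for m in models]
        overlaps = [len(a & b) / len(a | b)
                    for i, a in enumerate(sets) for b in sets[i + 1:]]
        return float(np.mean(overlaps))

Jaccard overlap is just one convenient agreement measure; any statistic computed over the shared neighbors would fit the same recipe.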

Aligning the representations

Foundation models map data points into what is known as a representation space. One way to think about this space is as a sphere. Each model maps similar data points to the same region of its sphere, so images of cats go in one place and images of dogs go in another.

However, each model would map animals differently within its own sphere, so while cats might be placed near the South Pole of one sphere, another model might map cats somewhere in the Northern Hemisphere.

The researchers use neighboring points as anchors to align the models’ spheres so the representations can be compared. If a data point’s neighbors are consistent across multiple representations, one can be confident in the reliability of the model’s output for that point.
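
A rough way to realize that anchoring idea, shown below as an illustrative construction rather than the paper's exact alignment procedure, is to describe each test point not by its raw embedding, which lives in a model-specific space, but by its similarities to a shared set of anchor points; vectors built this way can then be compared directly across models. Here embed_a, embed_b, and anchors are placeholders.

    import numpy as np

    def relative_representation(embed, anchors, x):
        """Cosine similarity of x's embedding to each anchor's embedding,
        yielding a vector that no longer depends on the model's own axes."""
        A = np.stack([embed(a) for a in anchors])            # (N, d)
        v = embed(x)                                         # (d,)
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        v = v / np.linalg.norm(v)
        return A @ v                                         # (N,)

    # Two models with different "spheres" become comparable once the same
    # test point is expressed relative to the same anchors:
    # r_a = relative_representation(embed_a, anchors, test_input)
    # r_b = relative_representation(embed_b, anchors, test_input)
    # agreement = float(r_a @ r_b / (np.linalg.norm(r_a) * np.linalg.norm(r_b)))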

When they tested this approach on a wide range of classification tasks, they found it was much more consistent than baselines at capturing reliability. It also wasn’t tripped up by the challenging test points that caused other methods to fail.

Moreover, their approach can be used to assess reliability for any input, so one can evaluate how well a model works for a particular type of individual, such as a patient with certain characteristics.

“Even if all models have average performance overall, from an individual perspective, you prefer the one that performs best for you,” Wang says.

However, one limitation is that they must train an ensemble of foundation models, which is computationally expensive. In the future, they plan to find more efficient ways to build multiple models, perhaps by using small perturbations of a single model.

“With the current trend of using foundation models for their embeddings to support various downstream tasks, ranging from fine-tuning to retrieval-augmented generation, the topic of quantifying uncertainty at the representation level is increasingly important but challenging, since the embeddings on their own have no grounding. What matters instead is how the embeddings of different inputs are related to one another, an idea that this work neatly captures through the proposed neighborhood consistency score,” says Marco Pavone, an associate professor in the Department of Aeronautics and Astronautics at Stanford University, who was not involved in this work. “This is a promising step toward high-quality uncertainty quantification for embedding models, and I look forward to future extensions that can operate without model ensembling, to truly enable this approach to scale to foundation-model sizes.”

This work is funded, in part, by the MIT-IBM Watson AI Lab, MathWorks, and Amazon.
