MIT researchers have identified striking examples of machine learning models failing when applied to data that differ from what they were trained on, underscoring the need to test models each time they are deployed in a new setting.
“We show that even when models are trained on large amounts of data and we select the model with the best average performance under new conditions, this ‘best’ model may be the worst model for 6 to 75 percent of the new data,” says Marzyeh Ghassemi, associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), member of the Institute for Medical Engineering and Science, and principal investigator in the Laboratory for Information and Decision Systems (LIDS).
In a paper presented at the Neural Information Processing Systems (NeurIPS 2025) conference in December, the researchers point out that a model trained to diagnose diseases from, for example, chest X-rays in one hospital may look effective, on average, in another hospital. Yet their evaluation revealed that some of the best-performing models from the first hospital performed worst for as many as 75 percent of the patients in the second, because aggregating over all patients in the second hospital produces a high average performance that masks this failure.
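The arithmetic behind this masking effect is simple. The toy calculation below (with hypothetical subgroup sizes and accuracies, not figures from the paper) shows how a model can look strong on average while failing badly on a subpopulation:

```python
# Hypothetical numbers: 90% of patients are "typical", 10% are "atypical".
subgroup_sizes = {"typical": 900, "atypical": 100}
subgroup_accuracy = {"typical": 0.95, "atypical": 0.20}

total_patients = sum(subgroup_sizes.values())
overall_accuracy = sum(
    subgroup_sizes[g] * subgroup_accuracy[g] for g in subgroup_sizes
) / total_patients

# An 87.5% overall accuracy hides 20% accuracy on the atypical subgroup.
print(f"overall accuracy: {overall_accuracy:.3f}")  # overall accuracy: 0.875
```

Reporting only the 87.5 percent figure would give no hint that one in ten patients is served by a model that is wrong four times out of five.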
Their findings show that spurious correlations – a classic example is an image classifier that, having rarely “seen” cows on beaches during training, labels a photo of a cow on a beach as a killer whale simply because of the setting – are not eliminated merely by improving a model’s performance on the data it was trained on; they persist and remain a threat to a model’s validity in new settings. In many cases – including the areas the researchers study, such as chest X-rays, histopathology images of cancer, and hate speech detection – such spurious correlations are much harder to detect.
For example, a medical diagnostic model trained on chest X-rays could learn to associate a specific but irrelevant marking on one hospital’s X-rays with a particular pathology. In another hospital that does not use that marking, the model could miss the pathology.
Previous research by Ghassemi’s group has shown that models can spuriously correlate factors such as age, gender, and race with medical test results. For example, if a model was trained on many chest X-rays of older people with pneumonia but “saw” few X-rays of younger people, it could learn to predict that only older patients get pneumonia.
“We want the models to learn to look at a patient’s anatomical features and then make a decision based on that,” says Olawale Salaudeen, an MIT postdoc and lead author of the paper, “but really anything in the data that is correlated with a decision can be used by the model. And those correlations may not actually be robust to changes in the environment, making the model’s predictions an unreliable basis for decision-making.”
Spurious correlations raise the risk of biased decision-making. In the NeurIPS paper, the researchers showed, for example, that chest X-ray models with better overall diagnostic performance actually produced worse results for patients with pleural conditions or an enlarged cardiomediastinum, an enlargement of the heart or central chest cavity.
Other authors of the paper were graduate students Haoran Zhang and Kumail Alhamoud, EECS assistant professor Sara Beery and Ghassemi.
Although previous work generally assumed that models ranked from best to worst by performance in one setting maintain that order when applied to new settings – a phenomenon called “accuracy on the line” – the researchers demonstrated situations in which the best-performing models in one setting were the worst in another.
Salaudeen developed an algorithm called OODSelect to find examples where accuracy on the line breaks down. Essentially, he trained thousands of models on in-distribution data, meaning data from the first setting, and computed their accuracy. He then applied the models to data from the second setting. When the models that were most accurate in the first setting were wrong on a large percentage of examples in the second setting, those examples identified problematic subsets, or subpopulations. Salaudeen also highlights the dangers of relying on summary statistics for evaluation, which can obscure more granular and consistent information about model performance.
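The core idea – rank out-of-distribution examples by how strongly a model’s in-distribution accuracy predicts *failure* on them – can be sketched in a few lines. This is a simplified illustration, not the authors’ released code: the Pearson-correlation scoring, the function name, and the inputs here are assumptions made for the sake of the sketch.

```python
from math import sqrt

def ood_select(id_accuracies, ood_correct, top_k):
    """Toy sketch of OODSelect-style subset discovery.

    id_accuracies: per-model accuracy in the first (in-distribution) setting.
    ood_correct: ood_correct[m][e] is 1 if model m classifies example e
        from the second setting correctly, else 0.
    Returns indices of the top_k second-setting examples on which higher
    in-distribution accuracy is most strongly associated with being wrong
    (most negative correlation).
    """
    n_models = len(id_accuracies)
    n_examples = len(ood_correct[0])
    mean_acc = sum(id_accuracies) / n_models
    acc_dev = [a - mean_acc for a in id_accuracies]
    acc_norm = sqrt(sum(d * d for d in acc_dev))

    scores = []
    for e in range(n_examples):
        col = [ood_correct[m][e] for m in range(n_models)]
        mean_col = sum(col) / n_models
        col_dev = [c - mean_col for c in col]
        col_norm = sqrt(sum(d * d for d in col_dev))
        if acc_norm == 0 or col_norm == 0:
            scores.append((0.0, e))  # constant column: no signal
            continue
        corr = sum(a * c for a, c in zip(acc_dev, col_dev)) / (acc_norm * col_norm)
        scores.append((corr, e))

    # Most negative correlation first: the in-distribution-best models
    # systematically fail on these examples.
    scores.sort()
    return [e for _, e in scores[:top_k]]
```

In the paper’s setting, the “models” are thousands of trained networks and the examples are patients in the second hospital; examples surfaced this way are candidates for the problematic subpopulations described above.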
During this work, the researchers isolated the most misclassified examples to avoid conflating spurious correlations in the dataset with examples that are simply difficult to classify.
Alongside the NeurIPS paper, the researchers released their code and some of the identified subsets for future work.
When a hospital, or any organization using machine learning, identifies subsets where a model performs poorly, that information can be used to improve the model for the specific task and environment. The researchers recommend that future work apply OODSelect to surface evaluation targets and to design methods that improve performance more consistently.
“We hope that the released OODSelect code and subsets will serve as a stepping stone,” the researchers write, “towards benchmarks and models that confront the adverse effects of spurious correlations.”
