When it comes to AI, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their massive size, complicated training methods, hard-to-predict behavior, and elusive interpretability.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) recently peered through a proverbial magnifying glass to examine how LLMs performed on a variety of tasks, revealing intriguing insights into the interplay between memory and reasoning skills. It turns out that their reasoning abilities are often overestimated.
The study compared “default tasks,” the typical tasks a model is trained and tested on, with “counterfactuals,” hypothetical situations that deviate from those default conditions, which models like GPT-4 and Claude would typically be expected to handle. The researchers designed several tests that pushed the models outside their comfort zones by modifying existing tasks rather than creating entirely new ones. They used a variety of datasets and benchmarks specifically tailored to different aspects of the models’ capabilities, in domains such as arithmetic, chess, evaluating code, answering logic questions, and more.
When users interact with language models, any arithmetic is typically in base 10, the number base most familiar to these models. But observing that they perform well in base 10 may give us the false impression that they have strong addition skills. Logically, if they really do have good addition skills, one would expect reliably high performance across all number bases, similar to calculators or computers. Indeed, the research showed that these models are not as robust as many initially think. Their high performance is limited to common task variants, and they suffer a consistent and severe performance drop on unfamiliar counterfactual scenarios, suggesting a lack of generalizable addition skills.
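To make the base-arithmetic comparison concrete, here is a minimal Python sketch of how a default (base-10) addition query and its counterfactual (base-9) variant might be constructed and checked against a programmatically computed ground truth. The prompt wording and the commented-out query_llm call are illustrative assumptions, not the researchers’ actual evaluation harness.

```python
# Minimal sketch (not the paper's actual harness): build a "default" base-10
# addition query and a "counterfactual" base-9 variant, then compute the
# ground-truth answer to score a model's response against.

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numbers written in `base` and return the sum in that same base."""
    total = int(a, base) + int(b, base)
    out = []
    while total:
        total, r = divmod(total, base)
        out.append(DIGITS[r])
    return "".join(reversed(out)) or "0"

def make_prompt(a: str, b: str, base: int) -> str:
    return f"You are doing addition in base-{base}. What is {a} + {b}? Answer with digits only."

if __name__ == "__main__":
    a, b = "76", "14"
    for base in (10, 9):  # default condition vs. counterfactual condition
        prompt = make_prompt(a, b, base)
        expected = add_in_base(a, b, base)
        # model_answer = query_llm(prompt)  # hypothetical call to whichever LLM API you use
        print(f"[base {base}] expected: {expected}  |  prompt: {prompt}")
```

Running this prints an expected answer of 90 for the base-10 case and 101 for the base-9 case; the study’s finding is that models tend to answer the first kind of query correctly far more often than the second.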
This pattern held true for many other tasks, such as musical chord fingering, spatial reasoning, and even chess problems in which the starting positions of the pieces were slightly altered. While human players are expected to still be able to determine the legality of moves in altered scenarios (given enough time), the models struggled and could not perform better than random guessing, meaning they have limited ability to generalize to unfamiliar situations. And much of their performance on the standard tasks is likely not due to general task abilities, but to overfitting to, or directly memorizing, what they have seen in their training data.
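As an illustration of the chess setup, the sketch below uses the python-chess package to check whether a candidate move is legal from the standard starting position versus a counterfactual start in which, purely for illustration, the kingside knights and bishops are swapped; both the specific perturbation and the tooling are assumptions rather than the study’s actual materials.

```python
# Minimal sketch (assumes the third-party python-chess package; not the study's
# actual setup): test move legality from the default start vs. a perturbed start.
import chess

DEFAULT_FEN = chess.STARTING_FEN
# Counterfactual start: kingside knights and bishops swapped for both sides.
COUNTERFACTUAL_FEN = "rnbqknbr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKNBR w KQkq - 0 1"

candidate = chess.Move.from_uci("g1f3")  # a standard knight development move

for label, fen in [("default", DEFAULT_FEN), ("counterfactual", COUNTERFACTUAL_FEN)]:
    board = chess.Board(fen)
    print(f"{label}: is g1f3 legal? {board.is_legal(candidate)}")
```

Here the ground truth flips from legal to illegal (g1 now holds a bishop, not a knight), so a model that merely recalls standard openings would likely still call the move legal in the perturbed position, consistent with the near-random counterfactual accuracy described above.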
“We discovered a fascinating aspect of large language models: they perform well in familiar scenarios, almost like a well-trodden path, but they struggle when the terrain becomes unfamiliar. This knowledge is crucial as we strive to make these models more adaptable and expand their application horizons,” says Zhaofeng Wu, an MIT doctoral student in electrical engineering and computer science, a CSAIL affiliate, and lead author of a new paper about the research. “As AI becomes more pervasive in our society, it must reliably handle a variety of scenarios, whether they are known or not. We hope that these insights will one day be used to design future LLMs with improved robustness.”
Despite the insights gained, there are of course limitations. The study’s focus on specific tasks and settings did not capture the full range of challenges the models could face in real-world applications, signaling the need for more diverse testing environments. Future work could expand the range of tasks and counterfactual conditions to uncover more potential weaknesses, for instance by looking at more complex and less common scenarios. The team also wants to improve interpretability by developing methods to better understand the rationale behind the models’ decision-making processes.
“As language models become more sophisticated, understanding their training data becomes increasingly difficult, even for open models, let alone proprietary ones,” says Hao Peng, an assistant professor at the University of Illinois at Urbana-Champaign. “The community remains confused about whether these models actually generalize to unseen tasks or whether they are just superficially successful by memorizing training data. This paper takes an important step in answering that question. It constructs a set of carefully designed counterfactual evaluations, providing new insights into the capabilities of state-of-the-art LLMs. It reveals that their ability to solve unseen tasks is likely much more limited than many expected. It has the potential to inspire future research to identify failure modes of today’s models and develop better models.”
The other authors are Najoung Kim, an assistant professor at Boston University and a visiting researcher at Google, and seven CSAIL affiliates: MIT PhD students in electrical engineering and computer science (EECS) Linlu Qiu, Alexis Ross, Ekin Akyürek SM ’21, and Boyuan Chen; former postdoctoral fellow and Apple AI/ML researcher Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.
The team’s research was supported in part by the MIT–IBM Watson AI Lab, MIT Quest for Intelligence, and the National Science Foundation. The team presented the work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.