As AI models become more common and integrated into various sectors such as healthcare, finance, education, transportation, and entertainment, understanding how they work under the hood is crucial. Interpreting the mechanisms behind AI models allows us to audit them for security and bias, potentially deepening our understanding of the science behind intelligence itself.
Imagine if we could directly study the human brain, manipulating each of its neurons to investigate their role in perceiving a particular object. While such an experiment would be impossibly invasive in a human brain, it is more feasible in a different type of neural network: an artificial one. However, as with the human brain, artificial models containing millions of neurons are too vast and complicated to study by hand, making large-scale interpretability a very arduous task.
To address this, researchers at MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) decided to take an automated approach to interpreting AI models that evaluate various properties of images. They developed “MAIA” (Multimodal Automated Interpretability Agent), a system that automates various interpretability tasks in neural networks using a vision-language model framework equipped with tools for experimenting on other AI systems.
“Our goal is to create an AI researcher that can autonomously perform interpretability experiments. Existing automated interpretability methods merely label or visualize data in a one-time process. MAIA, by contrast, can generate hypotheses, design experiments to test them, and refine its understanding through iterative analysis,” says Tamar Rott Shaham, an MIT postdoc in electrical engineering and computer science (EECS) at CSAIL and co-author of a new paper on the research. “By combining a pre-trained vision-language model with a library of interpretability tools, our multimodal method can respond to user queries by composing and running targeted experiments on specific models, continuously refining its approach until it can provide a comprehensive answer.”
The automated agent takes on three key tasks: labeling individual components within vision models and describing the visual concepts that activate them, cleaning up image classifiers by removing irrelevant features to make them more robust to new situations, and hunting for hidden biases in AI systems to help uncover potential fairness issues in their outputs. “But a key advantage of a system like MAIA is its flexibility,” says Sarah Schwettmann PhD ’21, a research scientist at CSAIL and co-lead of the study. “We’ve demonstrated MAIA’s utility for a few specific tasks, but given that the system is built on a foundation model with broad reasoning capabilities, it can answer many different types of interpretability queries from users and design experiments on the fly to investigate them.”
Neuron by neuron
In one example task, a user asks MAIA to describe the concepts that a specific neuron in a vision model is responsible for detecting. To investigate this question, MAIA first uses a tool that retrieves “dataset exemplars” from the ImageNet dataset: the images that maximally activate the neuron. For this example neuron, those images show people in formal attire and close-ups of their chins and necks. MAIA forms different hypotheses about what drives the neuron’s activity: facial expressions, chins, or ties. MAIA then uses its experiment-design tools to test each hypothesis individually by generating and editing synthetic images; in one experiment, adding a bow tie to an image of a human face boosted the neuron’s response. “This approach allows us to pinpoint the specific cause of the neuron’s activity, much like in a real scientific experiment,” says Rott Shaham.
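To make that workflow concrete, here is a minimal, hypothetical sketch of the kind of hypothesis-testing loop described above. The helper names (get_exemplars, edit_image, neuron_activation) are illustrative assumptions standing in for MAIA’s tools, not its actual API.

```python
# Hypothetical sketch of a MAIA-style hypothesis-testing loop.
# The callables passed in stand in for the tools described in the article:
# exemplar retrieval, synthetic image editing, and measuring the neuron's
# response in the model under study.

def rank_hypotheses(neuron, hypotheses, get_exemplars, edit_image, neuron_activation):
    """Score each hypothesis by how much adding that concept boosts the neuron's activation."""
    exemplars = get_exemplars(neuron, k=15)           # images that maximally activate the neuron
    scores = {}
    for hypothesis in hypotheses:                     # e.g. "facial expressions", "chins", "ties"
        deltas = []
        for image in exemplars:
            baseline = neuron_activation(neuron, image)
            edited = edit_image(image, add_concept=hypothesis)   # synthetic intervention
            deltas.append(neuron_activation(neuron, edited) - baseline)
        scores[hypothesis] = sum(deltas) / len(deltas)
    # The hypothesis with the largest average activation gain best explains the neuron.
    return max(scores, key=scores.get), scores
```

In the bow-tie example from the article, a loop of this shape would surface “ties” as the winning hypothesis, since edits adding that concept most increase the neuron’s response.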
MAIA’s explanations of neuron behavior are evaluated in two key ways. First, synthetic systems with known ground-truth behaviors are used to assess the accuracy of MAIA’s interpretations. Second, for “real” neurons inside trained AI systems with no ground-truth descriptions, the authors design a new automated evaluation protocol that measures how well MAIA’s descriptions predict neuron behavior on unseen data.
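As a rough illustration of what a predictive evaluation of this kind could look like, the sketch below scores a textual description by how well it ranks held-out images the same way the real neuron does. The helpers predicted_relevance (for example, a text-image similarity score) and neuron_activation are assumptions for illustration; the paper’s actual protocol may differ.

```python
# Hypothetical sketch of a predictive evaluation for a neuron description:
# a good description should rank unseen images similarly to the neuron itself.

def rank_correlation(xs, ys):
    """Spearman-style rank correlation, computed from scratch (ties ignored for simplicity)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def evaluate_description(description, heldout_images, predicted_relevance, neuron_activation):
    """Higher correlation means the description better predicts the neuron's behavior on unseen data."""
    predicted = [predicted_relevance(description, img) for img in heldout_images]  # text-image score
    actual = [neuron_activation(img) for img in heldout_images]                    # true activations
    return rank_correlation(predicted, actual)
```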
The CSAIL-developed method outperformed baseline methods at describing individual neurons in a variety of vision models, such as ResNet, CLIP, and the vision transformer DINO. MAIA also performed well on a new dataset of synthetic neurons with known ground-truth descriptions. For both the real and synthetic systems, the descriptions were often on par with those written by human experts.
How useful are descriptions of AI system components, such as individual neurons? “Understanding and localizing behaviors within large AI systems is a key part of auditing the security of those systems before they are deployed—in some of our experiments, we show how MAIA can be used to find neurons with undesirable behaviors and remove those behaviors from the model,” Schwettmann says. “We’re building a more resilient AI ecosystem, where the tools for understanding and monitoring AI systems keep up with the scale of the system, allowing us to explore and hopefully understand unforeseen challenges introduced by new models.”
Peering into neural networks
The nascent field of interpretability is maturing into a distinct research area alongside the rise of “black box” machine learning models. How can researchers crack open these models and understand how they work?
Current methods for peering inside these models are typically limited either in the scale or in the precision of the explanations they can generate. Moreover, existing methods tend to be tailored to a specific model and a specific task. This led the researchers to ask: How can we build a general system that helps users answer interpretability questions about AI models, while combining the flexibility of human experimentation with the scalability of automated techniques?
One critical area they wanted to address with this system was bias. To determine whether image classifiers were biased toward specific subcategories of images, the team looked at the last layer of the classification stream (the system designed to sort or label items, much like a machine identifying whether a photo is a dog, cat, or bird) and the probability scores of the input images (the confidence levels the machine assigns to its guesses). To understand potential bias in image classification, MAIA was asked to find a subset of images in specific classes (for example, “Labrador retriever”) that were likely to be mislabeled by the system. In this example, MAIA found that images of black Labradors were likely to be misclassified, suggesting a bias in the model toward yellow-furred retrievers.
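The following sketch illustrates one simple way such a bias probe could be set up: compare the classifier’s confidence in the correct label across subgroups within a single class and flag the subgroup it handles worst. The subgroup labels and helper names are illustrative assumptions, not the paper’s exact procedure.

```python
# Hypothetical sketch of a per-subgroup bias probe for one class
# (e.g. "Labrador retriever" split into "black" and "yellow" subgroups).
from collections import defaultdict

def find_weakest_subgroup(images, subgroups, true_label, classify):
    """classify(image) -> dict of class probabilities. Returns the subgroup on which
    the classifier assigns the lowest average probability to the true label."""
    scores = defaultdict(list)
    for image, subgroup in zip(images, subgroups):
        probs = classify(image)                          # final-layer probability scores
        scores[subgroup].append(probs.get(true_label, 0.0))
    avg = {g: sum(v) / len(v) for g, v in scores.items()}
    weakest = min(avg, key=avg.get)                      # candidate bias: systematically low confidence
    return weakest, avg
```

Under this kind of probe, the black-Labrador subgroup from the article would stand out with a markedly lower average score than the yellow-furred subgroup, signaling the bias MAIA uncovered.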
Because MAIA relies on external tools to design experiments, its performance is limited by the quality of those tools. But as the quality of tools such as image synthesis models improves, so will MAIA. MAIA also sometimes exhibits confirmation bias, incorrectly confirming its initial hypothesis. To mitigate this, the researchers built an image-to-text tool that uses a different instance of the language model to summarize experimental results. Another failure mode is overfitting to a particular experiment, where the model sometimes draws premature conclusions based on minimal evidence.
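As a hedged illustration of that mitigation, the sketch below routes raw experiment outcomes through a separate summarizer (in the article’s description, a second instance of the language model) before the agent revises its hypothesis, so the revision is grounded in a neutral summary rather than the agent’s own expectations. The function names here are assumptions for illustration only.

```python
# Hypothetical sketch of using an independent summarizer to curb confirmation bias:
# the agent updates its hypothesis from a neutral text summary of the raw results,
# not from its own interpretation of the images it generated.

def run_experiment_round(hypothesis, experiments, run_fn, summarize_fn, revise_fn):
    raw_results = [run_fn(e) for e in experiments]     # e.g. edited images plus measured activations
    summary = summarize_fn(raw_results)                # separate model instance describes the results
    return revise_fn(hypothesis, summary)              # hypothesis is revised against the summary only
```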
“I think the natural next step for our lab is to move beyond artificial systems and apply similar experiments to human perception,” says Rott Shaham. “Testing this has traditionally required designing and testing stimuli by hand, which is laborious. With our agent, we can scale up that process by designing and testing multiple stimuli at once. It could also allow us to compare human visual perception with artificial systems.”
“Understanding neural networks is difficult for humans because they have hundreds of thousands of neurons, each with complex behavioral patterns. MAIA helps connect the dots by developing AI agents that can automatically analyze those neurons and communicate their conclusions to humans in a way that’s digestible,” says Jacob Steinhardt, an assistant professor at the University of California, Berkeley, who was not involved in the study. “Scaling these methods could be one of the most important ways to understand and safely supervise AI systems.”
Rott Shaham and Schwettmann are joined on the paper by five other CSAIL affiliates: undergraduate student Franklin Wang; incoming MIT graduate student Achyuta Rajaram; EECS doctoral student Evan Hernandez SM ’22; and EECS professors Jacob Andreas and Antonio Torralba. Their work was supported in part by the MIT-IBM Watson AI Lab, Open Philanthropy, Hyundai Motor Co., the Army Research Laboratory, Intel, the National Science Foundation, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship. The researchers’ findings will be presented at the International Conference on Machine Learning this week.