In high-stakes applications such as medical diagnostics, users often want to know what prompted a computer vision model to make a particular prediction so they can determine whether its results can be trusted.
Concept bottleneck modeling is one way to make artificial intelligence systems explain their decision-making. These methods force a deep learning model to use a set of human-understandable concepts when making predictions. In new research, computer scientists at MIT have developed a method that nudges a model toward greater accuracy and clearer, more concise explanations.
The concepts a model uses are usually defined in advance by experts. For example, a doctor might suggest terms such as “clustered brown dots” and “varied pigmentation” to predict that a medical image shows melanoma.
However, predefined concepts may be irrelevant or lack sufficient detail for a specific task, reducing the model’s accuracy. The new method instead extracts concepts the model has already learned during training for that task and forces the model to use them, producing better explanations than standard concept bottleneck models.
This approach uses a pair of specialized machine learning models that automatically extract knowledge from the target model and translate it into plain-language concepts. Ultimately, the technique can transform any previously trained computer vision model into one that uses concepts to explain its reasoning.
“In a sense, we want to be able to read the minds of these computer vision models. The concept bottleneck model is one way for users to tell what the model is thinking and why it made a certain prediction. Because our method uses better concepts, it can lead to greater accuracy and ultimately improve the accountability of black box AI models,” says lead author Antonio De Santis, a graduate student at the Polytechnic University of Milan, who completed this research while a visiting graduate student at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
De Santis is joined on a paper about the work by Schrasing Tong SM ’20, PhD ’26; Marco Brambilla, professor of computer science and engineering at the Politecnico di Milano; and senior author Lalana Kagal, principal investigator at CSAIL. The research will be presented at the International Conference on Learning Representations.
Creating a better bottleneck
Concept bottleneck models (CBMs) are a popular approach to improving the explainability of AI. These techniques add an intermediate step in which the computer vision model predicts the concepts present in an image and then uses those concepts to make its final prediction.
This intermediate step, or bottleneck, helps users understand the model’s reasoning.
For example, a model identifying bird species could select terms such as “yellow legs” and “blue wings” before predicting a barn swallow.
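The two-step structure described above can be sketched in a few lines of code. This is a minimal illustration, not the researchers' implementation: the concept names, label names, and random weights are all stand-ins, and a real CBM would learn both weight matrices from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 512-d image embedding, four human-readable
# concepts, and three bird species. All weights are random stand-ins.
CONCEPTS = ["yellow legs", "blue wings", "forked tail", "red crown"]
SPECIES = ["barn swallow", "blue jay", "cardinal"]

W_concept = rng.normal(size=(512, len(CONCEPTS)))   # image features -> concept logits
W_label = rng.normal(size=(len(CONCEPTS), len(SPECIES)))  # concepts -> species logits

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbm_predict(image_embedding):
    # Step 1 (the bottleneck): score each concept independently.
    concept_scores = sigmoid(image_embedding @ W_concept)
    # Step 2: the final label is computed *only* from the concept scores,
    # so the prediction can be explained in terms of those concepts.
    species_logits = concept_scores @ W_label
    return concept_scores, SPECIES[int(np.argmax(species_logits))]

scores, label = cbm_predict(rng.normal(size=512))
for name, s in zip(CONCEPTS, scores):
    print(f"{name}: {s:.2f}")
print("prediction:", label)
```

Because the classifier in step 2 sees only the concept scores, a user can inspect which concepts fired to understand why a particular species was predicted.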
However, because these concepts are often generated in advance by humans or large language models (LLMs), they may not suit a specific task. Furthermore, even when a model has a set of predefined concepts, it sometimes still relies on unwanted learned information, a problem known as information leakage.
“These models are trained to maximize performance, so the model may secretly use concepts that we are not aware of,” De Santis explains.
The MIT researchers had a different idea: because the model was trained on a huge amount of data, it has likely already learned the concepts needed to make accurate predictions for a specific task. They set out to build a CBM by extracting that existing knowledge and transforming it into human-readable text.
In the first step of their method, a specialized deep learning model called a sparse autoencoder selects the most relevant features the target model has learned and distills them into a set of concepts. A multimodal LLM then describes each concept in plain language.
The multimodal LLM also annotates the images in the dataset, identifying which concepts are present and which are absent in each image. The researchers use this annotated dataset to train a concept bottleneck module to recognize the concepts.
They incorporate this module into the target model, forcing it to make predictions using only the set of learned concepts they extracted.
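The sparse autoencoder step can be sketched as follows. This is a toy illustration under assumed sizes, not the paper's architecture: it shows the general idea of encoding a model's internal activations into a small number of active "concept" directions and decoding them back, with all weights random rather than trained to minimize reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 256-d hidden activations from the vision model,
# decomposed into 32 candidate concept directions, of which at most K
# may fire per image (the sparsity constraint).
D_MODEL, N_CONCEPTS, K = 256, 32, 4

W_enc = rng.normal(scale=0.1, size=(D_MODEL, N_CONCEPTS))
W_dec = rng.normal(scale=0.1, size=(N_CONCEPTS, D_MODEL))

def sparse_encode(activation):
    """Keep only the K strongest concept activations (top-k sparsity)."""
    z = np.maximum(activation @ W_enc, 0.0)  # ReLU concept activations
    keep = np.argsort(z)[-K:]                # indices of the K largest
    sparse_z = np.zeros_like(z)
    sparse_z[keep] = z[keep]
    return sparse_z

def reconstruct(sparse_z):
    """Decode the sparse concept code back toward the original activation."""
    return sparse_z @ W_dec

act = rng.normal(size=D_MODEL)
code = sparse_encode(act)
recon = reconstruct(code)
print("active concepts:", int((code > 0).sum()), "of", N_CONCEPTS)
```

In training, the encoder and decoder weights would be optimized so that the reconstruction closely matches the original activation; the sparsity pressure is what encourages each surviving direction to correspond to a coherent, nameable concept.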
Controlling the concept
In developing this method, they overcame many challenges, from verifying the validity of the LLM-annotated concepts to determining whether the sparse autoencoder identified human-readable concepts.
To prevent the model from using unfamiliar or undesirable concepts, they limit it to five concepts per prediction. This constraint also forces the model to choose the most relevant concepts and makes explanations easier to understand.
When they compared their approach to state-of-the-art CBMs on tasks such as predicting bird species and identifying skin lesions in medical images, their method achieved the highest accuracy while providing more precise explanations.
Their approach also generated concepts that were more applicable to the images in the dataset.
“We have shown that extracting concepts from the original model can outperform other CBMs, but there is still a trade-off between interpretability and accuracy that needs to be addressed. Uninterpretable black-box models still outperform ours,” says De Santis.
In the future, the researchers want to explore potential solutions to the information leakage problem, perhaps by adding additional concept bottleneck modules to prevent unwanted concepts from leaking. They also plan to scale up their method by using a larger multimodal LLM to describe a larger training dataset, which could improve performance.
“I’m excited about this work because it pushes interpretable AI in a very promising direction and creates a natural bridge to symbolic AI and knowledge graphs,” says Andreas Hotho, professor and head of the Department of Data Science at the University of Würzburg, who was not involved in the work. “By extracting conceptual bottlenecks from the internal workings of the model, rather than just from human-defined concepts, it offers a path to explanations that are more faithful to the model and opens up many opportunities for further work with structured knowledge.”
This research was supported by a Progetto Rocca PhD scholarship, the Italian Ministry of Universities and Research under the National Recovery and Resilience Plan, Thales Alenia Space and the European Union under the NextGenerationEU project.
