Wednesday, May 14, 2025

Study shows vision-language models can't handle queries with negation words

Imagine a radiologist examining a chest X-ray from a new patient. She notices that the patient has swelling in the tissue but does not have an enlarged heart. Looking to speed up the diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.

But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: if a patient has tissue swelling and an enlarged heart, the condition is very likely to be cardiac related, but with no enlarged heart there could be several underlying causes.

In a new study, MIT researchers have found that vision-language models are extremely likely to make such a mistake in real-world situations because they do not understand negation, words like “no” and “doesn’t” that specify what is false or absent.

“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Kumail Alhamoud, an MIT graduate student and lead author of this study.

The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed no better than a random guess. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.

They show that retraining a vision-language model with this dataset leads to performance improvements when the model is asked to retrieve images that do not contain certain objects. It also boosts accuracy on multiple-choice question answering with negated captions.

But the researchers caution that more work is needed to address the root causes of this problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in high-stakes settings where these models are currently being used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.

“This is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now, without intensive evaluation,” says senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems.

Ghassemi and Alhamoud are joined on the paper by Shaden Alshammari, an MIT graduate student; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at Oxford University; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Neglecting negation

Vision-language models (VLMs) are trained using huge collections of images and corresponding captions, which they learn to encode as sets of numbers, called vector representations. The models use these vectors to distinguish between different images.

A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.
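As a rough illustration of how such a dual-encoder setup scores an image against candidate captions, here is a minimal sketch in PyTorch. The toy encoders, embedding size, and random inputs are illustrative assumptions, not the models studied in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 64  # toy embedding size; real VLMs use much larger vectors


class ToyImageEncoder(nn.Module):
    """Stand-in image encoder: flattens a small image and projects it to a vector."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, EMBED_DIM)

    def forward(self, images):
        return self.proj(images.flatten(start_dim=1))


class ToyTextEncoder(nn.Module):
    """Stand-in text encoder: averages token embeddings and projects to a vector."""
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED_DIM)
        self.proj = nn.Linear(EMBED_DIM, EMBED_DIM)

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids).mean(dim=1))


image_encoder = ToyImageEncoder()
text_encoder = ToyTextEncoder()

# One toy image and two candidate captions (already tokenized as integer IDs),
# e.g. "a dog jumping over a fence" vs. "a scene with no dog".
image = torch.rand(1, 3, 32, 32)
captions = torch.randint(0, 1000, (2, 8))

# Both encoders map into the same space; cosine similarity ranks the captions.
img_vec = F.normalize(image_encoder(image), dim=-1)
txt_vecs = F.normalize(text_encoder(captions), dim=-1)
scores = img_vec @ txt_vecs.T  # higher score = caption judged a better match
print(scores)
```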

“The captions express what is in the images; they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters,’” says Ghassemi.

Because the image-caption datasets don’t contain examples of negation, VLMs never learn to identify it.

To dig deeper into this problem, the researchers designed two benchmark tasks that test the ability of VLMs to understand negation.

For the first, they used a large language model (LLM) to re-caption images in an existing dataset by asking the LLM to think about related objects that are not in an image and write them into the caption. Then they tested the models by prompting them with negation words to retrieve images that contain certain objects, but not others.
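A hedged sketch of how such a retrieval test could be scored: given embeddings of negated text queries and a pool of image embeddings, rank the pool by similarity and check whether the correct image appears among the top-k results. The recall@k metric and the random toy data below are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np


def recall_at_k(query_vecs, image_vecs, correct_idx, k=5):
    """Fraction of queries whose ground-truth image lands in the top-k results.

    query_vecs:  (num_queries, dim) embeddings of negated text queries,
                 e.g. "a street scene with cars but no pedestrians"
    image_vecs:  (num_images, dim) embeddings of the image pool
    correct_idx: (num_queries,) index of the image each query should retrieve
    """
    # Normalize so the dot product is cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = q @ v.T                            # (num_queries, num_images)
    topk = np.argsort(-scores, axis=1)[:, :k]   # best k images per query
    hits = [correct_idx[i] in topk[i] for i in range(len(correct_idx))]
    return float(np.mean(hits))


# Toy example with random embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(10, 64))
images = rng.normal(size=(100, 64))
truth = rng.integers(0, 100, size=10)
print(recall_at_k(queries, images, truth, k=5))
```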

For the second, they designed multiple-choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that doesn’t appear in the image or negating an object that does appear in the image.
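In the same spirit, the multiple-choice task can be framed as picking the highest-scoring caption for each image. The sketch below assumes precomputed embeddings and is only meant to show how accuracy could be computed, not the paper's exact setup.

```python
import numpy as np


def multiple_choice_accuracy(image_vecs, option_vecs, correct_option):
    """Score each image against its caption options and pick the best one.

    image_vecs:     (num_questions, dim) one embedding per image
    option_vecs:    (num_questions, num_options, dim) candidate caption embeddings,
                    differing only by an added or negated object mention
    correct_option: (num_questions,) index of the true caption
    """
    img = image_vecs / np.linalg.norm(image_vecs, axis=-1, keepdims=True)
    opt = option_vecs / np.linalg.norm(option_vecs, axis=-1, keepdims=True)
    scores = np.einsum("qd,qod->qo", img, opt)  # similarity of each option to its image
    predictions = scores.argmax(axis=1)
    return float((predictions == correct_option).mean())


# Toy example: 20 questions, 4 caption options each.
rng = np.random.default_rng(1)
acc = multiple_choice_accuracy(
    rng.normal(size=(20, 64)),
    rng.normal(size=(20, 4, 64)),
    rng.integers(0, 4, size=20),
)
print(acc)  # random embeddings hover around chance (0.25)
```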

The models often failed at both tasks, with image retrieval performance dropping by nearly 25 percent with negated captions. When it came to answering multiple-choice questions, the best models only achieved about 39 percent accuracy, with several models performing at or even below random chance.

One reason for this failure is a shortcut the researchers call affirmation bias: VLMs ignore negation words and focus on objects in the images instead.

“This does not just happen for words like ‘no’ and ‘not.’ Regardless of how you express negation or exclusion, the models will simply ignore it,” Alhamoud says.

This was consistent across every VLM they tested.

A solvable problem

Since VLMs aren’t typically trained on image captions with negation, the researchers developed datasets with negation words as a first step toward solving the problem.

Using a dataset with 10 million image-text caption pairs, they prompted an LLM to propose related captions that specify what is excluded from the images, yielding new captions with negation words.
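That caption-augmentation step could be sketched roughly as follows. The prompt wording, the `generate` callable standing in for an LLM API, and the example object list are all assumptions for illustration; the paper's actual prompts and pipeline may differ.

```python
from typing import Callable, List

PROMPT_TEMPLATE = (
    "Here is an image caption: \"{caption}\"\n"
    "Objects that are NOT present in the image: {absent}.\n"
    "Rewrite the caption so it naturally mentions that one of these objects "
    "is absent, using a negation word (e.g., 'no', 'without', 'but not')."
)


def make_negated_caption(
    caption: str,
    absent_objects: List[str],
    generate: Callable[[str], str],
) -> str:
    """Ask an LLM (passed in as `generate`) to rewrite a caption with negation."""
    prompt = PROMPT_TEMPLATE.format(caption=caption, absent=", ".join(absent_objects))
    return generate(prompt)


# Example usage with a stubbed-out "LLM" so the script runs on its own.
def fake_llm(prompt: str) -> str:
    return "A dog jumping over a fence, with no people nearby."


print(make_negated_caption(
    "A dog jumping over a fence.",
    ["people", "helicopters"],
    generate=fake_llm,
))
```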

They had to be especially careful that these synthetic captions still read naturally, or they could cause a VLM to fail in the real world when faced with more complex captions written by humans.

They found that finetuning VLMs with their dataset led to performance gains across the board. It improved the models’ image retrieval abilities by about 10 percent, while also boosting performance on the multiple-choice question answering task by about 30 percent.

“But our solution is not perfect. We are just recaptioning datasets, a form of data augmentation. We haven’t even touched how these models work, but we hope this is a signal that this is a solvable problem, and others can take our solution and improve it,” Alhamoud says.

At the same time, he hopes their work encourages more users to think carefully about the problem they want to use a VLM to solve and to design some examples to test it before deployment.
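One lightweight way to follow that advice is to probe a candidate model with a handful of paired affirmative and negated captions before deployment, checking that the negated version actually scores lower on images that contain the object. The sketch below keeps the model behind an abstract `score` callable, since it is not tied to any specific VLM; the probe texts and the stub scorer are illustrative placeholders.

```python
from typing import Callable

# Each probe: (description of test image, caption that matches, caption that should NOT match)
PROBES = [
    ("photo containing a dog", "a photo of a dog", "a photo with no dog"),
    ("chest X-ray with tissue swelling", "tissue swelling is present", "no tissue swelling is present"),
]


def check_negation(score: Callable[[str, str], float]) -> None:
    """Flag cases where the model rates a negated caption as highly as the true one."""
    for image, positive, negated in PROBES:
        pos, neg = score(image, positive), score(image, negated)
        status = "OK" if pos > neg else "WARNING: possible affirmation bias"
        print(f"{image}: positive={pos:.2f}, negated={neg:.2f} -> {status}")


# Stub scorer so the example runs; replace with a real image-text similarity call.
check_negation(lambda image, caption: 0.9 if "no " not in caption else 0.4)
```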

Moving forward, the researchers could expand upon this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop additional datasets that include image-caption pairs for specific applications, such as health care.
