Monday, December 23, 2024

Hallucinating for better text translation


As children, we babble and imitate our way to learning languages. We do not start by reading raw text, which requires fundamental knowledge and understanding of the world, as well as the advanced ability to interpret and reason about descriptions and relationships. Rather, we begin our linguistic journey slowly, pointing to and interacting with our surroundings, grounding our words and perceiving their meaning in the context of the physical and social world. Eventually, we can craft complete sentences to convey complex ideas.

Similarly, when people begin learning to translate into another language, pairing other sensory information, such as multimedia, with new and unfamiliar words, as with picture flashcards, improves language acquisition and retention. With enough practice, people can then accurately translate new, unseen sentences in context without the accompanying media; even so, imagining a picture based on the original text still helps.

This is the basis of a new machine learning model called VALHALLA, developed by researchers at MIT, IBM, and the University of California, San Diego, in which a trained neural network sees a source sentence in one language, hallucinates an image of what it looks like, and then uses both to translate into the target language. The team found that their method achieves greater accuracy than text-only machine translation. Moreover, it provides an additional boost for long sentences, for under-resourced languages, and for cases where part of the source sentence is inaccessible to the machine translator.

Machine translation, a core task in the field of artificial intelligence-based natural language processing (NLP), is “an incredibly practical technology that millions of people use every day,” says study co-author Yoon Kim, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science who is also affiliated with the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT-IBM Watson AI Lab. With recent significant advances in deep learning, “there has been an interesting development in how one might use non-textual information—for example, images, audio, or other grounding information—to tackle practical language tasks,” says Kim, because “when people perform language processing tasks, we do so in a grounded, situated world.” The team postulated that pairing hallucinated images with text during inference imitates this process, providing context that enables improved performance over current state-of-the-art techniques, which rely on text data alone.

This research will be presented this month at the IEEE/CVF Computer Vision and Pattern Recognition Conference. Kim’s co-authors include UC San Diego doctoral student Yi Li and professor Nuno Vasconcelos, as well as research staff members Rameswar Panda, Chun-fu “Richard” Chen, and Rogerio Feris, and IBM director David Cox of IBM Research and the MIT-IBM Watson AI Lab.

Learning to hallucinate from images

When we learn new languages and learn to translate, we are often given examples and practice before venturing out on our own. The same is true of machine translation systems; however, if images are used during training, these AI methods also require visual aids at test time, which limits their applicability, says Panda.

“In real-life scenarios, you may not have an image with respect to the source sentence. So our motivation was basically: instead of using an external image as input during inference, can we exploit visual hallucination – the ability to imagine visual scenes – to improve machine translation systems?” says Panda.

To do this, the team used an encoder-decoder architecture with two transformers, a type of neural network model suited to sequence-dependent data such as language, which can pay attention to keywords and the semantics of a sentence. One transformer generates a visual hallucination, and the second transformer performs multimodal translation using the outputs of the first.
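For readers who want a more concrete picture, here is a minimal structural sketch of such a dual-transformer setup, written in PyTorch. The module names, the dimensions, and the assumption that images are represented as tokens drawn from a discrete visual codebook are illustrative choices for this sketch, not details of the authors’ implementation.

```python
# Illustrative sketch only -- not the authors' code. Assumes images are
# represented as tokens from a discrete visual codebook.
import torch
import torch.nn as nn

class VisualHallucinationTransformer(nn.Module):
    """Maps source-text tokens to a sequence of discrete image-token logits."""
    def __init__(self, vocab_size, image_codebook_size, d_model=256, n_image_tokens=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_image_tokens = nn.Linear(d_model, image_codebook_size)
        self.n_image_tokens = n_image_tokens

    def forward(self, src_tokens):
        h = self.encoder(self.text_embed(src_tokens))           # (B, T, d)
        pooled = h.mean(dim=1, keepdim=True)                     # crude pooling for the sketch
        logits = self.to_image_tokens(pooled).repeat(1, self.n_image_tokens, 1)
        return logits                                            # (B, n_img, codebook)

class MultimodalTranslationTransformer(nn.Module):
    """Translates from concatenated text + image tokens into the target language."""
    def __init__(self, vocab_size, image_codebook_size, d_model=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_embed = nn.Embedding(image_codebook_size, d_model)
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, image_tokens, tgt_tokens):
        # The translation transformer does not care whether image_tokens came
        # from a real image or from the hallucination transformer.
        memory_in = torch.cat([self.text_embed(src_tokens),
                               self.image_embed(image_tokens)], dim=1)
        decoded = self.transformer(memory_in, self.text_embed(tgt_tokens))
        return self.out(decoded)                                  # (B, T_tgt, vocab)
```

The key design point in this sketch is that the translation transformer consumes image tokens the same way regardless of whether they come from a real image or from the hallucination module.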

During training, there are two streams of translation: a source sentence paired with its ground-truth image, and the same source sentence paired with a visually hallucinated image. First, the ground-truth image and the sentence are tokenized into representations that the transformers can handle; in the case of the sentence, each word is a token. The source sentence is tokenized again, but this time it passes through the visual hallucination transformer, which outputs a hallucinated, discrete image representation of the sentence. The researchers incorporated an autoregression that compares the ground-truth and hallucinated representations for consistency – e.g., with homonyms: a reference to the animal “bat” should not be hallucinated as a baseball bat. The hallucination transformer then uses the difference between the two to optimize its predictions and visual output, making sure the context stays consistent.

The two sets of tokens are then passed simultaneously through the multimodal translation transformer, each containing the sentence representation and either the hallucinated or the ground-truth image. The tokenized translation outputs are compared, with the goal of being similar to each other and to the target sentence in the other language. Any differences are then fed back to the translation transformer for further optimization.
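Building on the sketch above, the two training streams and their losses could be pictured roughly as follows. The specific loss terms, their equal weighting, and the use of argmax in place of a differentiable discretization are simplifying assumptions for illustration, not the paper’s exact objective.

```python
# Sketch of one training step with the two streams described above.
# A real system would use a differentiable discretization (e.g. Gumbel-softmax)
# instead of argmax, plus proper target shifting and padding masks.
import torch.nn.functional as F

def training_step(halluc, translator, src, true_image_tokens, tgt):
    # Stream 1: hallucinate discrete image tokens from the source sentence alone.
    halluc_logits = halluc(src)                                  # (B, n_img, codebook)
    halluc_tokens = halluc_logits.argmax(dim=-1)                 # discrete image tokens

    # Consistency term: hallucinated tokens should match the ground-truth image tokens.
    consistency_loss = F.cross_entropy(
        halluc_logits.transpose(1, 2), true_image_tokens)

    # Stream 2: translate once with the ground-truth image, once with the hallucination.
    logits_true = translator(src, true_image_tokens, tgt)
    logits_hall = translator(src, halluc_tokens, tgt)

    # Both streams are pushed toward the target sentence...
    translation_loss = (F.cross_entropy(logits_true.transpose(1, 2), tgt) +
                        F.cross_entropy(logits_hall.transpose(1, 2), tgt))
    # ...and toward agreeing with each other (a simple KL term as a stand-in).
    agreement_loss = F.kl_div(F.log_softmax(logits_hall, dim=-1),
                              F.softmax(logits_true, dim=-1), reduction="batchmean")

    return consistency_loss + translation_loss + agreement_loss
```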

At test time, the ground-truth image stream is dropped, since images would most likely not be available in everyday scenarios.
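Continuing the same illustrative sketch, with a hypothetical greedy_decode helper standing in for autoregressive decoding, the inference path keeps only the hallucinated stream:

```python
# Inference sketch: no real image is available, so the model translates
# from the source sentence plus its own hallucinated image tokens.
def translate(halluc, translator, src, greedy_decode):
    halluc_tokens = halluc(src).argmax(dim=-1)   # "imagine" the scene
    # greedy_decode is a hypothetical helper that runs the translation
    # transformer autoregressively over the target vocabulary.
    return greedy_decode(translator, src, halluc_tokens)
```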

“To our knowledge, we haven’t seen any work that actually uses a hallucinatory transformer in combination with a multimodal translation system to improve machine translation performance,” Panda says.

Visualizing the target text

To test their method, the team compared VALHALLA with other state-of-the-art multimodal and text-only translation methods. They used public benchmark datasets containing ground-truth images paired with source sentences, as well as a text-only news article translation dataset. The researchers measured its performance on 13 tasks, spanning high-resource languages (such as English, German, and French), under-resourced languages (such as English to Romanian), and non-English pairs (such as Spanish to French). The group also tested transformer models of different sizes, how accuracy changed with sentence length, and translation in a limited textual context, where parts of the text were hidden from the machine translators.

The team observed significant improvements over text-only translation methods, improved data efficiency, and that smaller models performed better than the larger base model. As sentences grew longer, VALHALLA’s advantage over the other methods increased, which the researchers attributed to the presence of more ambiguous words. In cases where part of the sentence was masked, VALHALLA was able to recover and translate the original text, which surprised the team.

There were further unexpected findings: “Where there weren’t as many training [image and] text pairs, [like for under-resourced languages], the improvements were more significant, indicating that grounding in images helps in low-data regimes,” says Kim. “Another thing that was quite surprising to me was the improved performance, even on types of text that aren’t necessarily easy to connect with images. Maybe this isn’t so surprising if it helps translate visually expressive sentences like ‘there’s a red car in front of the house.’ [However], even in text-only [news article] domains, the approach improved over text-only systems.”

While VALHALLA works well, the researchers note its limitations: it requires pairs of sentences annotated with an image, which can make the data more expensive to obtain. It also performs better in its grounded domain than on text-only news articles. Moreover, as Kim and Panda note, a technique like VALHALLA is still a black box; it is assumed that the hallucinated images provide useful information, and the team plans to investigate what and how the model learns in order to validate their methods.

In the future, the team plans to explore other ways to improve the translation. “Here we are focusing solely on images, but there are other types of multimodal information—for example, speech, video, or touch, or other sensory modalities,” Panda says. “We believe that such multimodal grounding can lead to even more efficient machine translation models, potentially benefiting translation into many of the world’s low-resource languages.”

This research was supported in part by the MIT-IBM Watson AI Lab and the National Science Foundation.
