Thursday, May 8, 2025

New algorithm discovers language just by watching videos


Mark Hamilton, an MIT graduate student in electrical engineering and computer science and a member of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), wants to use machines to understand how animals communicate. To do that, he set out first to build a system that can learn human language “from scratch”.

“Funnily enough, the key moment of inspiration came from the movie March of the Penguins. There’s a scene where a penguin falls while crossing the ice and lets out a little groan while getting up. When you watch it, it’s almost obvious that the groan is standing in for a four-letter word. That was the moment we thought maybe we should use audio and video to learn language,” says Hamilton. “Is there a way we could let an algorithm watch TV all day long and figure out what we’re talking about?”

“Our model, DenseAV, aims to learn language by predicting what it sees from what it hears, and vice versa. For example, if you hear the sound of someone saying ‘bake the cake at 350 degrees,’ chances are you are seeing a cake or an oven. To succeed at this audio-video matching game across millions of videos, the model has to learn what people are talking about,” says Hamilton.

After training DenseAV on this matching game, Hamilton and his colleagues examined which pixels the model looked at when it heard a sound. For example, when someone says “dog,” the algorithm immediately starts searching for dogs in the video stream. By checking which pixels the algorithm singles out, you can discover what it thinks a word means.
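
To make the pixel-probing idea concrete, here is a minimal sketch in PyTorch. It assumes a model that already produces per-time-step audio features and per-patch image features; the function name word_heatmap, the shapes, and the simple averaging are illustrative assumptions, not DenseAV’s actual interface.

```python
# Illustrative probe with hypothetical feature extractors (not DenseAV's real API).
import torch
import torch.nn.functional as F

def word_heatmap(audio_feats, image_feats, grid=14, out_size=224):
    """audio_feats: (T, D) features for a spoken clip, e.g. the word "dog".
    image_feats: (grid*grid, D) features for one video frame.
    Returns an (out_size, out_size) map of how strongly each pixel responds."""
    sim = audio_feats @ image_feats.t()                  # (T, grid*grid)
    patch_response = sim.mean(dim=0).view(1, 1, grid, grid)
    # Upsample the coarse patch grid to image resolution for visualization.
    return F.interpolate(patch_response, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)[0, 0]

heat = word_heatmap(torch.randn(20, 128), torch.randn(14 * 14, 128))
print(heat.shape)  # torch.Size([224, 224])
```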

Intriguingly, a similar search happens when DenseAV listens to a dog barking: it looks for a dog in the video stream. “This piqued our interest. We wanted to see whether the algorithm could tell the difference between the word ‘dog’ and a dog’s bark,” says Hamilton. The team explored this question by giving DenseAV a “double-sided brain,” and found that one side naturally focuses on language, such as the word “dog,” while the other focuses on sounds, such as barking. This showed that DenseAV not only learned the meanings of words and the locations of sounds, but also learned to distinguish between these kinds of cross-modal connections, all without human intervention or any knowledge of written language.
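
As an architectural caricature, the “double-sided brain” can be pictured as two parallel heads on the audio branch, as in the hypothetical sketch below; in DenseAV the division of labor between the heads emerges during training rather than being hand-assigned, and the real model is far larger.

```python
# Hypothetical two-headed audio branch (illustrative names and sizes only).
import torch
import torch.nn as nn

class TwoHeadedAudioBranch(nn.Module):
    def __init__(self, in_dim=512, out_dim=128):
        super().__init__()
        self.backbone = nn.Linear(in_dim, 256)  # shared stand-in encoder
        self.head_a = nn.Linear(256, out_dim)   # one head tends to track spoken words
        self.head_b = nn.Linear(256, out_dim)   # the other tends to track ambient sounds

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return self.head_a(h), self.head_b(h)

branch = TwoHeadedAudioBranch()
speech_like, sound_like = branch(torch.randn(4, 512))
```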

One application is learning from the huge volume of video posted to the internet every day: “We want systems that can learn from massive amounts of video content, such as instructional videos,” says Hamilton. “Another exciting application is understanding new languages, like dolphin or whale communication, which have no written form. We hope that DenseAV can help us understand these languages that have historically eluded translation efforts. Finally, we hope that this method can be used to discover patterns between other pairs of signals, for example the seismic sounds the Earth makes and its geological structure.”

The team faced a major challenge: learning language without any text input. Their goal was to rediscover the meaning of language from a blank slate, avoiding the use of pre-trained language models. This approach is inspired by the way children learn, by observing and listening to their surroundings to understand language.

To achieve this, DenseAV uses two main components to process audio and visual data separately. This separation made it impossible for the algorithm to cheat by letting the visual side look at the audio, and vice versa; it forced the algorithm to recognize objects on its own and produced detailed, meaningful features for both audio and visual signals. DenseAV learns by comparing pairs of audio and visual signals to find which signals match and which do not. This method, called contrastive learning, requires no labeled examples and lets DenseAV discover the important predictive patterns of language itself.
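
The sketch below illustrates this general contrastive recipe: two independent encoders, one shared embedding space, and a loss that rewards matched audio-image pairs over mismatched ones. The tiny encoders, shapes, and the InfoNCE-style loss are placeholders, much simpler than DenseAV’s actual architecture and objective.

```python
# Minimal sketch of contrastive audio-visual training (placeholder encoders,
# not DenseAV's actual architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAVModel(nn.Module):
    """Two independent branches: one embeds audio, one embeds video frames.
    Neither branch sees the other's input, so matching can only succeed
    if each learns meaningful features on its own."""
    def __init__(self, dim=128):
        super().__init__()
        self.audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(16000, dim))
        self.visual_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))

    def forward(self, audio, frames):
        a = F.normalize(self.audio_encoder(audio), dim=-1)
        v = F.normalize(self.visual_encoder(frames), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    """InfoNCE-style loss: each audio clip should match its own frame and
    mismatch every other frame in the batch (and vice versa); no labels needed."""
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))         # true pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy batch: 8 one-second clips at 16 kHz paired with 64x64 RGB frames.
model = TinyAVModel()
a, v = model(torch.randn(8, 16000), torch.randn(8, 3, 64, 64))
loss = contrastive_loss(a, v)
loss.backward()
```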

One of the main differences between DenseAV and previous algorithms is that prior work focused on a single notion of similarity between an entire audio clip and an entire image. A whole clip of someone saying “the dog sat on the grass,” for example, was matched to a whole image of a dog. This kept earlier methods from discovering fine-grained details, such as the connection between the word “grass” and the grass beneath the dog. The team’s algorithm instead searches for and aggregates all possible matches between an audio clip and an image’s pixels. This not only improved performance, but allowed the team to precisely localize sounds in a way that previous algorithms could not. “Conventional methods use a single class token, but our approach compares every pixel and every second of sound. This fine-grained method lets DenseAV make more detailed connections for better localization,” says Hamilton.
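
The difference between the two aggregation schemes can be sketched in a few lines, again with made-up feature shapes and a max-then-mean aggregation chosen for illustration rather than taken from the paper.

```python
# Illustrative comparison of "global" vs. "dense" matching (made-up shapes).
import torch

audio_feats = torch.randn(50, 128)       # 50 audio time steps, 128-dim features
image_feats = torch.randn(14 * 14, 128)  # a 14x14 grid of image patches

# Single-token style: collapse each modality to one vector and compare once.
global_score = audio_feats.mean(dim=0) @ image_feats.mean(dim=0)

# Dense style: compare every time step with every patch, then aggregate the
# best matches, so a word like "grass" can bind to just the grassy patches.
sim = audio_feats @ image_feats.t()          # (50, 196) pairwise similarities
dense_score = sim.max(dim=1).values.mean()   # best patch per time step, averaged

# The same matrix also localizes each sound: its strongest-responding patch.
best_patch_per_step = sim.argmax(dim=1)
```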

The researchers trained DenseAV on AudioSet, which contains 2 million YouTube videos. They also created new datasets to test how well the model can link sounds and images. In these tests, DenseAV outperformed other leading models at tasks like identifying objects from their names and sounds, proving its effectiveness. “Previous datasets only supported coarse evaluations, so we created a dataset using semantic segmentation datasets. This gives us pixel-precise annotations for evaluating our model’s performance exactly. We can prompt the algorithm with specific sounds or images and get detailed localizations,” says Hamilton.
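
One plausible way to score such prompted localizations against segmentation masks is the average precision of a predicted heatmap with respect to the ground-truth object mask, as sketched below; this is an assumed metric for illustration, and the paper’s exact evaluation protocol may differ.

```python
# Sketch of a mask-based localization score (assumed metric, not the paper's exact protocol).
import torch

def average_precision(heatmap, mask):
    """heatmap: (H, W) predicted response to a prompted word or sound.
    mask: (H, W) binary ground-truth segmentation of the matching object."""
    scores = heatmap.flatten()
    labels = mask.flatten().float()
    order = scores.argsort(descending=True)   # rank pixels by predicted response
    labels = labels[order]
    true_positives = labels.cumsum(dim=0)
    precision = true_positives / torch.arange(1, labels.numel() + 1)
    recall_step = labels / labels.sum()       # recall gained at each positive pixel
    return (precision * recall_step).sum()    # area under the precision-recall curve

heat = torch.rand(224, 224)                   # e.g. the output of a heatmap-style probe
mask = torch.rand(224, 224) > 0.8             # stand-in for a semantic segmentation label
print(float(average_precision(heat, mask)))
```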

Because of the huge amount of data involved, the project took about a year to complete. The team says that moving to a large transformer architecture was challenging, since such models can easily overlook fine-grained details; getting the model to focus on those details was a significant hurdle.

Looking to the future, the team aims to create systems that can learn from massive amounts of video-only or audio-only data. This is crucial for new domains that have lots of one modality but not both together. They also aim to scale this approach up using larger backbones, and possibly to integrate knowledge from language models to improve performance.

“Recognizing and segmenting visual objects in images, as well as ambient sounds and spoken words in audio recordings, are each difficult problems in their own right. Historically, researchers have relied on expensive human annotations to train machine learning models to perform these tasks,” says David Harwath, an assistant professor of computer science at the University of Texas at Austin, who was not involved in the work. “DenseAV makes significant progress toward methods that learn to solve these tasks simultaneously by simply observing the world through sight and sound, based on the insight that the things we see and interact with often make sounds, and that we use spoken language to talk about them. The model also makes no assumptions about the specific language being spoken and can therefore, in principle, learn from data in any language. It would be exciting to see what DenseAV could learn by scaling it up to thousands or millions of hours of video data across many languages.”

Additional authors on a paper describing the work are Andrew Zisserman, professor of computer vision engineering at the University of Oxford; John R. Hershey, a researcher at Google AI Perception; and William T. Freeman, professor of electrical engineering and computer science at MIT and a principal investigator at CSAIL. Their research was supported in part by the US National Science Foundation, a Royal Society Research Professorship, and a Visual AI grant from the EPSRC programme. The work will be presented this month at the IEEE/CVF Computer Vision and Pattern Recognition Conference.
