Friday, May 9, 2025

Looking for a specific action in a video? This AI-powered method can find it for you

The Internet is full of instructional videos that can teach curious viewers everything from how to cook the perfect pancake to how to perform the life-saving Heimlich maneuver.

However, determining when and where a specific action takes place in a long video can be tedious. To streamline this process, scientists are trying to teach computers to perform the task automatically. Ideally, a user could simply describe the action they’re looking for, and the AI model would jump to its location in the video.
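
As a rough, self-contained illustration of what “jumping to the location” means computationally, the sketch below scores every second of a video against a text query embedding and returns the best-matching moment. The encoders here are random stand-ins for a real video/text model, not the researchers’ system.

```python
# Illustrative sketch of text-to-video moment retrieval: score every second of a
# video against a query embedding and return the best match. The embeddings are
# random stand-ins for a real video/text encoder.
import numpy as np

rng = np.random.default_rng(0)

def embed_text(query: str) -> np.ndarray:
    """Stand-in for a text encoder (e.g., a sentence-embedding model)."""
    vec = rng.normal(size=256)
    return vec / np.linalg.norm(vec)

def embed_video_seconds(num_seconds: int) -> np.ndarray:
    """Stand-in for per-second video embeddings from a video encoder."""
    feats = rng.normal(size=(num_seconds, 256))
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def locate(query: str, video_feats: np.ndarray) -> int:
    """Return the second whose embedding is most similar to the query."""
    scores = video_feats @ embed_text(query)
    return int(np.argmax(scores))

video_feats = embed_video_seconds(600)   # a ten-minute video
best_second = locate("flip the pancake", video_feats)
print(f"best match around {best_second // 60}:{best_second % 60:02d}")
```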

However, teaching machine learning models how to do this typically requires vast amounts of pricey video data that has been carefully and manually labeled.

A novel, more effective approach developed by researchers at MIT and the MIT-IBM Watson AI Lab involves training a model to perform this task, called spatio-temporal grounding, using only videos and their automatically generated transcripts.

Scientists train the model to understand unlabeled video in two different ways: by looking at the fine details to learn where objects are (spatial information), and by looking at the bigger picture to understand when the action occurs (temporal information).

Compared to other AI approaches, their method more accurately identifies actions in longer videos containing many actions. Interestingly, they found that training on spatial and temporal information simultaneously makes the model better at identifying each individually.

In addition to streamlining online learning and virtual training processes, this technique can also be useful in healthcare settings, for example by quickly finding key moments in videos of diagnostic procedures.

“We disentangle the challenge of trying to encode spatial and temporal information all at once and instead think of it as two experts working on their own, which turns out to be a more explicit way of encoding the information. Our model, which combines these two distinct branches, provides the best performance,” says Brian Chen, lead author of a paper about this technique.

Chen, a 2023 graduate of Columbia University who conducted this research while a visiting student at the MIT-IBM Watson AI Lab, was joined on the paper by James Glass, senior research scientist, member of the MIT-IBM Watson AI Lab, and head of the Spoken Language Systems Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kuehne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and others at MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Global and local learning

Scientists typically teach models to perform spatio-temporal grounding using videos in which humans have annotated the start and end times of particular tasks.

Not only is this data expensive to generate, but it can be difficult for people to know exactly what to flag. If the activity is “cooking a pancake”, does it start when the chef begins mixing the batter, or when the batter is poured into the pan?

“This time the task might be about cooking, but next time it might be about repairing a car. There are so many different domains for people to annotate. But if we can learn everything without labels, that’s a more general solution,” Chen says.

In their approach, researchers use unlabeled instructional videos and accompanying text transcripts from sites such as YouTube as training data. They do not require any special preparation.

They divided the training process into two parts. First, they train a machine learning model to look at the entire video to understand what actions are taking place at specific moments. This high-level information is called the global representation.

Second, they train the model to focus on a specific region in the parts of the video where the action takes place. For example, in a large kitchen, a model might focus only on the wooden spoon that the chef uses to mix pancake batter, rather than on the entire counter. This detailed information is called local representation.
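
To make the two-branch idea concrete, here is a minimal PyTorch sketch of how a global (temporal) branch and a local (spatial) branch could each be matched to transcript embeddings with a contrastive loss. The layer choices, the loss, and names such as GlobalBranch and LocalBranch are illustrative assumptions, not the authors’ published architecture.

```python
# Illustrative sketch only: a two-branch video-text model in the spirit of the
# global/local split described above. Architecture details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalBranch(nn.Module):
    """Pools over time to summarize *when* things happen in the whole video."""
    def __init__(self, feat_dim=512, embed_dim=256):
        super().__init__()
        self.temporal = nn.GRU(feat_dim, embed_dim, batch_first=True)

    def forward(self, frame_feats):            # (B, T, feat_dim)
        _, h = self.temporal(frame_feats)      # final hidden state summarizes the video
        return F.normalize(h.squeeze(0), dim=-1)   # (B, embed_dim)

class LocalBranch(nn.Module):
    """Attends over spatial regions to capture *where* the action happens."""
    def __init__(self, feat_dim=512, embed_dim=256):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, region_feats):           # (B, R, feat_dim) region features
        weights = torch.softmax(self.attn(region_feats), dim=1)  # (B, R, 1)
        pooled = (weights * region_feats).sum(dim=1)             # weighted region pooling
        return F.normalize(self.proj(pooled), dim=-1)

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss matching each video to its own transcript sentence."""
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0))
    return F.cross_entropy(logits, targets)

# Toy forward pass with random features standing in for real video/transcript encoders.
B, T, R, D = 4, 32, 16, 512
frame_feats = torch.randn(B, T, D)       # per-frame features (temporal stream)
region_feats = torch.randn(B, R, D)      # per-region features (spatial stream)
text_emb = F.normalize(torch.randn(B, 256), dim=-1)  # transcript embeddings

global_branch, local_branch = GlobalBranch(), LocalBranch()
loss = contrastive_loss(global_branch(frame_feats), text_emb) + \
       contrastive_loss(local_branch(region_feats), text_emb)
loss.backward()
```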

The researchers include an additional element in their framework to account for misalignments between the narration and the video. Perhaps the chef talks about cooking the pancake first and performs the action later.

To develop a more realistic solution, researchers focused on uncut videos of several minutes. In contrast, most AI techniques train using clips of a few seconds that someone has cropped to show just one action.

A new benchmark

But when they went to evaluate their approach, the researchers couldn’t find an effective benchmark for testing the model on longer, uncropped videos — so they created one.

To build their benchmark dataset, the researchers devised a new annotation technique that works well for identifying multi-step actions. Instead of drawing a box around important objects, users are asked to mark the intersection of objects, such as the point where a knife blade cuts through a tomato.

“It is more clearly defined and speeds up the annotation process, which reduces the effort and human costs,” Chen says.

Furthermore, having multiple people do point annotations on the same video can better capture actions that unfold over time, such as milk being poured. Not all annotators will mark the exact same point in the flow of liquid.
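
As a rough sketch of how such point annotations might be stored and used for evaluation, the example below collects points from several annotators and measures how many fall inside a model’s predicted regions. The data format and the coverage metric are assumptions for illustration, not the benchmark’s actual protocol.

```python
# Illustrative sketch: representing multi-annotator point annotations and checking
# whether a predicted region covers them. Field names and the coverage metric are
# assumptions for illustration, not the benchmark's actual format.
from dataclasses import dataclass

@dataclass
class PointAnnotation:
    time_sec: float   # when the annotator marked the interaction
    x: float          # normalized horizontal position in [0, 1]
    y: float          # normalized vertical position in [0, 1]
    annotator: str

def point_in_box(pt, box):
    """box = (x0, y0, x1, y1) in normalized coordinates."""
    x0, y0, x1, y1 = box
    return x0 <= pt.x <= x1 and y0 <= pt.y <= y1

def coverage(predicted_boxes, annotations):
    """Fraction of annotated points that fall inside the model's predicted box
    for the matching timestamp (rounded to the nearest second here)."""
    hits = 0
    for pt in annotations:
        box = predicted_boxes.get(round(pt.time_sec))
        if box is not None and point_in_box(pt, box):
            hits += 1
    return hits / len(annotations) if annotations else 0.0

# Three annotators mark slightly different points along "pouring milk".
annotations = [
    PointAnnotation(12.2, 0.41, 0.55, "a1"),
    PointAnnotation(12.8, 0.44, 0.58, "a2"),
    PointAnnotation(13.1, 0.47, 0.60, "a3"),
]
predicted_boxes = {12: (0.35, 0.50, 0.55, 0.70), 13: (0.40, 0.52, 0.60, 0.72)}
print(f"point coverage: {coverage(predicted_boxes, annotations):.2f}")
```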

When they used this benchmark to test their approach, the researchers found that it was more accurate at pinpointing actions than other AI techniques.

Their method also more effectively focused on human-object interactions. For example, if the action is “serve a pancake”, many other approaches may focus only on key objects, such as a stack of pancakes on the counter. Instead, their method focuses on the actual moment when the chef flips the pancake onto the plate.

“Existing approaches rely heavily on human-labeled data and are therefore not very scalable. This work takes a step toward solving this problem by providing new methods for localizing events in space and time using the speech that naturally occurs within them. This type of data is ubiquitous, so in theory it would be a powerful learning signal. However, it is often unrelated to what is on screen, which makes it challenging to use in machine learning systems. This work helps address that problem, making it easier for researchers to create systems that use this form of multimodal data in the future,” says Andrew Owens, an assistant professor of electrical engineering and computer science at the University of Michigan, who was not involved in the work.

Next, the researchers plan to refine their approach so that models can automatically detect when the narration and the video are misaligned and switch attention from one modality to the other. They also want to extend their framework to audio data, since there are usually strong correlations between actions and the sounds objects make.

“Artificial intelligence research has made incredible progress in creating models like ChatGPT that understand images. However, our progress in understanding video is far behind. This work is a significant step forward in that direction,” says Kate Saenko, a professor at Boston University’s Department of Computer Science, who was not involved in the work.

This research is funded in part by the MIT-IBM Watson AI Lab.
