The ability to reason abstractly about events as they unfold is a defining feature of human intelligence. We instinctively know that crying and writing are ways of communicating, and that a panda falling from a tree and a plane landing are different variants of descent.
Organizing the world into abstract categories does not come easily to computers, but in recent years researchers have inched closer to the problem by training machine learning models on words and images that carry structural information about the world and the relationships among objects, animals, and actions. In a new study presented this month at the European Conference on Computer Vision, researchers unveiled a hybrid language-vision model that can compare and contrast a set of dynamic events captured on video to learn the high-level concepts connecting them.
Their model performed as well as or better than humans on two types of visual reasoning tasks: selecting the video that conceptually best completes a set, and picking out the video that doesn’t belong. For example, shown videos of a dog barking and a man howling beside his dog, the model completed the set by selecting the crying baby from a set of five videos. The researchers replicated their results on two datasets for training AI systems in action recognition: MIT’s Multi-Moments in Time and DeepMind’s Kinetics.
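To make the two tasks concrete, here is a minimal sketch, assuming each video has already been mapped to an embedding vector, of how set completion and odd-one-out selection could be scored with simple cosine similarity. The embeddings, the mean-prototype scoring, and the function names are illustrative assumptions, not the method used in the study.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def complete_set(set_embeddings, candidate_embeddings):
    # Set completion: pick the candidate closest to the mean of the given set.
    prototype = np.mean(set_embeddings, axis=0)
    scores = [cosine(prototype, c) for c in candidate_embeddings]
    return int(np.argmax(scores))

def odd_one_out(set_embeddings):
    # Odd-one-out: pick the video least similar to the average of the others.
    n = len(set_embeddings)
    scores = []
    for i in range(n):
        others = np.mean([set_embeddings[j] for j in range(n) if j != i], axis=0)
        scores.append(cosine(set_embeddings[i], others))
    return int(np.argmin(scores))

# Toy usage with random vectors standing in for video embeddings.
rng = np.random.default_rng(0)
videos = rng.normal(size=(3, 128))       # e.g., barking dog, howling man, crying baby
candidates = rng.normal(size=(5, 128))   # five candidate videos to complete the set
print(complete_set(videos, candidates), odd_one_out(videos))
```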
“We show that you can build abstraction into an artificial intelligence system to perform common visual reasoning tasks at a human-like level,” says the study’s senior author, Aude Oliva, a senior research scientist at MIT, co-director of the MIT Quest for Intelligence, and MIT director of the MIT-IBM Watson AI Lab. “A model that can recognize abstract events will provide more accurate, logical predictions and be more useful in decision-making.”
As deep neural networks have become expert at recognizing objects and actions in photos and videos, researchers have set their sights on the next milestone: abstraction, and training models to reason about what they see. In one approach, researchers have merged the pattern-matching power of deep networks with the logic of symbolic programs to teach a model to interpret complex relationships among the objects in a scene. Here, in a different approach, the researchers exploit the relationships embedded in word meanings to give their model visual reasoning power.
“Language representations allow us to integrate contextual information learned from text databases into our visual models,” says study co-author Mathew Monfort, a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). “Words like ‘running,’ ‘weightlifting,’ and ‘boxing’ share certain characteristics that make them more closely related to the concept ‘exercising,’ for example, than to ‘driving.’”
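To give a rough sense of how word meanings can quantify this kind of relatedness, the sketch below uses NLTK’s WordNet interface to score similarity between verb senses. The library choice, the particular words, and the scoring function are assumptions made for illustration; they are not the representations used in the study.

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def verb_similarity(word_a, word_b):
    # Take the best path similarity over all verb senses of the two words.
    synsets_a = wn.synsets(word_a, pos=wn.VERB)
    synsets_b = wn.synsets(word_b, pos=wn.VERB)
    scores = [a.path_similarity(b) for a in synsets_a for b in synsets_b]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0

# Compare a few action verbs against higher-level concepts; print the scores side by side.
for action in ["run", "box"]:
    for concept in ["exercise", "drive"]:
        print(action, concept, round(verb_similarity(action, concept), 3))
```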
Using WordNet, a database of word meanings, the researchers mapped the relationship of each action-class label in Moments and Kinetics to the other labels in both datasets. Words like “sculpting,” “carving,” and “cutting,” for example, were connected to higher-level concepts like “crafting,” “making art,” and “cooking.” Now, when the model recognizes an activity like carving, it can pick out conceptually similar activities from the dataset.
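A minimal sketch of this kind of WordNet lookup, using NLTK’s interface, is shown below; it walks up the hypernym (more abstract parent) hierarchy above an action verb. The helper function, the choice of the first verb sense, and the number of levels are illustrative assumptions, and the authors’ actual mapping between the Moments and Kinetics labels may differ.

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def higher_level_concepts(action_label, max_levels=3):
    # Collect hypernyms (more abstract parent concepts) for the first verb
    # sense of an action label, walking up the WordNet hierarchy level by level.
    synsets = wn.synsets(action_label, pos=wn.VERB)
    if not synsets:
        return []
    concepts = []
    frontier = [synsets[0]]
    for _ in range(max_levels):
        frontier = [h for s in frontier for h in s.hypernyms()]
        concepts.extend(lemma.name() for s in frontier for lemma in s.lemmas())
    return concepts

for label in ["carve", "cut", "sculpt"]:
    print(label, "->", higher_level_concepts(label))
```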
This relational graph of abstract classes is used to train the model to perform two basic tasks. Given a set of videos, the model creates a numerical representation for each video that aligns with the word representations of the actions shown in the video. An abstraction module then combines the representations generated for each video in the set into a new set representation, which is used to identify the abstraction shared by all the videos in the set.
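The sketch below illustrates the general shape of such a pipeline in PyTorch: per-video embeddings aligned with a word-embedding space, an abstraction module that pools the set, and scores over abstract concept classes. The layer sizes, the mean-pooling choice, and the parameter names are assumptions for illustration, not the authors’ architecture.

```python
import torch
import torch.nn as nn

class SetAbstractionModel(nn.Module):
    """Illustrative sketch: embed videos, pool the set, score abstract concepts."""

    def __init__(self, video_feat_dim=2048, embed_dim=300, num_concepts=100):
        super().__init__()
        # Project precomputed per-video features into a language-aligned space
        # (in the study, this space matches word representations of the actions).
        self.video_proj = nn.Sequential(
            nn.Linear(video_feat_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        # Abstraction module: combine per-video embeddings into one set embedding.
        self.set_mlp = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())
        # Learnable embeddings for abstract concept classes (e.g., "exercising", "crafting").
        self.concept_embeddings = nn.Parameter(torch.randn(num_concepts, embed_dim))

    def forward(self, video_features):
        # video_features: (set_size, video_feat_dim) precomputed features for one video set.
        video_embeds = self.video_proj(video_features)      # (set_size, embed_dim)
        set_embed = self.set_mlp(video_embeds.mean(dim=0))   # (embed_dim,)
        # Score each abstract concept for the whole set.
        return self.concept_embeddings @ set_embed           # (num_concepts,)

# Toy usage: one set of three videos with 2048-dim precomputed features.
model = SetAbstractionModel()
scores = model(torch.randn(3, 2048))
print(scores.argmax().item())  # index of the predicted shared abstraction
```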
To see how the model would perform compared to humans, the researchers asked participants to complete the same set of visual reasoning tasks online. To their surprise, the model performed as well as humans in many scenarios, sometimes with unexpected results. In a variation on the set-completion task, after watching a video of someone wrapping a gift and covering it in tape, the model suggested a video of someone at a beach burying someone else in the sand.
“It does ‘covering,’ but it’s very different from the visual characteristics of the other clips,” says Camilo Fosco, a graduate student at MIT who co-authored the study with PhD student Alex Andonian. “Conceptually it fits, but I had to think about it.”
Among the model’s limitations is a tendency to overemphasize certain features. In one case, it suggested completing a set of sports videos with a video of a child and a ball, apparently associating balls with exercise and competition.
The researchers say a deep learning model that can be trained to think more abstractly may be able to learn with less data. Abstraction also paves the way toward higher-level, more human-like reasoning.
“One of the hallmarks of human cognition is our ability to describe something in terms of something else—to compare and contrast,” Oliva says. “It’s a rich and powerful way of learning that could eventually lead to machine learning models that can understand analogies and get much closer to communicating intelligently with us.”
Other authors of the study include Allen Lee of MIT, Rogerio Feris of IBM and Carl Vondrick of Columbia University.