The original version of this story appeared in Quanta Magazine.
Here’s a test for babies: show them a glass of water on a desk, then hide it behind a wooden board. Now move the board toward the glass. If the board keeps moving as if the glass weren’t there, are they surprised? Many 6-month-olds are, and by about a year almost all babies have an intuitive sense of object permanence, learned through observation. Now some AI models do too.
Scientists have developed an artificial intelligence system that learns about the world through videos and exhibits the concept of “surprise” when presented with information that contradicts the knowledge it has accumulated.
The model, created by Meta and called the Video Joint Embedding Predictive Architecture (V-JEPA), makes no built-in assumptions about the physics of the world shown in the videos. Nevertheless, it can begin to understand how the world works.
“Their claims are a priori very plausible and the results are extremely interesting,” says Michael Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world.
Higher abstractions
As engineers building autonomous cars know, ensuring that an AI system reliably understands what it sees can be arduous. Most systems designed to “understand” videos in order to classify their content (for example, “a person playing tennis”) or to identify the outlines of an object, such as a car ahead, operate in what is called “pixel space”: the model essentially treats every pixel in a video as equally important.
However, these pixel space models have limitations. Imagine trying to understand a suburban street. If the scene contains cars, traffic lights, and trees, the model may focus too much on unimportant details such as the movement of leaves, while overlooking the color of the traffic lights or the position of nearby cars. “When you look at photos or videos, you don’t want to work in [pixel] space because there are too many details you don’t want to model,” said Randall Balestriero, a computer scientist at Brown University.
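To make the distinction concrete, here is a minimal toy sketch (not Meta’s actual implementation) contrasting a pixel-space loss, which weights every pixel equally, with a JEPA-style comparison in a compact embedding space. The “encoder” here is just average pooling, an assumption chosen for simplicity; real models learn the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy video "frames" as 8x8 grayscale pixel grids.
frame_t = rng.random((8, 8))
frame_next = frame_t.copy()
frame_next[0, 0] += 0.5  # one tiny, unimportant change (a "leaf" pixel)

# A pixel-space model scores every pixel equally, so even irrelevant
# detail contributes to the loss.
pixel_loss = np.mean((frame_next - frame_t) ** 2)

# A JEPA-style model instead compares compact embeddings of the frames.
# Stand-in encoder: average-pool each 4x4 block into a 2x2 summary,
# which discards fine pixel detail.
def encode(frame):
    return frame.reshape(2, 4, 2, 4).mean(axis=(1, 3))

latent_loss = np.mean((encode(frame_next) - encode(frame_t)) ** 2)

# Pooling shrinks the influence of the single-pixel difference.
print(pixel_loss > latent_loss)  # → True
```

The point of the sketch is only that an embedding space can suppress fine-grained detail (the single changed pixel) while preserving the coarse structure of the scene, which is the intuition behind moving away from pixel space.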
