Sunday, April 20, 2025

Measuring perception in AI models


A new benchmark for evaluating multimodal systems, based on real-world video, audio, and text data

From the Turing test to ImageNet, benchmarks have played a fundamental role in shaping artificial intelligence (AI), helping to define research goals and enabling researchers to measure progress toward those goals. Incredible breakthroughs over the past 10 years, such as AlexNet in computer vision and AlphaFold in protein folding, have been closely linked to the use of benchmark datasets, allowing researchers to rank model design and training choices and to iterate to improve their models. As we work toward the goal of building artificial general intelligence (AGI), developing robust and effective benchmarks that expand the capabilities of AI models is as important as developing the models themselves.

Perception—the process of experiencing the world through our senses—is a crucial part of intelligence. Building agents with a human-level perceptual understanding of the world is a central but challenging task, one that is becoming increasingly important in robotics, self-driving cars, personal assistants, medical imaging, and more. That’s why today we’re introducing the Perception Test, a multimodal benchmark that uses real-world videos to help evaluate the perception capabilities of a model.

Developing a perception benchmark

Many perception-related benchmarks are currently used in AI research, such as Kinetics for video action recognition, AudioSet for audio event classification, MOT for object tracking, or VQA for image question-answering. These benchmarks have led to incredible progress in how AI model architectures and training methods are built and developed, but each of them targets only restricted aspects of perception: image benchmarks exclude temporal aspects; visual question-answering tends to focus on high-level semantic understanding of the scene; object tracking tasks generally capture the lower-level appearance of individual objects, such as color or texture. And very few benchmarks define tasks over both audio and visual modalities.

Multimodal models, such as Perceiver, Flamingo, or BEiT-3, aim to be more general models of perception. However, their evaluations have been based on multiple specialized datasets because no dedicated benchmark was available. This process is slow and expensive, and it does not provide full coverage of general perception abilities such as memory, making it difficult for researchers to compare methods.

To address many of these issues, we created a dataset of purposefully designed videos of real-world activities, labeled according to six different task types (sketched in code after the list):

  1. Object tracking: a box is provided around an object at the beginning of the video; the model must return a full track of the object throughout the entire video (including through occlusions).
  2. Point tracking: a point is selected early in the video; the model must track the point throughout the video (also through occlusions).
  3. Temporal action localization: the model must temporally localize and classify a predefined set of actions.
  4. Temporal sound localization: the model must temporally localize and classify a predefined set of sounds.
  5. Multiple-choice video question-answering: textual questions about the video, each with three answer choices.
  6. Grounded video question-answering: textual questions about the video; the model must return one or more object tracks.
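To make these task formats concrete, here is a minimal sketch of how annotations for the six task types might be represented in Python. The class and field names are illustrative assumptions, not the benchmark’s actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box in a single frame (illustrative)."""
    frame: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class ObjectTrack:
    """Object tracking: one box per frame for the full video, including occlusions."""
    object_id: int
    boxes: List[Box]


@dataclass
class PointTrack:
    """Point tracking: (frame, x, y) locations for a point selected early in the video."""
    point_id: int
    locations: List[Tuple[int, float, float]]


@dataclass
class TemporalSegment:
    """Temporal action/sound localization: a labeled time interval."""
    label: str
    start_sec: float
    end_sec: float


@dataclass
class MultipleChoiceQA:
    """Multiple-choice video QA: one question, three options, one correct index."""
    question: str
    options: List[str]
    answer_index: int


@dataclass
class GroundedQA:
    """Grounded video QA: a question whose answer is one or more object tracks."""
    question: str
    answer_tracks: List[ObjectTrack]
```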

We drew inspiration from the way children’s perception is assessed in developmental psychology, as well as from synthetic datasets such as CATER and CLEVRER, and designed 37 video scenarios, each with different variations to ensure a balanced dataset. Each variation was filmed by at least a dozen crowd-sourced participants (similar to previous work on Charades and Something-Something), with more than 100 participants in total, resulting in 11,609 videos with an average length of 23 seconds.

The videos show simple games or everyday activities, which allow us to define tasks that require the following skills to solve:

  • Knowledge of semantics: testing aspects such as task completion and recognition of objects, actions, or sounds.
  • Understanding of physics: collisions, motion, occlusions, spatial relations.
  • Temporal reasoning or memory: temporal ordering of events, counting over time, detecting changes in a scene.
  • Abstraction abilities: shape matching, same/different notions, pattern detection.

Crowd-sourced participants labeled the videos with spatial and temporal annotations (object bounding box tracks, point tracks, action segments, sound segments). Our research team designed the questions for each scenario type for the multiple-choice and grounded video question-answering tasks to ensure a wide variety of skills tested, for example, questions that probe the ability to reason counterfactually or to provide explanations for a given situation. The corresponding answers for each video were again provided by crowd-sourced participants.

Evaluating multimodal systems with the Perception Test

We assume that models have been pre-trained on external datasets and tasks. The Perception Test includes a small fine-tuning set (20%) that model creators can optionally use to convey the nature of the tasks to their models. The remaining data (80%) consists of a public validation split and a held-out test split, where performance can only be evaluated via our evaluation server.
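As a rough illustration of this protocol, the sketch below assumes a hypothetical local layout with a fine-tuning split, a public validation split, and an unlabeled held-out test split whose predictions are packaged for upload to the evaluation server. The paths, file names, and submission format are assumptions for illustration only.

```python
import json
from pathlib import Path

# Hypothetical local layout; the benchmark's actual file names may differ.
DATA_ROOT = Path("perception_test_data")


def load_split(split: str) -> dict:
    """Load annotations for one split (the held-out test split would ship without labels)."""
    return json.loads((DATA_ROOT / f"{split}_annotations.json").read_text())


def package_test_predictions(predictions: dict, out_path: str = "submission.json") -> None:
    """Write held-out test predictions to a single JSON file, ready to upload
    to the evaluation server (this submission format is illustrative)."""
    Path(out_path).write_text(json.dumps(predictions))


tuning_set = load_split("train")      # small (~20%) split used to convey the task formats
validation_set = load_split("valid")  # public validation split for local evaluation
```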

Here we show a diagram of the evaluation setup: the inputs are a video and audio sequence, plus a task specification. The task can be given in high-level text form, for visual question-answering, or as low-level input, such as the coordinates of an object’s bounding box for the object tracking task.
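The sketch below expresses that interface as a minimal Python protocol: a model receives video frames, an audio waveform, and a task specification that is either high-level text (a question with options) or low-level coordinates (an initial bounding box), and returns a task-specific output. The names and types are illustrative assumptions, not the benchmark’s API.

```python
from typing import Protocol, Sequence, Tuple, Union

import numpy as np

# A task specification is either a text question with answer options,
# or the initial bounding box of the object to track (illustrative types).
TextTask = dict   # e.g. {"question": str, "options": [str, str, str]}
BoxTask = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in the first frame
TaskSpec = Union[TextTask, BoxTask]


class PerceptionModel(Protocol):
    def __call__(
        self,
        video: np.ndarray,   # (num_frames, height, width, 3) RGB frames
        audio: np.ndarray,   # (num_samples,) waveform
        task: TaskSpec,
    ) -> Union[int, Sequence[BoxTask]]:
        """Return an answer index for question tasks, or a per-frame
        sequence of boxes/points for tracking tasks."""
        ...
```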

The evaluation results are detailed across several dimensions, and we measure abilities across the six computational tasks. For the visual question-answering tasks, we also provide a mapping of questions to the types of situations shown in the videos and the types of reasoning required to answer them, for a more detailed analysis (see our paper for more details). An ideal model would maximize scores across all radar plots and all dimensions. This gives a detailed assessment of a model’s skills, allowing us to narrow down areas for improvement.
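For example, per-skill-area scores could be aggregated and rendered as a radar plot roughly as follows; the skill areas mirror the list above, but the scores and plotting details are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative per-area scores (fraction correct) for one model; not real results.
areas = ["Semantics", "Physics", "Memory", "Abstraction"]
scores = [0.62, 0.48, 0.41, 0.37]

# Close the polygon by repeating the first value.
angles = np.linspace(0, 2 * np.pi, len(areas), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.2)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(areas)
ax.set_ylim(0, 1)
plt.title("Per-skill results (illustrative)")
plt.show()
```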

When developing the benchmark, ensuring diversity among the participants and the scenes shown in the videos was critical. To achieve this, we selected participants from different countries, of different ethnicities and genders, and aimed for diverse representation within each type of video scenario.

Learn more about the Perception Test

The Perception Test benchmark is publicly available here, and further details are available in our paper. A leaderboard and a challenge server will also be available soon.

On October 23, 2022, we will be hosting a workshop on general perception models at the European Conference on Computer Vision (ECCV 2022) in Tel Aviv, where we will discuss our approach and how to design and evaluate general perception models with other leading experts in the field.

We hope that the Perception Test will inspire and guide further research toward general perception models. Going forward, we hope to collaborate with the multimodal research community to introduce additional annotations, tasks, metrics, or even new languages to the benchmark.

If you are interested in collaborating, please get in touch by emailing perception-test@google.com!
