Saturday, April 19, 2025

Performing multiple tasks using a single visual language model


One key aspect of intelligence is the ability to quickly learn how to perform a new task given a brief instruction. For example, a child who has looked at a few pictures of animals in a book can recognize real animals at the zoo, despite the differences between them. However, for a typical visual model to learn a new task, it must be trained on tens of thousands of examples specifically labeled for that task. If the goal is to count and identify animals in an image, as in "three zebras," it would be necessary to collect thousands of images and annotate each one with the number and species of the animals it contains. This process is inefficient, expensive, and resource-intensive, requiring large amounts of annotated data and a new model to be trained each time a new task arises. As part of DeepMind's mission to solve intelligence, we investigated whether an alternative model could make this process easier and more efficient, given only limited task-specific information.

Today, in our preprint paper, we present Flamingo, a single visual language model (VLM) that sets a new state of the art in few-shot learning across a wide range of open-ended multimodal tasks. This means that Flamingo can solve a range of challenging problems using just a few task-specific examples (in a "few-shot" setting), without requiring any additional training. Flamingo's simple interface makes this possible: it takes as input a prompt consisting of interleaved images, videos, and text, and then outputs the associated language.

Similar to large language models (LLMs), which can tackle a language task by processing examples of the task in their text prompt, Flamingo's visual and textual interface can steer the model toward solving a multimodal task. Given a few example pairs of visual inputs and expected textual responses composed into Flamingo's prompt, the model can be asked a question about a new image or video and then generate an answer.
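To make the interleaved prompt structure concrete, here is a minimal sketch of how a few-shot multimodal prompt might be assembled. The `Image` placeholder class and the "Output:" formatting are illustrative assumptions, not Flamingo's actual input pipeline; the point is that support pairs of (image, expected text) are interleaved and followed by the query image the model should describe.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical placeholder for an image in the prompt; Flamingo's real input
# pipeline is not public, so this only illustrates the interleaved structure.
@dataclass
class Image:
    path: str

PromptElement = Union[Image, str]

def build_few_shot_prompt(
    support: List[Tuple[Image, str]], query: Image
) -> List[PromptElement]:
    """Interleave (image, expected text) support pairs, then append the query image."""
    prompt: List[PromptElement] = []
    for image, answer in support:
        prompt.append(image)
        prompt.append(f"Output: {answer}")
    prompt.append(query)  # the model is expected to continue with "Output: ..."
    return prompt

prompt = build_few_shot_prompt(
    support=[
        (Image("zebras.jpg"), "three zebras"),
        (Image("pandas.jpg"), "two pandas"),
    ],
    query=Image("unknown_animal.jpg"),
)
# A Flamingo-style model would take this interleaved sequence and generate
# the continuation, e.g. "Output: one giraffe".
print(prompt)
```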

Across the 16 tasks we studied, Flamingo outperforms all previous few-shot learning approaches when given as few as four examples per task. In several cases, the same Flamingo model outperforms methods that are fine-tuned and optimized independently for each task and use many orders of magnitude more task-specific data. This should enable non-experts to quickly and easily apply visual language models to new tasks.

In practice, Flamingo combines large language models with powerful visual representations – each separately pre-trained and frozen – by adding novel architectural components in between. It is then trained on a mixture of complementary, large-scale, multimodal data sourced exclusively from the web, without using any data annotated specifically for machine learning. Following this approach, we start from Chinchilla, our recently introduced compute-optimal 70B parameter language model, to train our final Flamingo model, an 80B parameter VLM. Once this training is complete, Flamingo can be directly adapted to visual tasks via simple few-shot learning without any additional task-specific tuning.
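The Flamingo paper describes these "components in between" as newly initialized, gated cross-attention layers inserted between the frozen layers of the pre-trained language model, so that visual features can condition text generation without disturbing the frozen weights at the start of training. Below is a minimal PyTorch sketch of that idea under stated assumptions: the dimensions, the stand-in LM layer, and the single scalar gate are illustrative choices, not Flamingo's exact architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Sketch of a gated cross-attention block inserted between frozen LM layers.

    A tanh gate initialized at zero means the frozen language model behaves
    unchanged at the start of training and gradually learns to attend to
    visual features. Sizes here are illustrative, not Flamingo's config.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: no-op at init

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

class FlamingoStyleBlock(nn.Module):
    """One frozen LM layer preceded by a trainable gated cross-attention layer."""

    def __init__(self, lm_layer: nn.Module, dim: int):
        super().__init__()
        self.xattn = GatedCrossAttention(dim)
        self.lm_layer = lm_layer
        for p in self.lm_layer.parameters():  # keep pre-trained weights frozen
            p.requires_grad = False

    def forward(self, text_tokens, visual_tokens):
        text_tokens = self.xattn(text_tokens, visual_tokens)
        return self.lm_layer(text_tokens)

# Toy usage: a stand-in LM layer and random visual features from a frozen encoder.
dim = 512
block = FlamingoStyleBlock(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), dim)
text = torch.randn(1, 16, dim)     # 16 text tokens
visual = torch.randn(1, 64, dim)   # 64 visual tokens from a frozen vision encoder
out = block(text, visual)
print(out.shape)  # torch.Size([1, 16, 512])
```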

We also tested the model's qualitative capabilities beyond our current benchmarks. As part of this process, we compared the model's performance when captioning images across gender and skin color, and ran the model-generated captions through Google's Perspective API, which measures the toxicity of text. While the initial results are positive, more research into evaluating the ethical risks of multimodal systems is needed, and we urge people to evaluate and carefully consider these issues before thinking of deploying such systems in the real world.
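For readers unfamiliar with that evaluation step, here is a small sketch of scoring a generated caption with Google's Perspective API (the `commentanalyzer.googleapis.com` endpoint). The API key placeholder and sample caption are our own; error handling and rate limiting are omitted for brevity.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder: supply your own Perspective API key
URL = (
    "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
    f"?key={API_KEY}"
)

def toxicity_score(text: str) -> float:
    """Return the TOXICITY summary score (0-1) for a piece of text."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    request = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return result["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Example: score a model-generated caption.
print(toxicity_score("A person walking a dog on the beach."))
```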

Multimodal capabilities are essential for important AI applications, such as aiding visually impaired people with everyday visual challenges or improving the identification of hateful content on the web. Flamingo makes it possible to adapt efficiently to these examples and other tasks on the fly without modifying the model. Interestingly, the model also demonstrates out-of-the-box capabilities for multimodal dialogue, as seen here.

Flamingo is an effective and efficient family of general-purpose models that can be applied to image and video understanding tasks with a minimal number of task-specific examples. Models like Flamingo hold great promise for practical benefits to society, and we are continuing to improve their flexibility and capabilities so that they can be safely deployed for the benefit of all. Flamingo's capabilities pave the way toward rich interactions with learned visual language models that can enable better interpretation and exciting new applications, such as a visual assistant that helps people in their daily lives – and we are delighted by the results so far.
