Inspired by advances in large-scale language modeling, we take a similar approach to building a single generalist agent that goes beyond the realm of textual output. The agent, which we call Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on context whether to output text, joint torques, button presses, or other tokens.
During Gato’s training phase, data from different tasks and modalities are serialized into a flat sequence of tokens, batched, and processed by a transformer neural network similar to a large language model. The loss is masked so that Gato only predicts action and text targets.
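The serialization and loss masking can be sketched as follows. This is an illustrative simplification, not Gato's actual tokenizer: the token values, and the convention of interleaving observation tokens with action tokens per timestep, are assumptions for the example.

```python
# Minimal sketch of Gato-style serialization: flatten (observation, action)
# timesteps into one token sequence, with a parallel loss mask that is 1 only
# on action targets. (Illustrative; not the paper's exact vocabulary layout.)

def serialize_episode(steps):
    """steps: list of (obs_tokens, action_tokens) pairs, one per timestep.

    Returns a flat token list and a mask of the same length, where
    mask[i] == 1 means position i is a prediction target (action token)
    and mask[i] == 0 means it is context only (observation token)."""
    tokens, mask = [], []
    for obs_tokens, action_tokens in steps:
        tokens.extend(obs_tokens)
        mask.extend([0] * len(obs_tokens))     # observations: context only
        tokens.extend(action_tokens)
        mask.extend([1] * len(action_tokens))  # actions: training targets
    return tokens, mask

# Example: two timesteps, 3 observation tokens and 2 action tokens each.
steps = [([101, 102, 103], [7, 8]),
         ([104, 105, 106], [9, 10])]
tokens, mask = serialize_episode(steps)
```

In training, the mask would multiply the per-token cross-entropy so that gradients flow only through action (and text) positions.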
When Gato is deployed, a prompt, such as a demonstration, is tokenized to form the initial sequence. The environment then yields the first observation, which is also tokenized and appended to the sequence. Gato samples the action vector autoregressively, one token at a time.
Once all tokens making up the action vector have been sampled (their number is determined by the environment’s action specification), the action is decoded and sent to the environment, which steps forward and yields a new observation. The procedure then repeats. The model always sees all previous observations and actions within its context window of 1024 tokens.
Gato is trained on a large number of datasets comprising agent experience in both simulated and real-world environments, in addition to a variety of natural language and image datasets. The number of tasks on which the pre-trained Gato model’s performance exceeds a given percentage of the expert score, grouped by domain, is shown here.
The images below also show how the pre-trained Gato model, using the same weights, can caption images, engage in interactive dialogue, and control a robot arm, among other tasks.