In the current AI zeitgeist, sequence models have surged in popularity for their ability to analyze data and predict what to do next. For example, you’ve probably used next-token prediction models like ChatGPT, which predict each word (token) in a sequence to form responses to user queries. There are also full-sequence diffusion models, such as Sora, that turn words into dazzling, realistic visuals by successively “denoising” an entire video sequence.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a simple change to the diffusion training scheme that makes this sequence denoising considerably more flexible.
When applied to fields such as computer vision and robotics, next-token and full-sequence diffusion models involve trade-offs in capability. Next-token models can spit out sequences of varying lengths. However, they generate them without awareness of desirable states in the far future – such as steering sequence generation toward a specific goal 10 tokens away – and therefore require additional mechanisms for long-horizon planning. Diffusion models can perform such future-conditioned sampling, but they lack the ability of next-token models to generate variable-length sequences.
The CSAIL researchers wanted to combine the strengths of both models, so they created a sequence model training technique called “Diffusion Forcing.” The name comes from “Teacher Forcing,” the conventional training scheme that breaks full sequence generation into the smaller, easier steps of next-token prediction (much like a good teacher simplifying a complex concept).
Diffusion Forcing found common ground between diffusion models and teacher forcing: both use training schemes that involve predicting masked (noisy) tokens from unmasked ones. In the case of diffusion models, they gradually add noise to data, which can be viewed as fractional masking. The MIT researchers’ Diffusion Forcing method trains neural networks to cleanse a collection of tokens, removing a different amount of noise from each one while simultaneously predicting the next few tokens. The result: a flexible, reliable sequence model that produced higher-quality synthetic videos and more precise decision-making by robots and AI agents.
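To make that idea concrete, here is a minimal training-step sketch of per-token noising, written in PyTorch style. It is a sketch under assumptions of our own – the denoiser interface model(noisy_seq, noise_levels), the tensor shapes, and the linear noise schedule are illustrative, not the authors’ actual implementation.

```python
# Minimal sketch of the Diffusion Forcing training idea (illustrative only):
# every token in the sequence gets its own independent noise level, and the
# network is trained to recover the clean sequence from the partially noised one.
# The denoiser interface and the linear schedule below are assumptions.
import torch
import torch.nn.functional as F

def diffusion_forcing_train_step(model, optimizer, clean_seq, num_levels=1000):
    """clean_seq: (batch, seq_len, dim) tensor of clean tokens."""
    batch, seq_len, _ = clean_seq.shape

    # Unlike full-sequence diffusion (one noise level for the whole sequence),
    # sample an independent noise level for each token.
    noise_levels = torch.randint(0, num_levels, (batch, seq_len))
    alpha = 1.0 - noise_levels.float() / num_levels           # toy linear schedule
    alpha = alpha.unsqueeze(-1)                                # (batch, seq_len, 1)

    # Partially "mask" each token by mixing it with Gaussian noise.
    noise = torch.randn_like(clean_seq)
    noisy_seq = alpha.sqrt() * clean_seq + (1.0 - alpha).sqrt() * noise

    # The network sees every token's noise level and predicts the clean tokens.
    pred = model(noisy_seq, noise_levels)
    loss = F.mse_loss(pred, clean_seq)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```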
By sorting through noisy data and reliably predicting the next steps in a task, Diffusion Forcing can help a robot ignore visual distractions while completing manipulation tasks. It can also generate stable and consistent video sequences and even guide an AI agent through digital mazes. This method could potentially enable household and factory robots to generalize to new tasks and improve AI-generated entertainment.
“Sequence models aim to condition on the known past and predict the unknown future, which is a type of binary masking. But masking doesn’t have to be binary,” says lead author, MIT electrical engineering and computer science (EECS) doctoral student and CSAIL member Boyuan Chen. “With Diffusion Forcing, we add different levels of noise to each token, effectively serving as a type of fractional masking. At sampling time, our system can ‘unmask’ a collection of tokens and diffuse a sequence in the near future at a lower noise level. It knows what to trust within its data to overcome out-of-distribution inputs.”
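The sampling-time intuition in that quote can be sketched the same way: at generation time, tokens in the near future can be assigned lower noise levels than distant ones. The linear ramp below is an assumed schedule chosen for illustration, not the exact one used in the paper.

```python
# Illustrative per-token noise schedule for sampling: near-future tokens are
# kept at low noise (nearly "unmasked"), while distant tokens stay noisier,
# encoding the idea that the far future is more uncertain. The ramp shape
# and the parameter defaults are assumptions for this example.
import torch

def horizon_noise_levels(horizon, num_levels=1000, min_level=50):
    """Return one noise level per future token, increasing with distance."""
    ramp = torch.linspace(min_level, num_levels - 1, steps=horizon)
    return ramp.round().long()

# Example: five future tokens, noise rising from near-clean to near-pure noise.
print(horizon_noise_levels(5))
```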
In several experiments, Diffusion Forcing succeeded at ignoring misleading data to carry out tasks while anticipating future actions.
For example, when implemented in a robotic arm, it helped swap two toy fruits across three circular mats, a minimal example of a family of long-horizon tasks that require memory. The researchers trained the robot by controlling it from a distance (teleoperating it) in virtual reality; the robot learned to mimic the user’s movements from its camera. Despite starting from random positions and seeing distractions such as a shopping bag blocking the markers, it placed the objects in their target spots.
To generate videos, they trained Diffusion Forcing on “Minecraft” gameplay and colorful digital environments created within Google’s DeepMind Lab Simulator. When given a single frame of footage, the method produced more stable, higher-resolution videos than comparable baselines such as a Sora-like full-sequence diffusion model and ChatGPT-like next-token models. Those baselines produced videos that looked inconsistent, with the next-token models sometimes failing to generate working video beyond just 72 frames.
Diffusion Forcing not only generates fancy videos; it can also serve as a motion planner that steers toward desired outcomes or rewards. Thanks to its flexibility, Diffusion Forcing can uniquely generate plans with varying horizons, perform tree search, and incorporate the intuition that the distant future is more uncertain than the near future. In a 2D maze-solving task, Diffusion Forcing outperformed six baselines by generating faster plans leading to the goal location, indicating that it could be an effective planning tool for robots in the future.
In each demo, Diffusion Forcing operated as a full-sequence model, a next-token prediction model, or both. According to Chen, this versatile approach could serve as a powerful backbone for a “world model,” an AI system that can simulate the dynamics of the world by training on billions of internet videos. This would allow robots to perform novel tasks by imagining what they need to do based on their surroundings. For example, if you asked a robot to open a door without it having been trained to do so, the model could produce a video showing the machine how to do it.
The team is currently working to scale their method to larger datasets and the latest transformer models to improve performance. They plan to expand on their work by building a ChatGPT-like robot brain that will help robots perform tasks in new environments without human demonstration.
“With Diffusion Forcing, we are taking a step toward bringing video generation and robotics closer together,” says senior author Vincent Sitzmann, an assistant professor at MIT and a member of CSAIL, where he leads the Scene Representation group. “Ultimately, we hope to use all the knowledge stored in online videos to enable robots to help in everyday life. Many exciting research challenges remain, such as how robots can learn to imitate humans by observing them, even though their bodies are so different from ours!”
Chen and Sitzmann wrote the paper alongside recent MIT visiting researcher Diego Martí Monsó and CSAIL collaborators: Yilun Du, an EECS graduate student; Max Simchowitz, former postdoc and incoming assistant professor at Carnegie Mellon University; and Russ Tedrake, the Toyota Professor of EECS, Aeronautics and Astronautics, and Mechanical Engineering at MIT, vice president of robotics research at the Toyota Research Institute, and CSAIL member. Their work was supported, in part, by the U.S. National Science Foundation, the Singapore Defence Science and Technology Agency, the Intelligence Advanced Research Projects Activity via the U.S. Department of the Interior, and the Amazon Science Hub. They will present their research at NeurIPS in December.