The internet experienced a collective feel-good moment with the introduction of DALL-E, an AI-powered image generator that uses natural language to create whatever mysterious and stunning image your heart desires, its name a nod to the artist Salvador Dalí and the adorable robot WALL-E. Seeing typed prompts like “smiling gopher holding an ice cream cone” spring to life almost instantly has made a clear impact on the world.
Getting a smiling gopher, attributes and all, to appear on screen is no easy task. DALL-E 2 uses something called a diffusion model, which tries to encode the entire text prompt into one description in order to generate an image. But once the text contains many more details, it is hard to squeeze them all into a single description. And although diffusion models are very flexible, they sometimes struggle to understand the composition of certain concepts, for example confusing the attributes of, or relationships between, different objects.
Take, for example, a red truck and a green house. The model tends to confuse these two concepts as the sentence gets more complicated: a typical generator like DALL-E 2 might produce a green truck and a red house, swapping the colors. The team’s approach handles this type of attribute-to-object binding and, especially when dealing with multiple sets of objects, allows each object to be handled more precisely.
“The model can effectively model object positions and relational descriptions, which is challenging for existing image generation models. For example, it can place an object like a cube in one position and a ball in another. DALL-E 2 is good at generating natural images, but sometimes has difficulty understanding the relationships between objects,” says MIT CSAIL graduate student and co-author Shuang Li. “Beyond art and creativity, perhaps we could use our model for teaching. If you want to tell a child to put a cube on top of a ball, saying it in language alone may be hard for them to understand. But our model can generate the image and show them.”
To make Dalí proud
Composable Diffusion, the team’s model, uses diffusion models alongside compositional operators to combine text descriptions without further training. The approach captures text details more accurately than the original diffusion model, which directly encodes the words as a single long sentence. For example, given “a pink sky” AND “a blue mountain on the horizon” AND “cherry blossoms in front of the mountain”, the team’s model produced exactly that image, while the original diffusion model made the sky blue and everything in front of the mountains pink.
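To give a rough sense of how this kind of composition can look in code, the sketch below combines per-concept noise predictions from a single pretrained text-conditioned diffusion model in a classifier-free-guidance style, which is the general flavor of the conjunction (AND) operator. The function names, the dummy denoiser, and the weights are illustrative assumptions, not the team’s actual implementation.

```python
import torch

def compose_and(eps_fn, x_t, t, conds, weights):
    """Conjunction (AND) operator, sketched in classifier-free-guidance
    style: start from the unconditional noise prediction and add a
    weighted guidance direction for each concept."""
    eps_uncond = eps_fn(x_t, t, None)            # unconditional prediction
    eps = eps_uncond.clone()
    for cond, w in zip(conds, weights):
        eps_cond = eps_fn(x_t, t, cond)          # prediction for one concept
        eps = eps + w * (eps_cond - eps_uncond)  # push toward that concept
    return eps

# Toy stand-in for a text-conditioned denoiser, for illustration only;
# a real system would call a pretrained diffusion model here.
def dummy_eps_fn(x_t, t, cond):
    gen = torch.Generator().manual_seed(0 if cond is None else abs(hash(cond)) % 10_000)
    return torch.randn(x_t.shape, generator=gen)

x_t = torch.randn(1, 3, 64, 64)                  # a noisy image at some timestep
prompts = ["a pink sky",
           "a blue mountain on the horizon",
           "cherry blossoms in front of the mountain"]
eps_hat = compose_and(dummy_eps_fn, x_t, t=500, conds=prompts, weights=[1.0, 1.0, 1.0])
print(eps_hat.shape)  # torch.Size([1, 3, 64, 64])
```

At each step of the reverse diffusion process, a composed prediction of this kind would stand in for the usual single-prompt prediction, so all three concepts steer the same image.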
“The fact that our model is composable means that you can learn different parts of it one at a time. You can first learn an object on top of another, then learn an object to the right of another, and then learn something to the left of another,” says co-author and MIT CSAIL PhD student Yilun Du. “Because we can compose these together, you can imagine that our system lets us gradually learn language, relationships, and knowledge, which we think is a pretty interesting direction for future work.”
While the model demonstrated an ability to generate complex, photorealistic images, it still faced challenges, since it was trained on a much smaller dataset than models like DALL-E 2 and so simply could not capture certain objects.
Now that Composable Diffusion can work on top of generative models such as DALL-E 2, the researchers see continual learning as a potential next step. Given that more is usually added to object relationships, they want to see whether diffusion models can begin to “learn” without forgetting previously acquired knowledge, to the point where the model can produce images drawing on both prior and new knowledge.
“This study proposes a new method for composing concepts in text-to-image generation, not by joining them into a single prompt but by computing scores for each concept and composing them with conjunction and negation operators,” says Mark Chen, co-creator of DALL-E 2 and researcher at OpenAI. “This is a nice idea that leverages the energy-based interpretation of diffusion models, so that old ideas around compositionality from energy-based models can be applied. The approach can also make use of classifier-free guidance, and it is surprising to see that it outperforms the GLIDE baseline on various compositional benchmarks and can qualitatively produce very different types of image generations.”
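For intuition on the negation operator Chen mentions, one way to sketch it, reusing the hypothetical compose_and above, is to give the unwanted concept a negative weight so that its guidance direction is subtracted and the sample is pushed away from that concept. The prompts and the -0.5 scale here are illustrative assumptions, not the paper’s exact parameterization.

```python
# Illustrative "X AND NOT Y" composition: a negative weight subtracts
# the guidance direction of the concept to avoid.
eps_hat = compose_and(dummy_eps_fn, x_t, t=500,
                      conds=["a forest in spring", "autumn colors"],
                      weights=[1.0, -0.5])
```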
“Humans have many ways to compose scenes with different elements, but it’s difficult for computers,” says Bryan Russell, a research associate at Adobe Systems. “This work proposes an elegant formulation that explicitly composes a set of diffusion models to generate an image from a complex natural language prompt.”
In addition to Li and Du, the paper’s co-authors include Nan Liu, a computer science student at the University of Illinois at Urbana-Champaign, and MIT professors Antonio Torralba and Joshua B. Tenenbaum. The work will be presented at the 2022 European Conference on Computer Vision.
The research was supported by Raytheon BBN Technologies Corp., Mitsubishi Electric Research Laboratory and DEVCOM Army Research Laboratory.