Q: AI image generators use models such as Stable Diffusion to turn words into stunning images in moments. But behind every image they draw on, there is usually a person. So where is the line between artificial intelligence and human creativity, and how do these models really work?
A: Imagine all the images you can find with a Google search, along with the captions associated with them. This is the diet these models are fed. They are trained on all of these images and their captions to generate images similar to the billions of images they have seen on the internet.
Suppose the model has seen many photos of dogs. It is trained so that when it receives a text prompt like "dog", it can generate a photo that looks very similar to the many dog photos it has already seen. More methodologically, how this all works goes back to a very old class of models called energy-based models, dating back to the 1970s and 1980s.
In energy-based models, an energy landscape over images is constructed and used to simulate physical dissipation to generate images. When you drop a dot of ink into water, it dissolves, and at the end you simply get a uniform texture. But if you try to reverse this process of dissipation, you gradually recover the original ink dot in the water. Or say you have a very intricate block tower, and if you hit it with a ball, it collapses into a pile of blocks. That pile of blocks is then very disordered and doesn't have much structure to it. To resurrect the tower, you can try to reverse this collapsing process to regenerate the original block tower.
The way these generative models generate images is very similar: you start with a really nice image, corrupt it into random noise, and the model basically learns to simulate the reverse of that process, going from noise back to the original image, iteratively refining the image to make it more and more realistic.
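To make that iterative-refinement idea concrete, here is a minimal toy sketch (in PyTorch) of the two halves of a diffusion model: a forward process that gradually corrupts an image into noise, and a reverse sampling loop that starts from pure noise and repeatedly denoises it. The `denoiser` network and the `betas` noise schedule are placeholders for illustration, not any particular model's implementation.

```python
import torch

def forward_noise(image, t, betas):
    """Corrupt a clean image at step t with Gaussian noise (the 'ink dissolving' direction)."""
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)          # cumulative signal retention
    noise = torch.randn_like(image)
    noisy = alpha_bars[t].sqrt() * image + (1.0 - alpha_bars[t]).sqrt() * noise
    return noisy, noise                                      # the network is trained to predict `noise`

@torch.no_grad()
def sample(denoiser, shape, betas):
    """Reverse the corruption: start from random noise and iteratively refine it."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # pure noise, like fully dissolved ink
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t)                                 # predicted noise at this step
        x = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                            # keep a little randomness until the end
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                                 # progressively refined image
```

In a text-to-image model, the denoiser is additionally conditioned on the prompt, so the same loop refines noise toward images that match the text.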
When it comes to the line between artificial intelligence and human creativity, you could say that these models are really trained on human creativity. The internet is full of all kinds of pictures and images that people have already created in the past. These models are trained to capture and regenerate the kinds of images that have appeared on the internet. As a result, they are more like a crystallization of what people have spent their creativity on for hundreds of years.
At the same time, because these models are trained on what humans have designed, they can generate works of art very similar to what humans have made in the past. They can find patterns in art created by humans, but it is much harder for these models to generate genuinely original images on their own.
If you type in a prompt like "abstract art" or "unique art" or something similar, the model doesn't really understand the originality aspect of human art. The models tend, so to speak, to recapitulate what people have done in the past rather than create fundamentally new and original art.
Because these models are trained on huge swaths of images from the internet, many of which are likely copyrighted, you don't know exactly what the model is drawing on when it generates new images. So a big question is how you can even determine whether the model is using copyrighted images. And if the model depends in some way on copyrighted images, are the new images themselves copyrighted? That is another question that needs to be addressed.
Q: Do you believe that images generated by diffusion models encode some kind of understanding of the natural or physical world, either dynamically or geometrically? Are there efforts to "teach" image generators the basics of the universe that human children learn so early?
A: Do they encode some understanding of the natural and physical world? I think definitely. If you ask the model to generate a stable configuration of blocks, it will definitely generate a stable block configuration. If you ask it to generate an unstable block configuration, it looks very unstable. Or if you say "a tree next to a lake", it can roughly generate that.
In some ways, these models seem to have captured a large amount of common sense. But what still puts them very far from truly understanding the natural and physical world is that when you try to generate uncommon combinations of words that you or I, in our minds, can imagine very easily, these models cannot.
For example, if you say "put a fork on top of a plate", that happens all the time. If you ask the model to generate this, it can easily do so. But if you say "put a plate on top of a fork", again it is very easy for us to imagine what that would look like. But if you put that into any of these large models, you will never get a plate on top of a fork. Instead, you get a fork on top of a plate, because the models learn to recapitulate all the images they were trained on. They can't generalize as well to combinations of words they haven't seen.
A fairly famous example is "an astronaut riding a horse", which the model can do easily. But if you say "a horse riding an astronaut", it still generates a person riding a horse. These models seem to capture a lot of the correlations in the datasets they are trained on, but they don't actually capture the underlying causal mechanisms of the world.
Another common example is a very complex text description, such as one object to the right of another, a third object in front, and a fourth object flying. In practice, the model only satisfies one or two of these constraints. This may be partly due to the training data, since such complex captions are rare. But it may also suggest that these models are not very structured. You can imagine that with very complex natural language prompts, the model cannot accurately capture all the details of the components.
Q: You recently developed a new method that uses multiple models to create more complex images with better understanding for generative art. Are there potential applications of this framework beyond the image and text domains?
A: We were really inspired by one of the limitations of these models. If you give them very complex scene descriptions, they are actually unable to correctly generate images that match them.
One thought is that, since it is a single model with a fixed computational graph, meaning only a fixed amount of computation can be used to generate an image, if you give it an extremely complex prompt, there is no way for it to use more computational power to generate that image.
If I give a human a description of a scene that is, say, 100 lines long versus one that is a single line long, a human artist can spend much more time on the former. These models don't really have that ability. We therefore propose that, given very complex prompts, you can compose many different independent models together, with each model representing a portion of the scene you want to describe.
We found that this enables our model to generate more complex scenes, or scenes that more accurately capture the different aspects of the scene together. In addition, this approach can be applied generally across a variety of domains. While image generation is probably the most successful application right now, generative models have actually been finding their way into all kinds of applications across fields. They can be used to generate diverse robot behaviors, synthesize 3D shapes, enable better scene understanding, or design new materials. You could potentially compose multiple desired factors to generate the exact material you need for a particular application.
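As a rough illustration of the composition idea, and assuming the diffusion sketch above, one simple way to combine independently conditioned models is to add their noise predictions relative to an unconditional baseline at every denoising step, in a product-of-experts style. The function below is a hypothetical sketch, not the authors' actual implementation; `denoiser`, `prompt`, and `weight` are assumed names.

```python
import torch

@torch.no_grad()
def composed_eps(denoiser, x, t, prompts, weight=2.0):
    """Combine noise predictions for several prompt fragments at one denoising step."""
    eps_uncond = denoiser(x, t, prompt=None)                # unconditional baseline
    eps = eps_uncond.clone()
    for p in prompts:                                        # e.g. ["a tree", "next to a lake"]
        eps += weight * (denoiser(x, t, prompt=p) - eps_uncond)
    return eps                                               # use in place of a single model's eps
```

Plugging `composed_eps` into the sampling loop above, with each fragment describing one part of the scene, captures the spirit of the approach; the same recipe carries over when `x` is a robot trajectory or another kind of data and each model encodes a different skill or constraint.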
One thing we have been very interested in is robotics. In the same way that you can generate different images, you can also generate different robot trajectories (the path and the schedule), and by composing different models you can generate trajectories with different combinations of skills. If I have natural language specifications for jumping and for avoiding an obstacle, you can compose these models together and then generate robot trajectories that can both jump and avoid an obstacle.
In a similar way, if we want to design proteins, we can specify different functions or aspects, analogous to how we use language to specify the content of images, with language-like descriptions such as the type or functionality of the protein. We could then compose these together to generate new proteins that can potentially satisfy all of the given functions.
We have also explored the use of diffusion models for 3D shape generation, where this approach can be applied to 3D asset generation and design. Designing 3D assets is typically a very complicated and laborious process. By composing different models together, it becomes much easier to generate shapes such as "I want a 3D shape with four legs, in this style and at this height," potentially automating parts of 3D asset design.