
Image generated by the author with Ideogram
In recent years, generative AI models have emerged as a rising star, especially with the introduction of large language model (LLM) products such as ChatGPT. Using natural language that people can understand, these models process input data and produce appropriate output. Following the success of products like ChatGPT, other forms of generative AI have also become popular and mainstream.
Products such as DALL-E and Midjourney rose to prominence during the generative AI boom thanks to their ability to generate images from nothing more than natural-language input. These products do not create images out of thin air; instead, they rely on a class of models known as diffusion models.
In this article, we will break down the diffusion model to gain a deeper understanding of the technology behind it. We will discuss the fundamental concepts of how the model works and how it is trained.
Curious? Let’s get into it.
# Basics of the diffusion model
Diffusion models are a class of AI algorithms in the category of generative models, which are designed to generate new data that resembles their training data. In the case of diffusion models, this means they can create new images based on the input data provided.
However, diffusion models generate images through an unusual process: the model first adds noise to the data and then learns to remove it. Simply put, the diffusion model corrupts an image and then restores it to create the final product. You can think of it as a denoising model, because it learns to remove noise from images.
Formally, the diffusion model first appeared in the paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015). The paper introduced the concept of transforming data into noise using what is called the forward diffusion process, and then training a model to reverse that process and reconstruct the data, which is the denoising process.
Building on this foundation, the paper Denoising Diffusion Probabilistic Models by Ho et al. (2020) introduced a state-of-the-art diffusion framework that can produce high-quality images and outperform previously popular models such as generative adversarial networks (GANs). In general, a diffusion model consists of two critical stages:
- Forward (diffusion) process: The data is corrupted by gradually adding noise until it becomes indistinguishable from random static
- Reverse process: A neural network is trained to iteratively remove the noise, learning how to reconstruct an image from complete randomness
Let’s examine the components of the diffusion model to get a clearer picture.
// Forward process
The forward process is the first phase, in which an image is systematically degraded by adding noise until only random static remains.
The forward process is controlled and iterative, which we can summarize in the following steps:
- Start with the image from the data set
- Add a small amount of noise to the image
- Repeat this process many times (potentially hundreds or thousands), each time additionally damaging the image
After enough steps, the original image becomes indistinguishable from pure noise.
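The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the "image" is a stand-in random array, and the constant per-step noise fraction `beta` is an assumed, hypothetical value.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Iteratively corrupt an image with Gaussian noise (forward process)."""
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # Each step keeps most of the previous image and mixes in a small
        # amount of fresh noise; beta controls the mix so variance stays stable.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a real image
noisy = forward_diffusion(image, betas=[0.02] * 500, rng=rng)
# After many steps the original signal is essentially gone and the
# result is close to a sample of pure Gaussian noise.
```

Note how each iteration depends only on the previous array `x`, which is exactly the Markov-chain structure discussed below.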
The above process is often modeled mathematically as a Markov chain, because each noisy version depends only on the immediately preceding one, not on the entire sequence of steps.
But why should we gradually turn the image into noise instead of transforming it into noise in a single step? The goal is to let the model gradually learn to reverse the corruption. Small, incremental steps allow the model to learn how to move from noisier to less noisy data, which helps it rebuild an image step by step from pure noise.
To determine how much noise is added at each step, we use a concept called the noise schedule. For example, a linear schedule introduces noise at a constant rate over time, while a cosine schedule introduces noise more gradually and preserves useful image features for longer.
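As a sketch, here is how the two schedules are commonly defined. The linear values follow the defaults popularized by the DDPM paper, and the cosine form follows the improved-DDPM formulation of Nichol and Dhariwal (2021); `beta_start`, `beta_end`, and the offset `s` are typical assumed defaults.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear schedule: the noise fraction beta rises evenly over T steps."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction under the cosine schedule.

    Decays slowly at first, so image structure is preserved longer
    than under the linear schedule.
    """
    f = math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f / f0  # normalized so the value at t = 0 is 1.0 (all signal)
```

In practice a library would precompute these values once and index into them at each step.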
That’s a quick summary of the forward process. Now let’s learn about the reverse process.
// Reverse process
The next stage after the forward process turns the model into a generator that learns to transform noise back into an image. Through iterative small steps, the model can generate image data that did not exist before.
In general, the reverse process mirrors the forward process:
- Start with pure noise – a completely random image of Gaussian noise
- Iteratively remove the noise using a trained model that tries to undo each forward step. At each stage, the model takes the current noisy image and the current timestep as input, predicting how to reduce the noise based on what it learned during training
- Step by step, the image becomes clearer and clearer, resulting in the final image data
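The loop above can be sketched as follows. The `noise_predictor` argument is a hypothetical stand-in for the trained network (a real system would call a U-Net here), and the update rule follows the standard DDPM sampling step; the toy predictor at the bottom exists only to show that the loop runs.

```python
import numpy as np

def sample(noise_predictor, betas, shape, rng):
    """Reverse process: start from pure Gaussian noise, iteratively denoise."""
    alphas = 1.0 - np.asarray(betas)
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # step 1: a completely random image
    for t in reversed(range(len(betas))):
        eps = noise_predictor(x, t)  # predicted noise at timestep t
        # DDPM mean update: remove the predicted noise component.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Every step except the last injects a little fresh noise,
            # matching the stochastic reverse process.
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Toy predictor that always guesses "no noise" -- illustration only.
rng = np.random.default_rng(0)
out = sample(lambda x, t: np.zeros_like(x), [0.01] * 50, (8, 8), rng)
```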
This reverse process requires a model trained to denoise noisy images. Diffusion models often use a neural network architecture such as U-Net, an autoencoder-like design that combines convolutional layers in an encoder-decoder structure with skip connections. During training, the model learns to predict the noise component added during the forward process. At each step, the model also takes the timestep into account, which allows it to adapt its predictions to the current noise level.
The model is usually trained with a loss function such as mean squared error (MSE), which measures the difference between the predicted and the actual noise. By minimizing this loss over many examples, the model gradually becomes adept at reversing the diffusion process.
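A single training step can be sketched like this. It uses the closed-form noising trick from DDPM (jumping straight to timestep `t` instead of looping), and `noise_predictor` is again a hypothetical stand-in for the U-Net; a real trainer would backpropagate through this loss.

```python
import numpy as np

def training_loss(noise_predictor, x0, t, alpha_bars, rng):
    """One DDPM-style training example, scored with MSE."""
    eps = rng.standard_normal(x0.shape)  # the true noise we will hide in x0
    # Closed-form forward process: noise the clean image to timestep t directly.
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = noise_predictor(x_t, t)  # the model's guess at that noise
    return np.mean((eps - eps_hat) ** 2)  # MSE between actual and predicted noise

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(np.full(1000, 0.999))  # toy cumulative schedule
x0 = rng.random((8, 8))  # stand-in for a clean training image
loss = training_loss(lambda x_t, t: np.zeros_like(x_t), x0, 500, alpha_bars, rng)
```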
Compared to alternatives such as GANs, diffusion models offer greater training stability and a simpler generative path. The step-by-step denoising approach also leads to more expressive learning, making training more reliable and interpretable.
Once fully trained, generating a new image simply follows the reverse process summarized above.
// Text conditioning
In many text-to-image products, such as DALL-E and Midjourney, the reverse process can be steered using text prompts, a technique we call text conditioning. By integrating natural language, we can obtain a scene that matches the prompt rather than a random visualization.
The process works by using a pre-trained text encoder such as CLIP (Contrastive Language-Image Pre-training), which transforms the prompt text into a vector embedding. This embedding is then fed into the diffusion model’s architecture through a mechanism such as cross-attention, a type of attention mechanism that lets the model focus on specific parts of the text and adapt the image-generation process accordingly. At every step of the reverse process, the model analyzes the current state of the image together with the text prompt, using cross-attention to align the image with the semantics of the prompt.
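To make the idea concrete, here is a minimal sketch of scaled dot-product cross-attention in plain NumPy. It omits the learned query/key/value projection matrices a real layer would have, so it only illustrates how image positions attend to text-embedding tokens; the shapes are arbitrary assumptions.

```python
import numpy as np

def cross_attention(image_tokens, text_tokens):
    """Each image position 'looks at' the text tokens and gathers a
    weighted mix of them (no learned projections, for brevity)."""
    d = image_tokens.shape[-1]
    # Similarity between every image position (query) and text token (key).
    scores = image_tokens @ text_tokens.T / np.sqrt(d)
    # Softmax over text tokens: attention weights per image position.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of text tokens (values) for each image position.
    return weights @ text_tokens

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 4))  # 16 image positions, feature dim 4
txt = rng.standard_normal((5, 4))   # 5 text-embedding tokens, dim 4
attended = cross_attention(img, txt)
```

In a real diffusion U-Net, this operation is inserted at several layers so the denoising prediction is conditioned on the prompt at every resolution.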
This is the basic mechanism that allows DALL-E and Midjourney to generate images from prompts.
# What is the difference between DALL-E and Midjourney?
Both products use diffusion models as their foundation, but they differ slightly in their technical implementations.
For example, DALL-E uses a diffusion model guided by CLIP-based embeddings for text conditioning. Midjourney, on the other hand, uses its own proprietary diffusion architecture, which reportedly includes a refined image decoder optimized for high realism.
Both models also rely on cross-attention, but their guidance strategies differ. DALL-E emphasizes classifier-free guidance, which balances between unconditional and text-conditioned generation. Midjourney, in contrast, tends to prioritize stylistic interpretation, likely using a higher default classifier-free guidance scale.
DALL-E and Midjourney also differ in how they handle prompt length and complexity: DALL-E can manage longer prompts by processing them before they enter the diffusion pipeline, while Midjourney tends to produce better results with concise prompts.
There are more differences, but the ones above are what you should know about how these products use diffusion models.
# Conclusion
Diffusion models have become the foundation of modern text-to-image systems such as DALL-E and Midjourney. Using the core forward and reverse diffusion processes, these models can generate entirely new images from randomness. Moreover, they can use natural language to steer the results through mechanisms such as text conditioning and cross-attention.
I hope it helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
