Researchers at New York University have developed a new architecture for diffusion models that improves the semantic representation of the images they generate. "Diffusion Transformers with Representation Autoencoders" (RAE) challenges some of the accepted norms of building diffusion models. The NYU researchers' model is more efficient and accurate than standard diffusion models, leverages the latest research in representation learning, and could pave the way for new applications that were previously too challenging or expensive.
This breakthrough could unlock more robust and efficient features for enterprise applications. "To edit photos well, a model needs to really understand what's in them," co-author Saining Xie told VentureBeat. "RAE helps connect the understanding part with the generation part." He also pointed to future applications in "RAG-based generation, where RAE encoder features are used to search and then generate new images based on the search results," as well as "video generation and action-conditioned world models."
The state of generative modeling
Diffusion models, the technology behind most of today's powerful image generators, frame generation as a process of learning to compress and decompress images. A variational autoencoder (VAE) learns a compact representation of an image's key features in a so-called "latent space." The diffusion model is then trained to generate new images by reversing that process, starting from random noise.
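The pipeline described above can be sketched in a few lines. The following toy example is illustrative only (linear maps stand in for the VAE and the denoiser; all names and sizes are hypothetical, not the paper's architecture): an image is compressed into a latent, the latent is corrupted with noise, and the denoiser takes one training step toward predicting that noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: linear maps for the VAE encoder/decoder,
# and a linear denoiser trained to predict the added noise.
D, L = 64, 8                                   # "pixel" dim, latent dim
W_enc = rng.normal(size=(D, L)) / np.sqrt(D)   # VAE encoder
W_dec = rng.normal(size=(L, D)) / np.sqrt(L)   # VAE decoder
W_den = np.zeros((L, L))                       # denoiser, trained below

x = rng.normal(size=(4, D))      # a batch of fake "images"
z = x @ W_enc                    # compress into the latent space

t = 0.5                          # a single noise level, for illustration
eps = rng.normal(size=z.shape)
z_noisy = (1 - t) * z + t * eps  # corrupt the latent

# One SGD step: teach the denoiser to predict the noise it must remove.
pred = z_noisy @ W_den
W_den -= 0.1 * (2 * z_noisy.T @ (pred - eps) / eps.size)

# Generation would iterate this denoising starting from pure noise,
# then map the cleaned latent back to pixels with the decoder.
x_hat = z @ W_dec
```

Real systems replace each linear map with a deep network and run many denoising steps across a schedule of noise levels, but the division of labor is the same: the autoencoder defines the latent space, and the diffusion model learns to denoise within it.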
Although the diffusion portion of these models has advanced significantly, the autoencoder used in most of them has remained largely unchanged in recent years. According to NYU researchers, this standard autoencoder (SD-VAE) is suitable for capturing low-level features and local appearance, but lacks “the global semantic structure crucial for generalization and generative performance.”
At the same time, the field has made impressive progress in learning image representations with models such as DINO, MAE, and CLIP. These models learn semantically structured visual features that can be generalized across a variety of tasks and can serve as a natural basis for visual understanding. However, a common belief prevents developers from using these architectures to generate images: semantic-centric models are unsuitable for image generation because they do not capture detailed pixel-level features. Practitioners also believe that diffusion models do not perform well for the high-dimensional representations produced by semantic models.
Diffusion with representation encoders
The NYU researchers propose replacing the standard VAE with "representation autoencoders" (RAE). This new type of autoencoder pairs a pre-trained representation encoder, such as Meta's DINO, with a trained vision transformer decoder. This approach simplifies training by building on existing, efficient encoders that have already been trained on huge datasets.
To make this work, the team developed a variant of the diffusion transformer (DiT), the architecture underlying most image generation models. This modified DiT can be trained efficiently in the high-dimensional RAE space without incurring huge computational costs. The researchers showed that frozen representation encoders, even those optimized for semantics, can be adapted to image generation tasks. Their method yields reconstructions that surpass the standard SD-VAE without adding architectural complexity.
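A minimal sketch of the RAE idea, under the assumption (consistent with the description above) that the pretrained encoder stays frozen while only the decoder is trained to reconstruct pixels. The "encoder" here is a random linear map standing in for DINO, and the decoder is a single linear layer; every name and size is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a frozen linear "pretrained encoder" in place
# of DINO, and a trainable linear decoder learned by reconstruction.
D, H = 64, 32                    # pixel dim; the representation dim is
                                 # higher than a typical VAE latent
W_frozen = rng.normal(size=(D, H)) / np.sqrt(D)  # encoder: never updated
W_dec = np.zeros((H, D))                         # decoder: trained from scratch

x = rng.normal(size=(16, D))     # batch of fake "images"
h = x @ W_frozen                 # semantic representation (frozen)

# Train only the decoder to map representations back to pixels.
for _ in range(200):
    x_hat = h @ W_dec
    W_dec -= 0.05 * (2 * h.T @ (x_hat - x) / x.size)
```

The design choice this illustrates: because the encoder's weights never receive gradients, all of the semantic structure learned during pretraining is preserved, and the diffusion model (not shown here) is then trained directly in the `h` space rather than in a VAE latent.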
However, adopting this approach requires a change of thinking. “RAE is not a simple plug-and-play autoencoder; the diffusion modeling part also needs to evolve,” Xie explained. “One of the key points we want to emphasize is that latent space modeling and generative modeling should be designed together, not treated separately.”
After making the appropriate architectural adjustments, the researchers found that high-dimensional representations were an advantage, offering richer structure, faster convergence, and better generation quality. In their paper, the researchers note that this "higher-dimensional latent information does not actually incur any additional computational or memory costs." Moreover, the standard SD-VAE is computationally more expensive, requiring approximately six times as much compute for the encoder and three times as much for the decoder compared to RAE.
Greater efficiency and performance
The new model architecture delivers significant gains in both training efficiency and generation quality. The team's improved diffusion recipe achieves strong results after just 80 training epochs. Compared to previous diffusion models trained on VAEs, the RAE-based model achieves a 47x training speedup. It also outperforms state-of-the-art methods based on representation alignment, with a 16x training speedup. This level of efficiency translates directly into lower training costs and faster model development cycles.
For enterprise applications, this means more reliable and consistent results. Xie noted that RAE-based models are less susceptible to the semantic errors found in classical diffusion, adding that RAE gives the model a "much smarter look at the data." He pointed out that leading models such as ChatGPT-4o and Google's Nano Banana are moving toward "topic-oriented, highly consistent and knowledge-enriched generation," and that semantically rich RAE foundations are key to achieving this reliability at scale and in open-source models.
The researchers demonstrated this performance on the ImageNet benchmark. Measured by Fréchet inception distance (FID), where a lower score indicates higher-quality images, the RAE-based model achieved a state-of-the-art score of 1.51 without guidance. With AutoGuidance, a technique that uses a smaller model to steer the generation process, the FID score dropped to an even more impressive 1.13 for both 256×256 and 512×512 images.
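For readers unfamiliar with the metric: FID fits a Gaussian to the feature embeddings of real images and another to those of generated images (the features traditionally come from an Inception network, hence the name), then computes the Fréchet distance between the two Gaussians, FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). A self-contained sketch of that standard formula:

```python
import numpy as np

def sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0, None)        # guard tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(feat1, feat2):
    """Frechet distance between Gaussians fit to two feature sets.

    feat1, feat2: (n_samples, n_features) arrays of embeddings, e.g.
    Inception activations for real and generated images.
    """
    mu1, mu2 = feat1.mean(0), feat2.mean(0)
    s1 = np.cov(feat1, rowvar=False)
    s2 = np.cov(feat2, rowvar=False)
    # Tr((s1 s2)^1/2) computed via the symmetric form s1^1/2 s2 s1^1/2,
    # which keeps the intermediate matrix PSD.
    s1_half = sqrtm_psd(s1)
    covmean = sqrtm_psd(s1_half @ s2 @ s1_half)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))
```

Identical feature sets give a score near zero, and any shift in the mean or covariance of the generated features pushes the score up, which is why lower FID indicates images statistically closer to the real distribution.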
By successfully integrating modern representation learning into a diffusion framework, this work opens a new path to building more capable and cost-effective generative models. This unification points to a future of more integrated AI systems.
“We believe that in the future there will be a single, unified representation model that captures the rich, underlying structure of reality… capable of being decoded into many different outputs,” Xie said. He added that RAE offers a distinctive path to this goal: “The high-dimensional latent space should be learned separately to provide strong prior information that can then be decoded in various ways, rather than relying on a brute-force approach of mixing all the data and training with multiple targets at once.”
