Saturday, March 14, 2026

Beyond the GPT architecture: why Google's diffusion approach could transform LLM deployment



Last month, alongside a comprehensive suite of new AI tools and innovations, Google DeepMind unveiled Gemini Diffusion. This experimental research model uses a diffusion-based approach to generate text. Traditionally, large language models (LLMs) such as GPT and Gemini itself have relied on autoregression, a step-by-step approach in which each word is generated based on the previous one. Diffusion language models (DLMs), also known as diffusion-based large language models (dLLMs), take a method more commonly seen in image generation: starting with random noise and gradually refining it into coherent output. This approach dramatically increases generation speed and can improve coherency and consistency.

Gemini Diffusion is currently available as an experimental demo; sign up for the waitlist here to get access.

(Editor’s note: We will unpack paradigm shifts such as diffusion-based language models, and what it takes to run them in production, at VB Transform, June 24-25 in San Francisco, alongside Google DeepMind, LinkedIn and other enterprise AI leaders.)

Understanding diffusion vs. autoregression

Diffusion and autoregression are fundamentally different approaches. The autoregressive approach generates text sequentially, predicting tokens one at a time. While this method ensures strong coherence and context tracking, it can be computationally intensive and slow, especially for long-form content.

Diffusion models, by contrast, begin with random noise that is gradually refined into coherent output. When applied to language, the technique has several advantages: blocks of text can be processed in parallel, potentially producing entire segments or sentences at a much higher rate.
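The cost difference between the two decoding strategies can be sketched with a toy counter of model forward passes; everything here (the vocabulary, the pass counts, the fixed number of refinement steps) is an illustrative assumption, not a description of any real model:

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
forward_passes = {"count": 0}  # how many dependent model calls each strategy needs

def fake_model_pass():
    """Stand-in for one forward pass through the network."""
    forward_passes["count"] += 1

def autoregressive_generate(length):
    """Sequential decoding: one dependent forward pass per token."""
    tokens = []
    for _ in range(length):
        fake_model_pass()
        tokens.append(random.choice(VOCAB))
    return tokens

def diffusion_generate(length, steps=4):
    """Parallel decoding: start from noise, then refine the whole
    sequence in a fixed number of passes, regardless of its length."""
    tokens = [random.choice(VOCAB) for _ in range(length)]  # "random noise"
    for _ in range(steps):
        fake_model_pass()  # one pass updates every position at once
        tokens = [random.choice(VOCAB) for _ in tokens]
    return tokens

forward_passes["count"] = 0
autoregressive_generate(32)
ar_cost = forward_passes["count"]  # 32 dependent passes for 32 tokens

forward_passes["count"] = 0
diffusion_generate(32, steps=4)
dl_cost = forward_passes["count"]  # 4 passes, however long the sequence

assert ar_cost == 32 and dl_cost == 4
```

In this sketch the autoregressive path pays one sequential pass per token, while the diffusion path pays a fixed number of passes that each touch the whole sequence, which is where the parallelism (and the speed claims below) come from.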

Gemini Diffusion can reportedly generate 1,000-2,000 tokens per second. By contrast, Gemini 2.5 Flash has an average output speed of 272.4 tokens per second. Additionally, mistakes in generation can be corrected during the refinement process, improving accuracy and reducing the number of hallucinations. There may be trade-offs in terms of fine-grained, token-level accuracy; even so, the speed increase will be a game-changer for many applications.

How does diffusion-based text generation work?

During training, DLMs work by gradually corrupting a sentence with noise over many steps, until the original sentence is rendered entirely unrecognizable. The model is then trained to reverse this process, step by step, reconstructing the original sentence from increasingly noisy versions. Through iterative refinement, it learns to model the entire distribution of plausible sentences in the training data.

While the specifics of Gemini Diffusion have not yet been disclosed, the typical training methodology for a diffusion model involves these key stages:

Forward diffusion: For each sample in the training dataset, noise is added progressively over many cycles (often 500 to 1,000) until it becomes indistinguishable from random noise.

Reverse diffusion: The model learns to reverse each stage of the noising process, essentially learning how to "denoise" a corrupted sentence one step at a time, eventually restoring the original structure.

This process is repeated millions of times with diverse samples and noise levels, enabling the model to learn a reliable denoising function.
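Since Gemini Diffusion's training details are undisclosed, the forward stage above can only be sketched by analogy. For discrete text, open models such as LLaDA use masking-style corruption in place of Gaussian noise; the `forward_noise` function, the `MASK` symbol, and the step count below are illustrative assumptions in that spirit, not Google's implementation:

```python
import random

random.seed(1)
MASK = "<mask>"  # corruption symbol; continuous diffusion uses Gaussian noise instead

def forward_noise(tokens, t, num_steps=1000):
    """Forward diffusion at noise level t (0..num_steps): each token is
    independently replaced by MASK with probability t / num_steps, so the
    sentence degrades smoothly from intact (t=0) to pure noise (t=num_steps)."""
    p = t / num_steps
    return [MASK if random.random() < p else tok for tok in tokens]

sentence = "the cat sat on the mat".split()
assert forward_noise(sentence, 0) == sentence        # no corruption yet
assert forward_noise(sentence, 1000) == [MASK] * 6   # beyond recognition

# During training, the model repeatedly sees forward_noise(sentence, t) for a
# random t and is optimized to predict the original tokens at corrupted positions.
partly = forward_noise(sentence, 500)
```

The reverse-diffusion stage is then just the learned inverse of this function: given a partly masked sentence and its noise level, predict what was destroyed.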

Once trained, the model is capable of generating entirely new sentences. DLMs generally require a condition or input, such as a prompt, class label or embedding, to guide the generation toward desired outcomes. The condition is injected into each step of the denoising process, which shapes an initial blob of noise into structured and coherent text.
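How the conditioning signal steers generation can be shown with a toy denoising loop. The `denoiser` below is a trivial stand-in (a real one is a learned network attending over both the condition and the noisy sequence), and every name here is hypothetical:

```python
MASK = "<mask>"

def denoiser(noisy_tokens, condition):
    """Stand-in for the trained denoising network. A real model attends over
    both the condition (e.g. the prompt) and the noisy sequence and may revise
    any position; this toy simply fills masked slots from the condition."""
    words = condition.split()
    return [words[i % len(words)] if tok == MASK else tok
            for i, tok in enumerate(noisy_tokens)]

def conditional_generate(length, condition, steps=3):
    seq = [MASK] * length  # the initial blob of noise
    for _ in range(steps):
        # the condition is injected at *every* denoising step,
        # continuously steering the noise toward the desired output
        seq = denoiser(seq, condition)
    return seq

out = conditional_generate(4, "build a video chat")
assert MASK not in out  # the noise has fully resolved into tokens
```

The point of the sketch is the loop structure: the condition is not a one-time prefix, as in autoregression, but an input to every refinement pass.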

Advantages and disadvantages of diffusion-based models

In an interview with VentureBeat, Brendan O’Donoghue, research scientist at Google DeepMind and one of the leads on the Gemini Diffusion project, elaborated on some of the advantages of diffusion-based techniques compared to autoregression. According to O’Donoghue, the major advantages of diffusion techniques are as follows:

  • Lower latencies: Diffusion models can produce a sequence of tokens in much less time than autoregressive models.
  • Adaptive computation: Diffusion models converge to a sequence of tokens at different rates depending on the task's difficulty. This allows the model to consume fewer resources (and have lower latencies) on easy tasks and more on harder ones.
  • Non-causal reasoning: Due to the bidirectional attention in the denoiser, tokens can attend to future tokens within the same generation block. This allows non-causal reasoning to take place and lets the model make global edits within a block to produce more coherent text.
  • Iterative refinement / self-correction: The denoising process involves sampling, which can introduce errors, just as in autoregressive models. However, unlike autoregressive models, the tokens are passed back into the denoiser, which then has an opportunity to correct the error.
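The self-correction property in the last point can be sketched as a remasking loop, in which low-confidence tokens are sent back through the denoiser. The error rate, confidence scores and threshold below are invented purely for illustration:

```python
import random

random.seed(2)
MASK = "<mask>"
TARGET = "the quick brown fox jumps".split()  # stand-in for the "right" output

def noisy_denoiser(seq):
    """Toy denoiser: proposes the target token for each masked slot, but
    sampling occasionally yields a low-confidence error (30% of the time here)."""
    out, conf = [], []
    for i, tok in enumerate(seq):
        if tok != MASK:
            out.append(tok); conf.append(1.0)    # settled tokens stay put
        elif random.random() < 0.3:
            out.append("???"); conf.append(0.2)  # a sampling error
        else:
            out.append(TARGET[i]); conf.append(0.95)
    return out, conf

def generate(length=5, steps=50, threshold=0.5):
    seq = [MASK] * length
    for _ in range(steps):
        seq, conf = noisy_denoiser(seq)
        # unlike autoregression, an error is not final: re-mask low-confidence
        # tokens so the next denoising pass gets a chance to correct them
        seq = [MASK if c < threshold else t for t, c in zip(seq, conf)]
        if MASK not in seq:
            break
    return seq

result = generate()
assert "???" not in result  # every sampling error was caught and re-denoised
```

An autoregressive model that samples "???" is stuck with it; in this loop the mistake is simply re-masked and regenerated on the next pass.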

O’Donoghue also noted the main disadvantages: "higher cost of serving and a slightly higher time-to-first-token (TTFT), because autoregressive models produce the first token right away. For diffusion, the first token can only appear when the entire sequence of tokens is ready."
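The trade-off O’Donoghue describes is easy to see with a back-of-the-envelope latency model; the per-token and per-pass timings below are made-up illustrative numbers, not measured figures for any real model:

```python
def autoregressive_latency(n_tokens, seconds_per_token):
    """First token arrives after a single step; total time grows with length."""
    ttft = seconds_per_token
    total = n_tokens * seconds_per_token
    return ttft, total

def diffusion_latency(n_tokens, denoise_steps, seconds_per_pass):
    """All tokens arrive together, so TTFT equals the full denoising time."""
    total = denoise_steps * seconds_per_pass
    return total, total

# Illustrative only: a 256-token answer at ~250 tok/s autoregressive,
# versus 8 parallel denoising passes at 20 ms each.
ar_ttft, ar_total = autoregressive_latency(256, 0.004)
dl_ttft, dl_total = diffusion_latency(256, 8, 0.020)

assert ar_ttft < dl_ttft    # autoregression wins on time-to-first-token
assert dl_total < ar_total  # diffusion wins on total generation time
```

Under these assumptions the diffusion side waits longer for the first visible token but finishes the whole answer sooner, which is exactly the TTFT-versus-throughput trade captured in the quote.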

Benchmark performance

Google says that Gemini Diffusion's performance is comparable to Gemini 2.0 Flash-Lite.

Benchmark | Type | Gemini Diffusion | Gemini 2.0 Flash-Lite
LiveCodeBench (v6) | Code | 30.9% | 28.5%
BigCodeBench | Code | 45.4% | 45.8%
LBPP (v2) | Code | 56.8% | 56.0%
SWE-Bench Verified* | Code | 22.9% | 28.5%
HumanEval | Code | 89.6% | 90.2%
MBPP | Code | 76.0% | 75.8%
GPQA Diamond | Science | 40.4% | 56.5%
AIME 2025 | Mathematics | 23.3% | 20.0%
BIG-Bench Extra Hard | Reasoning | 15.0% | 21.0%
Global MMLU (Lite) | Multilingual | 69.1% | 79.0%

* Non-agentic evaluation (single turn edit only), max prompt length 32K.

The two models were compared using several benchmarks, with scores based on how often the model produced the correct answer on the first try. Gemini Diffusion performed well in coding and mathematics tests, while Gemini 2.0 Flash-Lite had the edge on reasoning, scientific knowledge and multilingual capabilities.

As Gemini Diffusion matures, there is no reason to think its performance won't catch up with more established models. According to O’Donoghue, the gap between the two techniques is "essentially closed in terms of benchmark performance, at least at the relatively small sizes we have scaled up to. Indeed, there may be some performance advantage for diffusion in some domains where non-local consistency is important, for example, coding and reasoning."

Testing Gemini Diffusion

VentureBeat was granted access to the experimental demo. When putting Gemini Diffusion through its paces, the first thing we noticed was speed. When running the suggested prompts provided by Google, including building interactive HTML apps like Xylophone and Planet Tac Toe, each request completed in under three seconds, with speeds ranging from 600 to 1,300 tokens per second.

To test its performance on a real-world application, we asked Gemini Diffusion to build a video chat interface with the following prompt:

Build an interface for a video chat application. It should have a preview window that accesses the camera on my device and displays its output. The interface should also have a sound level meter that measures the output from the device's microphone in real time.

In less than two seconds, Gemini Diffusion created a working interface with a video preview and an audio meter.

Though it was not a complex implementation, it could be the start of an MVP that can be completed with a small amount of further prompting. Note that Gemini 2.5 Flash also produced a working interface, albeit at a slightly slower pace (approximately seven seconds).

Gemini Diffusion also features "Instant Edit," a mode in which text or code can be pasted in and edited in real time with minimal prompting. Instant Edit is effective for many types of text editing, including correcting grammar, updating text to target different reader personas, or adding SEO keywords. It is also useful for tasks such as refactoring code, adding new features to an application, or converting an existing codebase to a different language.

Enterprise use cases for DLMs

It's safe to say that any application requiring a quick response time stands to benefit from DLM technology. This includes real-time and low-latency applications, such as conversational AI and chatbots, live transcription and translation, or IDE autocomplete and coding assistants.

According to O’Donoghue, with applications that leverage "inline editing, for example, taking a piece of text and making some changes in-place, diffusion models are applicable in ways autoregressive models aren't." DLMs also have an advantage with reasoning, math and coding problems, due to "the non-causal reasoning afforded by the bidirectional attention."

DLMs are still in their infancy; however, the technology could potentially transform how language models are built. Not only do they generate text at a much higher rate than autoregressive models, but their ability to go back and fix mistakes means that they may, eventually, also produce results with greater accuracy.

Gemini Diffusion enters a growing ecosystem of DLMs, with two notable examples being Mercury, developed by Inception Labs, and LLaDA, an open-source model from GSAI. Together, these models reflect the broader momentum behind diffusion-based language generation and offer a scalable, parallelizable alternative to traditional autoregressive architectures.
