Over the past few years, autoregressive Transformers have delivered a steady stream of breakthroughs in generative modeling. These models generate each element of a sample—the pixels of an image, the characters of a piece of text (usually in chunks called "tokens"), the samples of an audio waveform, and so on—by predicting one element at a time. When predicting the next element, the model can look back at the elements that were generated earlier.
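As a rough sketch of that loop (the `model` below is a hypothetical stand-in that maps a sequence to a probability distribution over the next element, not anything from our released code):

```python
import numpy as np

def sample_autoregressively(model, prompt, num_steps, rng=np.random.default_rng(0)):
    """Generate num_steps new elements, one at a time.

    `model` is assumed to map a list of token ids to a 1-D probability
    distribution (summing to 1) over the next token.
    """
    sequence = list(prompt)
    for _ in range(num_steps):
        probs = model(sequence)                       # condition on everything generated so far
        next_token = rng.choice(len(probs), p=probs)  # sample one new element
        sequence.append(int(next_token))              # it becomes context for the next step
    return sequence
```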
However, each Transformer layer becomes more expensive as more elements are used as input, and practitioners can typically only afford to train deep Transformers on sequences of no more than about 2,048 elements. As a result, most Transformer-based models ignore all elements beyond the recent past (about 1,500 words, or 1/6 of a small image) when making a prediction.
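A back-of-the-envelope illustration (not from the post) of why the cost grows so quickly: self-attention compares every position with every other position, so the score matrix alone grows quadratically with sequence length.

```python
# Self-attention compares every position with every other, so the score
# matrix alone has n * n entries per head, per layer.
for n in (1_024, 2_048, 8_192, 65_536):
    print(f"n = {n:6d}: {n * n:>13,d} attention scores per head, per layer")
```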
In contrast, our recently developed Perceiver models have shown excellent results on a variety of real-world tasks with inputs of up to about 100,000 elements. Perceiver uses cross-attention to encode the input into a latent space, decoupling the input's compute requirements from the model's depth. Perceivers also spend a fixed cost, regardless of input size, at nearly every layer.
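A minimal NumPy sketch of that idea (shapes and names are illustrative, not the released implementation): a fixed number of latents cross-attend to an arbitrarily long input, so every later layer that operates on the latents costs the same no matter how long the input was.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs):
    """Latents of shape (num_latents, d) attend to inputs of shape (input_len, d).

    The score matrix is (num_latents, input_len): linear in input length,
    while later self-attention over the latents is independent of it.
    """
    scores = latents @ inputs.T / np.sqrt(latents.shape[-1])
    return softmax(scores) @ inputs

rng = np.random.default_rng(0)
d = 64
inputs = rng.standard_normal((16_384, d))   # long input sequence (kept modest so the toy runs quickly)
latents = rng.standard_normal((256, d))     # small, fixed-size latent array
z = cross_attend(latents, inputs)
print(z.shape)                              # (256, 64) -- independent of the input length
```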
While latent-space encoding handles all elements in a single pass, autoregressive generation assumes that processing happens one element at a time. Perceiver AR resolves this tension with a straightforward solution: align the latents one by one with the final elements of the input, and carefully mask the cross-attention so that each latent only sees earlier elements.
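Under that description, the cross-attention mask can be sketched as follows (illustrative shapes and names, not the released code): latent `i` is aligned with input position `input_len - num_latents + i` and may only attend to that position and the ones before it.

```python
import numpy as np

def perceiver_ar_cross_mask(input_len, num_latents):
    """Causal cross-attention mask of shape (num_latents, input_len).

    Latent i is aligned with input position (input_len - num_latents + i)
    and may only attend to that position and earlier ones, preserving the
    autoregressive property.
    """
    positions = np.arange(input_len)                             # (input_len,)
    targets = input_len - num_latents + np.arange(num_latents)   # (num_latents,)
    return positions[None, :] <= targets[:, None]                # True where attention is allowed

mask = perceiver_ar_cross_mask(input_len=8, num_latents=3)
print(mask.astype(int))
# [[1 1 1 1 1 1 0 0]
#  [1 1 1 1 1 1 1 0]
#  [1 1 1 1 1 1 1 1]]
```

In this setup, the latents can then self-attend with an ordinary causal mask, with each latent predicting the element that follows its aligned position, so generation stays autoregressive while the cross-attention still ranges over the full long context.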
The result is an architecture (pictured above) that attends to inputs up to 50 times longer than standard Transformers can handle, while deploying as widely (and essentially as easily) as standard decoder-only Transformers.
Perceiver AR scales considerably better with size than both standard Transformers and Transformer-XL at a range of real-world sequence lengths. This property allows us to build very capable long-context models. For example, we found that a 60-layer Perceiver AR with a context length of 8,192 outperforms a 42-layer Transformer-XL on a book-length generation task, while running faster in wall-clock terms.
On standard long-context benchmarks for images (ImageNet 64×64), language (PG-19), and music (MAESTRO), Perceiver AR achieves state-of-the-art results. Increasing the input context by decoupling input size from compute budget leads to several intriguing results:
- The compute budget can be adjusted at evaluation time, allowing us to spend less and smoothly degrade quality, or spend more to improve generation.
- Larger context allows Perceiver AR to outperform Transformer-XL, even at higher compute cost. We found that larger context leads to improved model performance even at modest scales (~1B parameters).
- Perceiver AR's sample quality is much less sensitive to the order in which elements are generated. This makes Perceiver AR easy to apply to settings without a natural left-to-right ordering, such as data like images, whose structure spans more than one dimension.
Using a dataset of piano music, we trained Perceiver AR to generate new pieces of music from scratch. Because each new note is predicted based on the full sequence of notes that came before it, Perceiver AR is able to produce pieces with a high level of melodic, harmonic, and rhythmic coherence: