Technologies
We present a comprehensive, open suite of sparse autoencoders for interpreting language models.
To create an artificial intelligence (AI) language model, researchers build a system that learns from massive amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research area focused on deciphering these internal mechanisms. Researchers in this area use sparse autoencoders as a kind of ‘microscope’ that allows them to look inside a language model and better understand how it works.
Today, we are announcing Gemma Scope, a new set of tools to help researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We are also open-sourcing Mishax, a tool we built that enabled much of the interpretability work behind Gemma Scope.
We hope that today’s release will enable more ambitious interpretability research. Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against threats from autonomous AI agents, such as deception or manipulation.
Try our interactive Gemma Scope demo, courtesy of Neuronpedia.
Interpreting what’s happening inside the language model
As the model processes text input, the activations at different layers of its neural network represent increasingly abstract concepts, known as “features.”
For example, a model’s early layers might learn to recall facts, such as that Michael Jordan plays basketball, while later layers can recognize more complex concepts, such as whether the text is factual.
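To make the idea of layer-wise activations concrete, here is a minimal sketch of extracting them with the Hugging Face transformers library. The checkpoint name and the choice to read the residual stream are assumptions for illustration; this is not the Gemma Scope tooling itself.

```python
# Minimal sketch (assumed setup): inspecting the activations that different
# Gemma 2 layers produce for a piece of text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Michael Jordan plays basketball.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i] (i >= 1) is the
# residual stream after decoder layer i, shape (batch, seq_len, hidden_dim).
# These are the vectors a sparse autoencoder is trained to decompose into features.
for layer_idx, h in enumerate(outputs.hidden_states):
    print(layer_idx, tuple(h.shape))
```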
Stylized representation of using a sparse autoencoder to interpret a model’s activations as it recalls the fact that the City of Light is Paris. We see that concepts related to the French language are present, while unrelated ones are not.
However, interpretability researchers face a key problem: the model’s activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that the features in a neural network’s activations would line up with individual neurons, i.e., individual nodes of information. Unfortunately, in practice, neurons are active for many unrelated features. This means there is no obvious way to tell which features are part of a given activation.
This is where sparse autoencoders come in.
A given activation will only be a mixture of a small number of features, even though the language model is likely capable of detecting millions or even billions of them – that is, the model uses features sparsely. For example, a language model will invoke relativity when answering a question about Einstein and invoke eggs when writing about omelettes, but it will probably not invoke relativity when writing about omelettes.
Sparse autoencoders take advantage of this fact to discover a set of candidate features and decompose each activation into a small number of them. The researchers’ hope is that the best way for a sparse autoencoder to accomplish this task is to find the actual underlying features that the language model uses.
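As a rough illustration of this decomposition, the following sketch shows a toy sparse autoencoder that encodes an activation into a wide, mostly-zero feature vector and reconstructs it from the few active features. The dimensions and parameter names are illustrative assumptions, not the released Gemma Scope implementation.

```python
# Toy sketch of the sparse-decomposition idea, not the released Gemma Scope code.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_features))
        self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        # A ReLU keeps only positively-activated features; most entries become
        # zero once a sparsity penalty is applied during training.
        return torch.relu(x @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

    def forward(self, x):
        f = self.encode(x)          # sparse feature vector
        return self.decode(f), f    # reconstruction of the activation, features

# Illustrative widths: one residual-stream activation decomposed into 16k features.
sae = SparseAutoencoder(d_model=2304, d_features=16384)
x = torch.randn(1, 2304)
x_hat, f = sae(x)
print((f > 0).float().mean())  # fraction of active features (small after training)
```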
Importantly, at no point in this process do we, the researchers, tell the sparse autoencoder which features to look for. As a result, we can discover rich structure that we did not anticipate. However, because we do not know in advance what the discovered features mean, we look for meaningful patterns in text examples where the sparse autoencoder reports that a feature “fires”.
Here’s an example in which the tokens where the feature fires are highlighted with blue gradients according to its strength:
Example activations for a feature found by our sparse autoencoders. Each bubble is a token (a word or word fragment), and the varying blue color illustrates how strongly the feature is present. In this case, the feature is most clearly related to idioms.
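Visualisations like the one above are typically built by ranking tokens according to how strongly a single feature fires on them. The helper below is a hypothetical sketch of that step, assuming you already have per-token SAE feature activations (for example, from the toy encoder above).

```python
# Hypothetical helper: find the tokens on which one SAE feature fires most strongly.
import torch

def top_activating_tokens(feature_acts, tokens, feature_idx, k=10):
    """feature_acts: (seq_len, d_features) SAE activations for one text;
    tokens: list of seq_len token strings."""
    strengths = feature_acts[:, feature_idx]
    values, indices = strengths.topk(min(k, len(tokens)))
    return [(tokens[i], v) for v, i in zip(values.tolist(), indices.tolist())]

# Usage: with f of shape (seq_len, d_features) and the matching token strings,
# top_activating_tokens(f, token_strings, feature_idx=123) returns the tokens
# where feature 123 is strongest, which is what the highlighting visualises.
```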
What makes Gemma Scope unique
Previous research with sparse autoencoders has mainly focused on investigating the inner workings of small models, or a single layer in larger models. However, more ambitious interpretability research involves decoding layered, complex algorithms in larger models.
To build Gemma Scope, we trained sparse autoencoders on every layer and sublayer output of Gemma 2 2B and 9B, producing more than 400 sparse autoencoders with more than 30 million learned features in total (although many features likely overlap). This tool will enable researchers to study how features evolve throughout the model and how they interact and combine to form more complex features.
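The trained SAEs are published as weight files on the Hugging Face Hub. The sketch below shows one plausible way to download and inspect a single residual-stream SAE for Gemma 2 2B; the repository id, file path, and parameter names follow the published layout as we understand it and should be checked against the model card.

```python
# Sketch of fetching one released Gemma Scope SAE; repo id and file path are assumptions.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
)
params = np.load(path)
print(list(params.keys()))  # expected: encoder/decoder weights, biases, thresholds
```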
Gemma Scope is also trained with our new, state-of-the-art JumpReLU SAE architecture. The original sparse autoencoder architecture struggled to balance the twin goals of detecting which features are present and estimating their strength. The JumpReLU architecture makes it easier to strike this balance, significantly reducing error.
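Conceptually, a JumpReLU applies a learned per-feature threshold: pre-activations below the threshold are zeroed (the feature is treated as absent), while those above it pass through unchanged (preserving the estimate of its strength). The snippet below is an illustrative re-implementation of that activation function, not the released training code.

```python
# Illustrative JumpReLU: zero out pre-activations below a learned per-feature threshold.
import torch

def jumprelu(pre_acts: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    """pre_acts: (..., d_features) encoder pre-activations;
    threshold: (d_features,) learned per-feature thresholds."""
    return pre_acts * (pre_acts > threshold)

pre_acts = torch.tensor([[0.05, 0.80, -0.30, 1.20]])
threshold = torch.tensor([0.10, 0.10, 0.10, 0.50])
print(jumprelu(pre_acts, threshold))  # weak entries zeroed; strong ones keep their value
```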
Training so many sparse autoencoders was a significant engineering challenge, requiring a lot of computational power. We used about 15% of the training compute of Gemma 2 9B (excluding the compute for generating distillation labels), saved about 20 Pebibytes (PiB) of activations to disk (about as much as a million copies of the English Wikipedia), and produced hundreds of billions of sparse autoencoder parameters in total.
Pushing the field forward
By releasing Gemma Scope, we hope that Gemma 2 will become the premier model family for open research on mechanistic interpretability and will accelerate the work of the community in this area.
So far, the interpretability community has made great progress in understanding small models with sparse autoencoders and in developing relevant techniques, such as causal interventions, automatic circuit analysis, feature interpretation, and the evaluation of sparse autoencoders. With Gemma Scope, we hope the community will scale these techniques to modern models, analyze more complex capabilities like chain-of-thought, and find real-world applications of interpretability, such as tackling problems like hallucinations and jailbreaks that only arise with larger models.