However, interpretability researchers face a key problem: model activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that a network's features would line up with individual neurons, so that each neuron would represent one unit of information. Unfortunately, in practice a single neuron fires for many unrelated features. This means there is no obvious way to tell which features make up a given activation.
This is where sparse autoencoders come in.
A given activation will be a mixture of only a small number of features, even though the language model likely represents millions or even billions of them – in other words, the model uses each feature infrequently. For example, a language model will draw on relativity when answering a question about Einstein, and on eggs when writing about omelets, but it will probably not draw on relativity when writing about omelets.
Sparse autoencoders exploit this fact to learn a large dictionary of candidate features and decompose each activation into a small number of them. Researchers hope that the easiest way for a sparse autoencoder to accomplish this task is to find the actual underlying features used by the language model.
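To make this concrete, here is a minimal sketch of a sparse autoencoder's forward pass and training objective, in plain Python. All weights, dimensions, and function names are illustrative; real implementations use an ML framework and train on millions of activations.

```python
# A toy sparse autoencoder: encode a model activation into a larger feature
# space where most entries are zero, then reconstruct the activation.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def sae_forward(activation, W_enc, b_enc, W_dec, b_dec):
    # Encoder: project the activation into a much larger feature space,
    # then apply ReLU so that most feature activations are exactly zero.
    features = relu(vec_add(matvec(W_enc, activation), b_enc))
    # Decoder: rebuild the original activation from the few active features.
    reconstruction = vec_add(matvec(W_dec, features), b_dec)
    return features, reconstruction

def sae_loss(activation, reconstruction, features, l1_coeff=0.01):
    # Reconstruction error pushes the SAE to explain the activation;
    # the L1 penalty pushes it to use as few features as possible.
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1
```

Minimizing the reconstruction error alone would not give interpretable features; it is the sparsity penalty that forces each activation to be explained by only a handful of dictionary entries.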
Importantly, at no point in this process do we, the researchers, tell the sparse autoencoder what features to look for. Because of this, we can discover rich structures that we did not expect. However, since we do not immediately know what a discovered feature means, we look for meaningful patterns in text examples where the sparse autoencoder says the feature "fires".
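One simple way to surface such patterns can be sketched as follows: for a single feature, collect the tokens on which it fires most strongly and inspect them by hand. The helper name and the token/activation data below are made up for illustration.

```python
# Given (token, feature_activation) pairs for one feature, keep the tokens
# where the feature fired most strongly - these are what we inspect by hand.

def top_activating_tokens(examples, k=3):
    # Drop tokens where the feature did not fire at all, then sort the rest
    # by activation strength, strongest first.
    firing = [(tok, act) for tok, act in examples if act > 0.0]
    firing.sort(key=lambda pair: pair[1], reverse=True)
    return firing[:k]

# Illustrative data: this hypothetical feature seems to fire on
# physics-related tokens and stay silent on cooking-related ones.
examples = [
    ("Einstein", 4.2), ("omelet", 0.0), ("relativity", 5.1),
    ("eggs", 0.0), ("spacetime", 3.7), ("frying", 0.0),
]
```

If the strongest activations all share an obvious theme, that theme becomes our working hypothesis for what the feature represents.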
Here is an example in which the tokens where the feature fires are highlighted in shades of blue, darker for stronger activations:
