However, interpretability researchers face a key problem: model activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that a network's features would line up with individual neurons, so that each neuron would represent one unit of information. Unfortunately, in practice a single neuron fires for many unrelated features. This means there is no obvious way to tell which features make up a given activation.
This is where sparse autoencoders come in.
A given activation will be a mixture of only a small number of features, even though the language model likely represents millions or even billions of them – in other words, the model uses each feature infrequently. For example, a language model will draw on relativity when answering a question about Einstein, and on eggs when writing about omelets, but it will probably not draw on relativity when writing about omelets.
Sparse autoencoders exploit this fact to learn a large dictionary of candidate features and decompose each activation into a small number of them. Researchers hope that the easiest way for a sparse autoencoder to accomplish this task is to find the actual underlying features used by the language model.
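To make this concrete, here is a minimal sketch of a sparse autoencoder's forward pass and training objective, in plain Python. All weights, dimensions, and function names are illustrative; real implementations use an ML framework and train on millions of activations.

```python
# A toy sparse autoencoder: encode a model activation into a larger feature
# space where most entries are zero, then reconstruct the activation.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def sae_forward(activation, W_enc, b_enc, W_dec, b_dec):
    # Encoder: project the activation into a much larger feature space,
    # then apply ReLU so that most feature activations are exactly zero.
    features = relu(vec_add(matvec(W_enc, activation), b_enc))
    # Decoder: rebuild the original activation from the few active features.
    reconstruction = vec_add(matvec(W_dec, features), b_dec)
    return features, reconstruction

def sae_loss(activation, reconstruction, features, l1_coeff=0.01):
    # Reconstruction error pushes the SAE to explain the activation;
    # the L1 penalty pushes it to use as few features as possible.
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1
```

Minimizing the reconstruction error alone would not give interpretable features; it is the sparsity penalty that forces each activation to be explained by only a handful of dictionary entries.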
Importantly, at no point in this process do we, the researchers, tell the sparse autoencoder what features to look for. Because of this, we can discover rich structures that we did not expect. However, since we do not immediately know what a discovered feature means, we look for meaningful patterns in text examples where the sparse autoencoder says the feature "fires".
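One simple way to surface such patterns can be sketched as follows: for a single feature, collect the tokens on which it fires most strongly and inspect them by hand. The helper name and the token/activation data below are made up for illustration.

```python
# Given (token, feature_activation) pairs for one feature, keep the tokens
# where the feature fired most strongly - these are what we inspect by hand.

def top_activating_tokens(examples, k=3):
    # Drop tokens where the feature did not fire at all, then sort the rest
    # by activation strength, strongest first.
    firing = [(tok, act) for tok, act in examples if act > 0.0]
    firing.sort(key=lambda pair: pair[1], reverse=True)
    return firing[:k]

# Illustrative data: this hypothetical feature seems to fire on
# physics-related tokens and stay silent on cooking-related ones.
examples = [
    ("Einstein", 4.2), ("omelet", 0.0), ("relativity", 5.1),
    ("eggs", 0.0), ("spacetime", 3.7), ("frying", 0.0),
]
```

If the strongest activations all share an obvious theme, that theme becomes our working hypothesis for what the feature represents.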
Here is an example in which the tokens where the feature fires are highlighted in shades of blue, darker for stronger activations:
