Saturday, March 7, 2026

Gemma Scope 2: Helping the AI safety community deepen its understanding of complex language model behavior


We are announcing a new open-source toolkit for interpreting language models

Large language models (LLMs) are capable of incredible feats of reasoning, but their internal decision-making processes remain largely opaque. If a system isn’t behaving as expected, the lack of insight into its inner workings can make it difficult to pinpoint the exact cause of its behavior. Last year, we advanced the science of interpretability with Gemma Scope, a toolkit designed to help researchers understand the inner workings of Gemma 2, our family of lightweight open models.

Today we are releasing Gemma Scope 2: a comprehensive, open suite of interpretability tools for all Gemma 3 sizes, from 270M to 27B. These tools can help us trace potential threats throughout the “brain” of the model.

To our knowledge, this is the largest-ever release of open-source interpretability tools from an AI lab. Producing Gemma Scope 2 required storing approximately 110 petabytes of data and training over 1 trillion total parameters.

As AI continues to advance, we look forward to the AI research community using Gemma Scope 2 to debug emerging model behavior, using these tools to better audit and debug AI agents, and ultimately accelerating the development of practical and robust safety interventions against problems such as jailbreaks, hallucinations, and sycophancy.

An interactive Gemma Scope 2 demo is available to try, courtesy of Neuronpedia.

What’s new in Gemma Scope 2

Interpretability research aims to understand the inner workings and learned algorithms of artificial intelligence models. As AI becomes more powerful and complex, interpretability is critical to building safe and reliable AI.

Like its predecessor, Gemma Scope 2 acts as a microscope for the Gemma family of language models. By combining sparse autoencoders (SAEs) and transcoders, it allows researchers to look inside models, see what they are thinking about, and understand how those thoughts arise and connect to the model’s behavior. This, in turn, enables richer exploration of jailbreaks and other safety-relevant AI behavior, such as discrepancies between the reasoning a model reports and its internal state.
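The SAE mechanism described above can be sketched in a few lines. The following is a minimal numpy illustration with random stand-in weights, not the actual Gemma Scope 2 architecture or API: a trained SAE’s encoder maps a model activation to a much wider, sparse feature vector (Gemma Scope uses a JumpReLU-style activation that zeroes features below a learned threshold), and its decoder reconstructs the activation from the few features that fire.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # toy sizes; real SAE dictionaries are far wider

# Random toy weights; in a real SAE these are trained to reconstruct
# model activations under a sparsity penalty.
W_enc = rng.normal(scale=0.5, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.5, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, theta=1.0):
    """Encode an activation into sparse features, then decode.

    JumpReLU-style activation: pre-activations at or below the
    threshold theta are zeroed, so only a few features stay active.
    """
    pre = (x - b_dec) @ W_enc
    features = pre * (pre > theta)
    return features, features @ W_dec + b_dec

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
features, x_hat = sae_forward(x)
print(f"{np.count_nonzero(features)} of {d_sae} features active")
```

In practice, each of the surviving feature directions tends to correspond to a human-interpretable concept, which is what lets researchers read off what the model is “thinking about” at a given layer.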

While the original Gemma Scope enabled research in key safety areas such as model hallucination, identifying secrets known to the model, and training safer models, Gemma Scope 2 supports even more ambitious research with significant improvements:

  • Full coverage at scale: We provide a complete set of tools for the entire Gemma 3 family (up to 27B parameters), necessary for studying emergent behaviors that only appear at large scale, such as those previously demonstrated by the 27B C2S-Scale model, which helped discover a potential new pathway for cancer therapy. While Gemma Scope 2 was not trained on that model, it is an example of the type of emergent behavior these tools may help us understand.
  • More sophisticated tools to decipher complex internal behavior: Gemma Scope 2 includes SAEs and transcoders trained on every layer of our Gemma 3 family of models. Skip transcoders and cross-layer transcoders make it easier to decipher multi-step computations and algorithms distributed throughout the model.
  • Advanced training techniques: We leverage the latest techniques, in particular the Matryoshka training technique, which helps SAEs learn more useful concepts and addresses some flaws discovered in the original Gemma Scope.
  • Tools for analyzing chatbot behavior: We also provide interpretability tools for the versions of Gemma 3 tuned for chat applications. These tools enable analysis of complex, multi-step behaviors such as jailbreaks, refusal mechanisms, and chain-of-thought faithfulness.
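To clarify how a transcoder differs from an SAE: rather than reconstructing the same activation it reads, a transcoder is trained so that sparse features computed from a layer’s input reconstruct that layer’s output, replacing a dense computation (such as an MLP block) with an interpretable sparse one. The sketch below is a toy numpy illustration with random stand-in weights, not the released Gemma Scope 2 models:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_dict = 8, 32  # toy sizes

# Random toy weights; a real transcoder is trained so that features read
# from a layer's INPUT reconstruct that layer's OUTPUT.
W_enc = rng.normal(scale=0.5, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.5, size=(d_dict, d_model))

def transcoder(x_in, theta=1.0):
    """Map a layer input to (sparse features, approximate layer output)."""
    pre = x_in @ W_enc
    features = pre * (pre > theta)     # only a handful of features fire
    return features, features @ W_dec  # sparse surrogate for the layer's output

x_in = rng.normal(size=d_model)  # stand-in for an MLP input activation
features, out_hat = transcoder(x_in)
print(f"{np.count_nonzero(features)} of {d_dict} features explain this step")
```

Because each computational step is rewritten as a small set of interpretable features, chaining such maps across layers (the role of skip and cross-layer transcoders, which this toy omits) lets researchers trace multi-step algorithms through the model.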
