
# Introduction
Large Language Models (LLMs) can do many things. They can generate text that reads coherently. They can answer questions posed in natural language. Among many other skills, they can also analyze and organize text from other sources. But can LLMs analyze and report their own internal states – the activations in their components and layers – in a meaningful way? Put differently: can LLMs introspect?
This article provides an overview and summary of research on the emerging topic of LLM introspection into internal states, i.e., introspective awareness, along with additional insights and final conclusions. Specifically, we review and reflect on the research article Emergent introspective awareness in large language models.
NOTE: this article uses first-person pronouns (I, me, my) to refer to the author of this entry, and, unless otherwise noted, the term “authors” refers to the original researchers of the work under analysis (J. Lindsey et al.).
# Key concept explained: Introspective awareness
The study authors define the concept of introspective model awareness – previously defined in other related works under subtly different interpretations – based on four criteria.
But first, it is worth understanding what an LLM self-report is. It can be understood as the model’s verbal description of the “internal reasoning” (or, more technically, the neural activations) it had when generating a response. As you might guess, this amounts to a subtle behavioral window into the model’s interpretability, which (in my opinion) is more than enough to justify the importance of this research topic.
Let us now look at the four defining criteria of LLM introspective awareness:
- Accuracy: Introspective awareness means that the model’s self-report should correctly reflect activations or manipulations of its internal state.
- Grounding: The self-report must causally depend on the internal state; changes to the state should produce corresponding changes in the report.
- Internality: The self-report must draw on the model’s internal activations, rather than on inferences from its own generated text alone.
- Metacognitive representation: The self-report should reflect a higher-order internal representation of the state, not just a direct readout of it. This is a particularly difficult property to demonstrate and lies beyond the scope of the authors’ research.
# Research methodology and key conclusions
The authors’ central technique is concept injection. Activation vectors associated with familiar concepts such as “rice” or “ramen”, or with abstract nouns such as “peace” or “umami”, are extracted and added to the model’s residual stream as it flows from one layer to the next. The model is then prompted to self-report whether a particular thought or idea was injected and, if so, which one. The experiment was repeated for each model considered, at different injection strengths and at different layers of the model architecture.
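As a rough illustration of the mechanics (not the authors’ actual code, which is not public), the common “difference of means” recipe for extracting a concept vector and adding it to a residual stream can be sketched with toy NumPy arrays standing in for real transformer activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer's residual stream: (seq_len, d_model).
d_model = 64
residual = rng.normal(size=(8, d_model))

def concept_vector(acts_with, acts_without):
    """Concept vector = mean activation on prompts that evoke the concept
    minus the mean on prompts that don't (difference-of-means recipe)."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

# Hypothetical recorded activations for prompts with / without "ramen".
acts_with = rng.normal(loc=0.5, size=(32, d_model))
acts_without = rng.normal(loc=0.0, size=(32, d_model))
v = concept_vector(acts_with, acts_without)

def inject(residual, v, strength=4.0):
    """Add the scaled, unit-normalized concept vector to every position of
    the residual stream at a chosen layer, mimicking concept injection."""
    return residual + strength * (v / np.linalg.norm(v))

steered = inject(residual, v, strength=4.0)
```

In a real experiment, `steered` would replace the layer’s output (e.g., via a forward hook) before the model is asked whether it notices an injected thought; the `strength` parameter corresponds to the varying injection strengths the authors sweep over.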
The most vital findings were:
- Models occasionally (approximately 20% of the time for Opus 4.1) self-reported an “injected thought” and correctly named the associated concept, indicating some degree of internal recognition of these induced thoughts.
- Similarly often (above chance), models were able to distinguish injected thoughts from the text input, i.e., repeat the input text verbatim while also reporting the injected internal thought.
- When the model was artificially forced to output a specific token (a “prefill”) and a matching concept vector was injected into the activations preceding it, the model was more likely to accept that output as intended. This finding suggests that the model may be able to introspectively track its “own” prior intentions.
- When the model was asked to think about, or avoid thinking about, a specific word while generating unrelated output, the authors found that it could modulate its internal representation of that word, including in response to incentives such as “(don’t) think about idea X and you’ll get a reward.” However, this modulation tended to fade by the final layers.
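To make the experimental protocol behind these findings concrete, here is a hedged sketch of the injected-thought detection loop. `run_trial` is a hypothetical stub (the real harness injects actual activation vectors and parses the model’s reply), and the 20% hit rate is hard-coded purely to mimic the Opus 4.1 figure quoted above, not to reproduce real results:

```python
import random

CONCEPTS = ["rice", "ramen", "peace", "umami"]

def run_trial(concept, injected):
    # Hypothetical stub: a real harness would inject (or withhold) the
    # concept vector at a chosen layer, prompt the model to self-report,
    # and parse its reply. Detection is simulated here at ~20%.
    detected = injected and random.random() < 0.2
    named = concept if detected else None
    return detected, named

def score(n_trials=1000, seed=0):
    """Count correct detections (injection present, detected, and correctly
    named) and false alarms (injection reported when none occurred)."""
    random.seed(seed)
    hits = false_alarms = 0
    for _ in range(n_trials):
        concept = random.choice(CONCEPTS)
        injected = random.random() < 0.5  # inject on half the trials
        detected, named = run_trial(concept, injected)
        if injected and detected and named == concept:
            hits += 1
        elif detected and not injected:
            false_alarms += 1
    return hits, false_alarms

hits, false_alarms = score()
```

The false-alarm count matters as much as the hit rate: a model that claims to detect injected thoughts on control trials (with no injection) would show confabulation rather than introspection, which is why the paper runs both conditions.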
# Final thoughts and summary
This is, in my opinion, a research topic of very high importance that deserves further study for several reasons: first, and most obviously, LLM introspection may be key to better understanding not only LLM interpretability, but also long-standing problems such as hallucinations, unreliable reasoning on high-stakes problems, and other opaque behaviors that can sometimes be observed even in the most advanced models.
The experiments were labor-intensive and well designed, and their results were clear, signaling early but significant hints of introspective abilities in the models’ intermediate layers, albeit with varying levels of conclusiveness. The experiments are confined to models in the Claude family, and it would of course be interesting to see more diversity across architectures and model families beyond them. That said, there are understandable limitations here, such as restricted access to internal activations in other types of models and practical constraints when examining proprietary systems, not to mention that the authors of this research are affiliated with Anthropic, of course!
Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on applying artificial intelligence in the real world.
