Monday, March 9, 2026

How Transformers Think: The Information Flow that Makes Language Models Work



# Entry

Thanks to large language models (LLMs), we now have impressive, extremely useful applications such as Gemini, ChatGPT, and Claude, just to name a few. However, not many people realize that the architecture underlying LLMs is called the transformer. This architecture has been carefully designed to “think” – that is, process data describing human language – in a very specific and somewhat unique way. Are you interested in gaining a broad understanding of what happens inside these so-called transformers?

This article describes, in a gentle, understandable, and rather non-technical tone, how the transformer models behind LLMs parse input information such as user prompts, and how they generate coherent, meaningful, and relevant text output word by word (or, a bit more technically, token by token).

# Initial steps: making language understandable by machines

The first key concept to understand is this: AI models don’t actually understand human language; they only understand numbers and operate on them, and the transformers behind LLMs are no exception. Therefore, human language – that is, text – must be transformed into a form the transformer can work with before it can be deeply processed.

In other words, the first few steps before reaching the core, innermost layers of the transformer are primarily focused on transforming raw text into a numerical representation that, under the hood, retains the key properties and features of the original text. Let’s analyze these three steps.

Making language understandable by machines (click to enlarge)

// Tokenization

The tokenizer is the first actor to appear on the scene when interacting with a transformer model, and it is responsible for dividing the raw text into small pieces called tokens. Depending on the tokenizer used, these tokens are in most cases equivalent to words, but sometimes tokens can also be parts of words or punctuation marks. Moreover, each token in a given language has a unique numeric identifier. This is the point where text stops being text and becomes numbers, all at the token level, as shown in this example, where a simple tokenizer converts text containing five words into five token IDs, one per word:

Tokenization of text into token IDs
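To make this concrete, here is a toy word-level tokenizer in Python. It is only a sketch of the idea: real tokenizers (such as BPE subword tokenizers) use a fixed, pre-trained vocabulary and can split rare words into pieces, whereas this one simply builds its vocabulary on the fly and assigns one ID per word.

```python
# A toy tokenizer: maps each distinct word to a unique integer ID.
# Real tokenizers (e.g. BPE) also split rare words into subword pieces.
vocab = {}

def tokenize(text):
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)   # assign the next free ID
        ids.append(vocab[word])
    return ids

print(tokenize("the cat sat on the mat"))  # → [0, 1, 2, 3, 0, 4]
```

Note that the repeated word “the” maps to the same ID both times: tokenization is a deterministic lookup, not a per-occurrence assignment.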

// Token embedding

Each token ID is then transformed into a ( d )-dimensional vector, which is a list of numbers of size ( d ). This representation of a token as an embedding is like a description of the overall meaning of that token, whether it is a word, part of a word, or a punctuation mark. The magic is that tokens associated with similar concepts or meanings, such as queen and empress, will have similar embedding vectors.
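A small sketch of this idea: the embedding vectors below are invented for illustration (real embeddings are learned during training and have hundreds or thousands of dimensions), but they show how cosine similarity captures “similar meaning means similar vectors”.

```python
import math

# Toy 4-dimensional embeddings; the values are invented for illustration.
embeddings = {
    "queen":   [0.9, 0.8, 0.1, 0.0],
    "empress": [0.8, 0.9, 0.2, 0.1],
    "bicycle": [0.0, 0.1, 0.9, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(embeddings["queen"], embeddings["empress"]))  # close to 1
print(cosine(embeddings["queen"], embeddings["bicycle"]))  # close to 0
```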

// Positional encoding

So far, a token embedding contains information in the form of a list of numbers, but this information is still associated with a single token in isolation. However, in a “piece of language” such as a text sequence, it is vital to know not only the words or symbols it contains, but also their position in the text of which they are part. Positional encoding is a process that uses mathematical functions to inject into each token additional information about its position in the original text sequence.
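One common scheme (the sinusoidal encoding from the original transformer design) can be sketched in a few lines; the dimensions and the example embedding below are toy values for illustration only. The key point is that the position vector is simply added, element by element, to the token’s embedding.

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal positional encoding: even dims use sine, odd dims cosine."""
    pe = []
    for i in range(d):
        angle = pos / (10000 ** (2 * (i // 2) / d))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# The position vector is added to the token's embedding (toy 4-dim values):
token_embedding = [0.9, 0.8, 0.1, 0.0]
with_position = [e + p for e, p in zip(token_embedding, positional_encoding(3, 4))]
```

Because each position produces a distinct pattern of sines and cosines, two identical tokens at different places in the text end up with different final vectors.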

# Transformation through the core of the transformer model

Now that the numerical representation of each token contains information about its position in the text sequence, it is time to enter the first layer of the main body of the transformer model. The transformer has a very deep architecture, with many stacked components replicated throughout the system. There are two types of transformer layers – the encoder layer and the decoder layer – but for simplicity, we will not make a detailed distinction between them in this article. For now, remember that there are two types of layers in a transformer, even though both have much in common.

Transformation through the core of the transformer model (click to enlarge)

// Multi-head attention

This is the first major subprocess that occurs in a transformer layer, and perhaps the most influential and distinctive feature of transformer models compared to other types of AI systems. Multi-head attention is a mechanism that allows a token to observe or “pay attention” to other tokens in the sequence. It collects useful contextual information and incorporates it into the token’s own representation, capturing linguistic aspects such as grammatical relationships, relationships between words that are not necessarily adjacent in the text, or semantic similarities. In short, this mechanism manages to capture various aspects of relevance and relationships between parts of the original text. As a token representation passes through this component, it gains a richer, more context-aware representation of itself and the text to which it belongs.
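The core computation behind a single attention head can be sketched as follows. This is a simplified version: real transformers first multiply the inputs by learned projection matrices to form queries, keys, and values, and run several such heads in parallel; here each token vector plays all three roles directly.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head (no learned projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # How relevant is each other token to this one?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors, weighted by relevance.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Self-attention: each token's vector serves as query, key, and value.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_aware = attention(x, x, x)
```

Each output row is a weighted blend of all the input token vectors, which is exactly how a token’s representation becomes “context-aware”.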

Some transformer architectures built for specific tasks, such as translating text from one language to another, also analyze possible relationships between tokens through this mechanism, looking at both the input text and the output (translated) text generated so far, as shown below:

Multi-headed attention in translation transformers

// Feed-forward neural network sublayer

Simply put, the second common stage in each replicated transformer layer is a set of interconnected neural network layers that further process the enriched token representations and help the model learn additional patterns from them. This process is like further sharpening these representations, identifying and strengthening vital features and patterns. Ultimately, these layers constitute a mechanism for gradually learning a general, increasingly abstract understanding of the entire text being processed.
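A minimal sketch of this sublayer, applied to one token’s vector: it expands the representation into a larger hidden layer, applies a nonlinearity (ReLU here), and projects back down. All weights below are invented toy values; in a real model they are learned during training.

```python
def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, W, b):
    # y = W @ x + b, with W given as a list of rows
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward sublayer: expand, apply ReLU, project back."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Toy example: 2-dim token representation, 4-dim hidden layer (weights invented).
W1 = [[1, 0], [0, 1], [1, 1], [-1, 1]]; b1 = [0, 0, 0, 0]
W2 = [[1, 0, 0, 0], [0, 1, 0, 0]];      b2 = [0, 0]
print(feed_forward([0.5, -0.5], W1, b1, W2, b2))  # → [0.5, 0.0]
```

The same small network is applied independently to every token position, which is why it is called “position-wise”.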

The process of passing through the multi-head attention and feed-forward sublayers is repeated many times in this order: as many times as there are replicated transformer layers.

// Final destination: next word prediction

After repeating the previous two steps multiple times, the evolving token representations of the initial text should allow the model to gain a very deep understanding, enabling it to recognize complicated and subtle relationships. At this point we come to the last piece of the transformer stack: a special layer that transforms the final representation into a probability for each possible token in the vocabulary. This means that we calculate – based on all the information acquired along the way – the probability that each token in the vocabulary is the next one the transformer model (or LLM) should output. The model ultimately selects the token or word with the highest probability, which it generates as part of the result for the end user. The entire process is repeated for each word generated in the model’s response.
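This final step can be sketched in a few lines. The raw scores (logits) and the tiny four-token vocabulary below are invented for illustration; a real model produces one logit per token in a vocabulary of tens of thousands, and may sample from the distribution rather than always taking the most likely token (greedy decoding, shown here).

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer scores for a toy 4-token vocabulary:
vocab = ["cat", "sat", "mat", "ran"]
logits = [1.2, 3.5, 0.3, 2.0]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy choice: highest probability
print(next_token)  # → sat
```

The chosen token is appended to the output, and the whole pipeline runs again to pick the token after that, one step at a time.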

# Summary

This article provides a gentle, conceptual guide to the journey that textual information experiences as it flows through the distinctive model architecture behind LLMs: the transformer. After reading it, you will hopefully have a better understanding of what goes on inside models like the ones behind ChatGPT.

Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
