Research has shown that large language models (LLMs) tend to overemphasize information at the beginning and end of a document or conversation, while neglecting the middle.
This "position bias" means that if a lawyer uses an LLM-powered virtual assistant to retrieve a certain phrase in a 30-page affidavit, the LLM is more likely to find the right text when it is on the initial or final pages.
MIT researchers have discovered the mechanism behind this phenomenon.
They created a theoretical framework to study how information flows through the machine-learning architecture that forms the backbone of LLMs. They found that certain design choices which control how the model processes input data can cause position bias.
Their experiments revealed that model architectures, particularly those affecting how information is spread across input words within the model, can give rise to or intensify position bias, and that training data also contribute to the problem.
Beyond pinpointing the origins of position bias, their framework can be used to diagnose and correct it in future model designs.
This could lead to more reliable chatbots that stay on topic during long conversations, medical AI systems that reason more fairly when handling large amounts of patient data, and code assistants that pay closer attention to all parts of a program.
"These models are black boxes, so as an LLM user, you probably don't know that position bias can cause your model to be inconsistent. You just feed it your documents in whatever order you want and expect it to work. But by understanding the underlying mechanism of these black-box models better, we can improve them by addressing these limitations," says Xinyi Wu, a graduate student in the MIT Institute for Data, Systems, and Society (IDSS) and the Laboratory for Information and Decision Systems (LIDS), and first author of a paper on this research.
Her co-authors include Yifei Wang, an MIT postdoc; and senior authors Stefan Jegelka, a professor of electrical engineering and computer science (EECS) and a member of IDSS and the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Ali Jadbabaie, professor and head of the Department of Civil and Environmental Engineering, a core faculty member of IDSS, and a principal investigator in LIDS. The research will be presented at the International Conference on Machine Learning.
Analyzing attention
LLMs like Claude, Llama, and GPT-4 are powered by a type of neural network architecture known as a transformer. Transformers are designed to process sequential data, encoding a sentence into chunks called tokens and then learning the relationships between tokens to predict what words come next.
These models have gotten very good at this because of the attention mechanism, which uses interconnected layers of data-processing nodes to make sense of context by allowing tokens to selectively focus on, or attend to, related tokens.
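To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation the article describes: each token's output is a weighted mix of every token's value vector, with weights given by query-key similarity. The sizes and random inputs are illustrative only.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query token mixes the value
    vectors of all tokens, weighted by query-key similarity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (n, n) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8                                           # 5 tokens, 8-dim vectors
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out, w = attention(Q, K, V)
print(out.shape)          # (5, 8): one context-mixed vector per token
print(w.sum(axis=-1))     # each token's attention weights sum to 1
```

Here every token attends to every other token; the masking techniques discussed next restrict exactly this weight matrix.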
But if every token can attend to every other token in a 30-page document, that quickly becomes computationally intractable. So when engineers build transformer models, they often employ attention-masking techniques that limit the words a token can attend to.
For instance, a causal mask only allows a token to attend to the words that came before it.
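A causal mask can be sketched by setting the scores for all "future" positions to negative infinity before the softmax, so they receive zero weight. With uniform scores, the resulting weights make the effect easy to read off:

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask: token i may only attend to tokens j <= i."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal
    masked = np.where(future, -np.inf, scores)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))              # uniform similarities, for illustration
w = causal_attention_weights(scores)
print(np.round(w, 2))
# Row i spreads its weight uniformly over positions 0..i:
# token 0 attends only to itself, while token 3 attends to all four tokens.
```

Note the asymmetry this creates: the first token appears in every row's attention budget, while the last token appears only in its own, a seed of the position bias the researchers analyze.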
Engineers also use positional encodings to help the model understand the location of each word in a sentence, improving performance.
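The article does not say which positional-encoding scheme is used, so as one common example, here is the sinusoidal encoding from the original transformer paper: each position gets a unique pattern of sines and cosines at different frequencies, from which the model can recover order and relative distance.

```python
import numpy as np

def sinusoidal_positions(n_tokens, d_model):
    """Sinusoidal positional encodings: alternating sines and cosines
    at geometrically spaced frequencies, one row per position."""
    pos = np.arange(n_tokens)[:, None]            # (n, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # (1, d/2) frequency indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(50, 16)
# Nearby positions get similar encodings; distant ones diverge,
# which is one way a model can tell "close" from "far" words apart.
near = pe[10] @ pe[11]
far = pe[10] @ pe[40]
print(near > far)   # True
```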
The MIT researchers built a graph-based theoretical framework to explore how these modeling choices, attention masks and positional encodings, can affect position bias.
"Everything is coupled and tangled within the attention mechanism, so it is very hard to study. Graphs are a flexible language to describe the dependent relationships among words within the attention mechanism and trace them across multiple layers," says Wu.
Their theoretical analysis suggested that causal masking gives the model an inherent bias toward the beginning of the input, even when that bias doesn't exist in the data.
If earlier words are relatively unimportant to a sentence's meaning, causal masking can still cause the transformer to pay more attention to its beginning.
"While it is often true that earlier and later words in a sentence are more important, if an LLM is used on a task that is not natural language generation, like ranking or information retrieval, these biases can be extremely harmful," Wu says.
As models grow, with additional layers of the attention mechanism, this bias is amplified, because earlier parts of the input are used more and more frequently in the model's reasoning process.
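This compounding effect can be illustrated with a toy calculation (my simplification, not the authors' framework): treat one layer of uniform causal attention as a row-stochastic matrix, stack layers by taking matrix powers, and measure how much total attention each input position ends up receiving.

```python
import numpy as np

n = 8                                     # toy sequence length
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)         # uniform causal attention, one layer

def total_attention_mass(A, n_layers):
    """Column sums of A^L: how much total attention each input position
    receives after L stacked causal-attention layers."""
    M = np.linalg.matrix_power(A, n_layers)
    return M.sum(axis=0)

m1 = total_attention_mass(A, 1)
m4 = total_attention_mass(A, 4)
print(np.round(m1, 2))
print(np.round(m4, 2))
# The first position's share grows with depth: stacking causal layers
# compounds the head start that early tokens get from the mask.
```

Even in this idealized setting with no content signal at all, depth alone pushes attention mass toward the start of the sequence.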
They also found that using positional encodings to link words more strongly to nearby words can mitigate position bias. This technique refocuses the model's attention in the right place, but its effect can be diluted in models with more attention layers.
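One way to tie attention to nearby words, shown here purely as an illustration in the spirit of ALiBi-style relative encodings (the article does not specify the scheme), is to subtract a penalty from each attention score proportional to the distance between the two tokens:

```python
import numpy as np

def decay_attention(scores, slope=0.5):
    """Causal attention with a distance penalty: the score for attending
    from position i back to position j is reduced by slope * (i - j),
    which concentrates attention on nearby tokens."""
    n = scores.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    biased = scores - slope * (i - j)
    biased = np.where(j > i, -np.inf, biased)       # causal mask
    w = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

w = decay_attention(np.zeros((6, 6)))
print(w[-1].round(2))
# The last token now favors its recent neighbors over the first token,
# counteracting the head start that pure causal masking gives position 0.
```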
And these design choices are only one cause of position bias; some can come from the training data the model uses to learn how to prioritize words in a sequence.
"If you know your data are biased in a certain way, then you should also finetune your model on top of adjusting your modeling choices," Wu says.
Lost in the middle
After they'd established this theoretical framework, the researchers performed experiments in which they systematically varied the position of the correct answer in text sequences for an information-retrieval task.
The experiments showed a "lost-in-the-middle" phenomenon, where retrieval accuracy followed a U-shaped pattern. Models performed best when the right answer was located at the beginning of the sequence; performance declined the closer the answer got to the middle, before rebounding a bit when the correct answer was near the end.
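A probe of this kind can be scaffolded as follows. This is a hypothetical sketch, not the authors' evaluation code: `ask_model` is a placeholder to be replaced with a real LLM call, and the filler sentences stand in for distractor context.

```python
def build_context(fact, filler_sentences, position):
    """Insert the answer-bearing `fact` at the given index among filler."""
    docs = list(filler_sentences)
    docs.insert(position, fact)
    return " ".join(docs)

def accuracy_by_position(fact, answer, filler, ask_model, n_slots=5):
    """Score `ask_model` with the fact placed at evenly spaced depths.

    Returns (position, correct?) pairs; plotting accuracy against
    position is what reveals the U-shaped, lost-in-the-middle curve.
    """
    n = len(filler)
    slots = [round(p * n / (n_slots - 1)) for p in range(n_slots)]
    results = []
    for pos in slots:
        ctx = build_context(fact, filler, min(pos, n))
        results.append((pos, ask_model(ctx) == answer))
    return results

filler = [f"Filler sentence number {k}." for k in range(20)]
fact = "The access code is 7319."
ctx = build_context(fact, filler, 10)
print(fact in ctx)   # True: the fact sits mid-context at slot 10
```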
Ultimately, their work suggests that using a different masking technique, removing extra layers from the attention mechanism, or strategically employing positional encodings could reduce position bias and improve a model's accuracy.
"By doing a combination of theory and experiments, we were able to look at the consequences of model design choices that weren't clear at the time. If you want to use a model in high-stakes applications, you must know when it will work, when it won't, and why," Jadbabaie says.
In the future, the researchers want to further explore the effects of positional encodings and study how position bias could be strategically exploited in certain applications.
"These researchers offer a rare theoretical lens into the attention mechanism at the heart of the transformer model. They provide a compelling analysis that clarifies longstanding quirks in transformer behavior, showing that attention mechanisms, especially with causal masks, inherently bias models toward the beginning of sequences. The paper achieves the best of both worlds: mathematical rigor coupled with insights that reach into real-world systems."
This research is supported, in part, by the U.S. Office of Naval Research, the National Science Foundation, and an Alexander von Humboldt Professorship.
