Wednesday, May 14, 2025

DeepMind’s Michelangelo benchmark exposes the limitations of long-context LLMs

Large language models (LLMs) with very long context windows have been making headlines lately. The ability to fit hundreds of thousands or even millions of tokens into a single prompt opens up many possibilities for developers.

But how well do these long-context LLMs really understand and use the vast amounts of information they receive?

Researchers at Google DeepMind have introduced Michelangelo, a new benchmark designed to evaluate the long-context reasoning capabilities of LLMs. Their findings, published in a recent research paper, show that while current frontier models have made progress at retrieving information from large in-context data, they still struggle with tasks that require reasoning over the structure of that data.

The need for better long-context benchmarks

The emergence of LLMs with extremely long context windows, ranging from 128,000 to more than 1 million tokens, has prompted researchers to develop new benchmarks to evaluate their capabilities. However, most of the emphasis has been on retrieval tasks, such as the popular “needle-in-a-haystack” evaluation, in which the model must find a specific piece of information within a very large context.
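To make the retrieval setup concrete, here is a minimal sketch of how a needle-in-a-haystack test item might be constructed: a single “needle” sentence is buried in filler text and the model is asked to recover it. The filler, the needle, and the function names are illustrative assumptions, not the actual evaluation used by any lab.

```python
# A minimal, hypothetical needle-in-a-haystack item: hide one fact inside
# filler text and check whether the model can retrieve it.

FILLER = "The sky was a pale shade of grey that afternoon. "
NEEDLE = "The secret passcode is 7421. "

def build_haystack_prompt(num_filler_sentences: int, needle_position: float) -> str:
    """Bury the needle at a relative position (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * num_filler_sentences
    insert_at = int(needle_position * num_filler_sentences)
    sentences.insert(insert_at, NEEDLE)
    context = "".join(sentences)
    question = "\n\nWhat is the secret passcode? Answer with the number only."
    return context + question

def score_answer(model_answer: str) -> bool:
    """The item is passed if the model reproduces the hidden value."""
    return "7421" in model_answer

if __name__ == "__main__":
    prompt = build_haystack_prompt(num_filler_sentences=5000, needle_position=0.5)
    print(f"Prompt length: {len(prompt)} characters")
```

Note that a model can pass this kind of test by pattern-matching on the unusual sentence alone, which is exactly the limitation the DeepMind researchers point to.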

“Over time, models have become significantly more capable at long-context performance,” Kiran Vodrahalli, a researcher at Google DeepMind, told VentureBeat. “For example, the popular needle-in-a-haystack retrieval evaluation is now well saturated, even at extremely long context lengths. It has therefore become important to determine whether the harder tasks that models can solve at short context lengths are also solvable at long ranges.”

Retrieval tasks do not necessarily reflect a model’s ability to reason over its entire context. A model may be able to find a specific fact without understanding the relationships between different parts of the text. Meanwhile, existing benchmarks that assess a model’s ability to reason over long contexts have limitations.

“It is easy to develop long-context reasoning evaluations that can be solved with a combination of retrieval and information stored in the model’s weights, which short-circuits the test of the model’s ability to use the long context,” Vodrahalli said.

Michelangelo

To address the limitations of current benchmarks, the researchers introduced Michelangelo, “a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.”

Michelangelo is built on the analogy of a sculptor chiseling away irrelevant marble to reveal the underlying structure. The benchmark focuses on evaluating the model’s ability to understand the relationships and structure of the information in its context window, rather than simply retrieving isolated facts.

The benchmark consists of three basic tasks:

Latent list: The model must process a long sequence of operations performed on a Python list, filter out irrelevant or redundant statements, and determine the final state of the list (see the sketch after this list). “Latent list measures a model’s ability to track the properties of a latent data structure over the course of a stream of code instructions,” the researchers write.

Multi-round co-reference resolution (MRCR): The model must reproduce parts of a long conversation between a user and an LLM. This requires the model to understand the structure of the conversation and resolve references to previous turns, even when the conversation contains confusing or distracting elements. “MRCR measures the model’s ability to understand ordering in natural text, to distinguish between similar drafts of writing, and to recover a specified piece of previous context under difficult queries,” the researchers write.

“I don’t know” (IDK): The model is given a long story and asked to answer multiple-choice questions about it. For some questions, the context does not contain the answer, and the model must recognize the limits of its knowledge and respond with “I don’t know.” “IDK measures a model’s ability to understand whether it knows what it doesn’t know, based on the context it is presented with,” the researchers write.
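As a simplified sketch of the latent list idea, the hypothetical generator below emits a long stream of Python list operations, many of them irrelevant, and computes the ground-truth final state alongside. The function name, the noise strategy, and the question format are illustrative assumptions, not DeepMind’s actual implementation.

```python
import random

# A simplified, hypothetical generator for a latent-list-style test item.
# The model sees a long stream of list operations, many of which are
# irrelevant, and must report the final state of the tracked list.

def generate_latent_list_item(num_ops: int, seed: int = 0):
    random.seed(seed)
    tracked = []                  # ground-truth state of the latent list
    lines = ["data = []"]
    for _ in range(num_ops):
        op = random.choice(["append", "pop", "noise"])
        if op == "append":
            value = random.randint(0, 99)
            tracked.append(value)
            lines.append(f"data.append({value})")
        elif op == "pop" and tracked:
            tracked.pop()
            lines.append("data.pop()")
        else:
            # Irrelevant statements the model must learn to ignore.
            lines.append(f"unused_{random.randint(0, 9999)} = {random.randint(0, 99)}")
    prompt = "\n".join(lines) + "\n\n# What is the final value of `data`?"
    return prompt, tracked

prompt, answer = generate_latent_list_item(num_ops=200)
print("Expected answer:", answer)
```

Because the final answer depends on the cumulative effect of many scattered operations, a model cannot solve such an item by retrieving any single line from the context.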

Latent structure queries

The tasks in Michelangelo are built on a novel framework called Latent Structure Queries (LSQ). LSQ provides a general approach to designing long-context reasoning evaluations that can be extended to arbitrary lengths. It can also test a model’s understanding of latent information, as opposed to its ability to retrieve simple facts. LSQ relies on synthesizing test data to avoid the pitfalls of test data leaking into the training corpus.

“By requiring the model to extract information from structures rather than values from keys (sculptures from marble rather than needles from haystacks), we can more deeply test a language model’s understanding of context beyond retrieval,” the researchers write.

LSQ differs from other approaches to evaluating long-context LLMs in three key ways. First, it is explicitly designed to avoid shortcut solutions in evaluations that go beyond retrieval tasks. Second, it specifies a methodology for increasing task complexity and context length independently of each other. Finally, it is general enough to capture a broad range of reasoning tasks. The three tasks in Michelangelo cover code interpretation and reasoning over loosely written text.
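As a rough illustration of decoupling context length from task complexity, the hypothetical sketch below keeps a fixed task unchanged while padding the prompt with irrelevant filler up to a target size. The function name, the padding strategy, and the use of word counts as a stand-in for tokens are assumptions for illustration, not the LSQ framework’s actual mechanics.

```python
# A rough, hypothetical sketch of scaling context length independently of
# task complexity: the task text stays fixed while irrelevant filler pads
# the prompt to a target size. Word counts stand in for tokens here.

def pad_to_length(task_text: str, target_words: int, filler_sentence: str) -> str:
    """Pad a fixed-complexity task with filler until the prompt reaches
    roughly target_words words, leaving the task itself unchanged."""
    filler_words = filler_sentence.split()
    task_words = task_text.split()
    padding = []
    while len(task_words) + len(padding) < target_words:
        padding.extend(filler_words)
    # Place the filler first so the relevant material sits deep in the context.
    return " ".join(padding[: max(0, target_words - len(task_words))]) + "\n\n" + task_text

task = (
    "Track the list operations below and report the final list.\n"
    "data = []\n"
    "data.append(3)\n"
    "data.append(7)\n"
    "data.pop()"
)
for budget in (1_000, 10_000, 100_000):
    prompt = pad_to_length(task, budget, "Nothing of importance happened that day.")
    print(budget, "->", len(prompt.split()), "words")
```

The same fixed task can thus be evaluated at 1,000 words or 1 million, which lets the benchmark attribute any drop in accuracy to context length rather than to a harder underlying problem.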

“The hope is that long-context, beyond-retrieval reasoning evaluations implemented by following LSQ will lead to fewer scenarios in which a proposed evaluation reduces to solving a retrieval task,” Vodrahalli said.

Evaluating frontier models on Michelangelo

The researchers evaluated ten frontier LLMs on Michelangelo, including different variants of Gemini, GPT-4 and GPT-4o, and Claude. They tested the models on contexts of up to 1 million tokens. The Gemini models performed best on MRCR, the GPT models excelled on Latent List, and Claude 3.5 Sonnet achieved the highest scores on IDK.

However, all models showed a significant drop in performance as the complexity of the reasoning tasks increased, suggesting that even with very long context windows, current LLMs still have room to improve their ability to reason over large amounts of information.

Frontier LLMs struggle to reason over long context windows (source: arXiv)

“Frontier models have room to improve on all of the beyond-retrieval reasoning primitives (Latent List, MRCR, IDK) that we examine in Michelangelo,” Vodrahalli said. “Different frontier models have different strengths and weaknesses; each class performs well at different context ranges and on different tasks. What does seem to be universal across models is an initial drop in performance on tasks that require long-context reasoning.”

Michelangelo’s evaluations capture basic primitives needed for long-context reasoning, and the findings may have important implications for enterprise applications. For example, in real-world applications where the model cannot rely on its pretraining knowledge and must perform multi-hop reasoning over many distinct locations in very long contexts, Vodrahalli expects performance to decline as the context length grows.

“This is especially true if documents contain a lot of information that is unrelated to a specific task, making it difficult for the model to immediately distinguish what information is relevant and what is not,” Vodrahalli said. “It is also likely that models will continue to perform well on tasks where all the relevant information needed to answer a question is located in one general place in the document.”

The researchers will continue to add more evaluations to Michelangelo and hope to make them directly available so that other researchers can test their models on them.
