For a while now, companies such as OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. But a new study by six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.
The fragility highlighted in these new results supports previous research suggesting that LLMs’ use of probabilistic pattern matching lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”
Mix it up
In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” currently available as a preprint paper, the six Apple researchers start with a standardized set of more than 8,000 grade-school-level math word problems from GSM8K, which is often used as a benchmark for the complex reasoning capabilities of modern LLMs. They then take the novel approach of modifying part of that testing set to dynamically replace certain names and numbers with new values, so a question about Sophie getting 31 items for her nephew in GSM8K could become a question about Bill getting 19 items for his brother in the new GSM-Symbolic evaluation.
This approach helps avoid the potential “data contamination” that can result from static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the underlying mathematical reasoning at all, meaning that, in theory, the models should perform just as well when tested on GSM-Symbolic as they do on GSM8K.
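In outline, this kind of symbolic templating can be sketched in a few lines of Python. The template text, name pool, and number ranges below are illustrative assumptions, not the paper’s actual materials; the point is that surface details vary while the arithmetic stays fixed:

```python
import random

# Hypothetical symbolic template for a GSM8K-style word problem.
# Placeholders for the name, relation, and quantities are filled with
# randomly sampled values; the underlying arithmetic stays the same.
TEMPLATE = ("{name} buys {x} toys for her {relation}. "
            "She then buys {y} more. How many toys does {name} have?")

NAMES = ["Sophie", "Bill", "Maria"]
RELATIONS = ["nephew", "brother", "cousin"]

def make_variant(seed: int) -> tuple:
    """Return one randomized problem instance and its correct answer."""
    rng = random.Random(seed)
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=rng.choice(NAMES),
                               relation=rng.choice(RELATIONS),
                               x=x, y=y)
    return question, x + y  # the answer is independent of the names chosen

question, answer = make_variant(seed=0)
```

Because every variant is generated from the same template, a model that genuinely solves the problem should score identically across seeds; only a pattern-matcher keyed to surface details would not.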
Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops of between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate GSM-Symbolic runs with different names and values. Gaps of up to 15 percent in accuracy between the best and worst runs were common within a single model, and, for some reason, changing the numbers tended to hurt accuracy more than changing the names.
This kind of variance, both across different GSM-Symbolic runs and compared to the GSM8K results, is more than a little surprising since, as the researchers note, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”
Don’t get distracted
Still, the overall variance shown in the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a fairly high success rate on either benchmark, regardless of whether the model itself uses “formal” reasoning behind the scenes (though the overall accuracy of many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).
However, the tested LLMs performed significantly worse when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
Adding these red herrings led to what the researchers called “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to as much as 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits of using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.
