If you’ve ever tried using ChatGPT as a calculator, you’ve almost certainly noticed the following: The chatbot is bad at math. And it’s not unique among AI systems in that regard.
Anthropic’s Claude can’t solve basic word problems. Gemini doesn’t understand quadratic equations. And Meta’s Llama struggles with simple addition.
So how is it that these bots can write monologues and yet still get tripped up by grade-school arithmetic?
Tokenization has something to do with it. The process of dividing data into chunks (e.g., splitting the word “fantastic” into the syllables “fan,” “tas” and “tic”), tokenization helps AI densely encode information. But because tokenizers, the AI models that do the tokenizing, don’t actually know what numbers are, they often end up destroying the relationships between digits. For example, a tokenizer might treat the number “380” as one token but represent “381” as a pair of tokens (“38” and “1”).
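The inconsistency described above can be reproduced with a toy example. The sketch below is not how any production tokenizer works; it is a minimal greedy longest-match splitter over a hypothetical vocabulary (`TOY_VOCAB`, invented for illustration) that happens to contain “380” but not “381.”

```python
# Hypothetical vocabulary, chosen only for illustration. Real tokenizers
# learn tens of thousands of entries from data.
TOY_VOCAB = {"380", "38", "0", "1", "3", "8"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split `text` greedily, always taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("380"))  # ['380']
print(toy_tokenize("381"))  # ['38', '1']
```

Two numbers that differ by one end up with different shapes, so nothing in the token sequence tells the model that they are neighbors.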
But tokenization isn’t the only reason math is a weak spot for AI.
AI systems are statistical machines. Trained on many examples, they learn the patterns in those examples to make predictions (e.g., that the phrase “to whom” in an email often precedes the phrase “it may concern”). For instance, given the multiplication problem 57,897 x 12,832, ChatGPT, having seen many multiplication problems, will likely infer that the product of a number ending in “7” and a number ending in “2” will end in “4.” But it will struggle with the middle part. ChatGPT gave me the answer 742,021,104; the correct one is 742,934,304.
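The arithmetic in that example is easy to check. The short sketch below reads the operands as the integers 57,897 and 12,832 (the reading under which the stated correct product holds) and compares ChatGPT’s reported answer with the exact one, digit by digit. The variable names are illustrative only.

```python
a, b = 57897, 12832
exact = a * b
reported = 742021104  # the answer ChatGPT gave, per the article

# The last digit follows from the last digits alone: 7 * 2 = 14, so 4.
last_digit_from_pattern = (a % 10) * (b % 10) % 10

# Positions (counted from the left, starting at 0) where the answers differ.
mismatches = [
    i for i, (x, y) in enumerate(zip(str(exact), str(reported))) if x != y
]

print(exact)                    # 742934304
print(last_digit_from_pattern)  # 4
print(mismatches)               # [3, 4, 5, 6]
```

The first three digits and the last two match, while the four in the middle do not, which is consistent with a system that has learned surface patterns at the edges of the number without carrying out the computation.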
Yuntian Deng, an assistant professor at the University of Waterloo specializing in artificial intelligence, closely evaluated ChatGPT’s multiplication capabilities in a study conducted earlier this year. He and his co-authors found that the default model, GPT-4o, had difficulty multiplying two numbers with more than four digits each (e.g., 3,459 x 5,284).
“GPT-4o struggles with multi-digit multiplication, achieving less than 30% accuracy on four-digit-by-four-digit problems,” Deng told TechCrunch. “Multi-digit multiplication is challenging for language models because an error at any intermediate step can compound, leading to incorrect final results.”
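To see why a single intermediate slip matters, consider schoolbook multiplication, which sums one partial product per digit. The sketch below is only an illustration of that point, not the method used in Deng’s study: `multiply_with_slip` is a hypothetical helper that injects an error into one partial product, and `all_steps_correct` assumes, purely for the sake of argument, that every step succeeds independently with the same probability.

```python
def partial_products(a, b):
    """Schoolbook partial products of a * b, one per digit of b."""
    return [a * int(d) * 10**k for k, d in enumerate(reversed(str(b)))]


def multiply_with_slip(a, b, slip_step=None, slip=0):
    """Sum the partial products, optionally adding `slip` to one of them."""
    parts = partial_products(a, b)
    if slip_step is not None:
        parts[slip_step] += slip
    return sum(parts)


def all_steps_correct(p, n):
    """Chance that n steps are all right, assuming independent steps."""
    return p**n


a, b = 3459, 5284
print(multiply_with_slip(a, b) == a * b)                         # True
print(multiply_with_slip(a, b, slip_step=2, slip=100) == a * b)  # False
```

Under the independence assumption, accuracy falls off geometrically with the number of steps, so longer operands leave far more room for a wrong final answer even when each individual step is usually right.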
So will math skills elude ChatGPT forever? Or is there reason to believe that a bot will one day become as adept with numbers as humans (or a TI-84, for that matter)?
Deng is hopeful. In the study, he and his colleagues also tested o1, an OpenAI “reasoning” model that recently came to ChatGPT. The model, which “thinks through” problems step by step before answering them, performed significantly better than GPT-4o, getting nine-digit-by-nine-digit multiplication problems right about half the time.
“A model can solve a problem in a different way than how we solve it manually,” Deng said. “This makes us curious about the model’s internal approach and how it differs from human reasoning.”
Deng believes this progress indicates that at least some types of math problems, multiplication among them, will eventually be “fully solved” by ChatGPT-like systems. “It’s a well-defined task with known algorithms,” Deng said. “We are already seeing significant improvement from GPT-4o to o1, so it is clear that there is an improvement in reasoning ability.”
Just don’t get rid of your calculator any time soon.