Every year, countries participating in the International Mathematical Olympiad (IMO) arrive with a booklet containing their best, most original problems. These booklets are shared among delegations and then quietly disappear. No one had ever systematically collected them, cleaned them, and made them available, either to artificial intelligence researchers testing the limits of mathematical reasoning or to students around the world, many of whom prepare for these competitions largely on their own.
This is exactly what scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and HUMAIN have done.
MathNet is the largest high-quality dataset of proof-based math problems ever created. Featuring more than 30,000 expert-written problems and solutions spanning 47 countries, 17 languages, and 143 competitions, it is five times larger than the next-largest dataset of its kind. The work will be presented later this month at the International Conference on Learning Representations (ICLR) in Brazil.
What makes MathNet stand out is not only its size but also its breadth. Previous Olympiad-level datasets draw almost exclusively from competitions held in the United States and China. MathNet covers dozens of countries on six continents and 17 languages, features both text- and image-based problems and solutions, and spans four decades of competitive mathematics. The goal is to capture the full range of mathematical perspectives and problem-solving traditions in the global mathematics community, not just the most visible ones.
“Each country brings a booklet with its most innovative and creative problems,” says Shaden Alshammari, an MIT graduate student and lead author of the paper. “They share the booklets, but no one has bothered to collect them, clean them, and put them online.”
Building MathNet required tracking down 1,595 booklets in PDF form totaling more than 25,000 pages, including decades-old digital documents and scans in more than a dozen languages. Much of this archive came from an unlikely source: Navid Safaei, a longtime member of and contributor to the IMO community who had been hand-collecting and scanning these booklets since 2006. His personal archive formed much of the backbone of the dataset.
Provenance matters as much as scale. While most existing math datasets pull problems from community forums such as Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets. The solutions in these booklets are expert-written and peer-reviewed, often spanning multiple pages and discussing several approaches to the same problem. That depth gives AI models a much richer signal for learning mathematical reasoning than the shorter, informal solutions typical of crowdsourced datasets. It also makes the dataset genuinely useful for students: anyone preparing for the IMO or a national competition now has access to a centralized, searchable collection of high-quality problems and vetted solutions drawn from traditions around the world.
“I remember many students for whom it was a solitary effort. No one in their country prepared them for this type of competition,” says Alshammari, who herself competed in the IMO as a student. “We hope this gives them a centralized place with high-quality problems and solutions they can learn from.”
The team has deep roots in the IMO community. Co-author Sultan Albarakati currently serves on the IMO Executive Board, and the researchers are working to make the dataset available directly through the IMO Foundation. To validate the dataset, they assembled a review panel of more than 30 evaluators from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who together verified thousands of solutions.
“The MathNet database can be an excellent resource for both students and team leaders looking for new problems to work on or solutions to difficult questions,” says Tanish Patil, deputy leader of the Swiss IMO team. “While other archives of Olympiad problems exist (most notably the Contest Collections forums on AoPS), these resources lack the standardized formatting, verified solutions, and important problem metadata, such as topics and techniques, that MathNet provides. It will also be interesting to see how this dataset is used to improve the performance of reasoning models, and whether we will soon be able to reliably answer a key question when creating new Olympiad problems: determining whether a problem is truly original.”
MathNet also functions as a rigorous benchmark for AI performance, and the results paint a more complicated picture than recent headlines about AI math skills might suggest. Frontier models have made remarkable progress: some are reported to have achieved gold-medal performance at the IMO, and on standard benchmarks they now solve problems that would stump most people. But MathNet shows that progress is uneven. Even GPT-5, the best-performing model, averaged about 69.3 percent on MathNet’s main test set of 6,400 problems, failing almost one in three Olympiad-level problems. And when problems involve figures, performance drops significantly across the board, exposing visual reasoning as a consistent weak point in even the most capable models.
Several open-source models scored 0 percent on Mongolian-language problems, highlighting another dimension in which current AI systems fall short despite their overall strength.
“GPT models are equally good in English and other languages,” says Alshammari. “But many open-source models fail completely on less common languages like Mongolian.”
MathNet’s diversity also aims to address a deeper limitation in how AI models learn mathematics. When training data skews toward English- and Chinese-language problems, models absorb a narrow slice of mathematical culture. A Romanian combinatorics problem or a Brazilian number theory problem can approach the same underlying concept from an entirely different angle. The researchers argue that exposure to this range makes both humans and AI systems better mathematical thinkers.
Beyond problem-solving, MathNet introduces a retrieval benchmark that asks whether models can recognize when two problems share the same underlying mathematical structure. This is a capability with implications both for AI development and for the mathematics community itself. Near-duplicate problems have occasionally appeared on actual IMO exams over the years, because detecting mathematical equivalence across different notations, languages, and formats is genuinely difficult, even for expert human committees. Testing eight state-of-the-art embedding models, the researchers found that even the strongest identified the correct match only about 5 percent of the time on the first try, with the models often rating structurally unrelated problems as more similar than true equivalents.
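As a rough illustration of how such embedding-based matching works, here is a minimal sketch using the open-source sentence-transformers library; the model name, example problems, and ranking logic are illustrative assumptions, not the paper’s actual evaluation setup.

```python
# Minimal sketch of embedding-based problem matching (illustrative only;
# the model and example problems are assumptions, not the paper's setup).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text-embedding model

problems = [
    "Prove that for every positive integer n, 3 divides n^3 + 2n.",
    "Show that m^3 + 2m is a multiple of 3 for all positive integers m.",
    "Find all real x satisfying x^2 - 5x + 6 = 0.",
]

# Normalizing embeddings makes dot products equal cosine similarities.
emb = model.encode(problems, normalize_embeddings=True)
sims = emb @ emb.T

# Rank candidate matches for the first problem, excluding itself.
query = 0
for i in sorted((j for j in range(len(problems)) if j != query),
                key=lambda j: sims[query, j], reverse=True):
    print(f"cosine={sims[query, i]:.3f}  {problems[i]}")
```

Here the first two problems are mathematical equivalents in different notation; the benchmark’s finding is that real competition problems are far harder to match than this toy case suggests.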
The dataset also includes a retrieval-augmented generation benchmark, which tests whether showing a model a structurally similar solved problem before asking it to solve a new one improves performance. It does, but only when the retrieved problem is genuinely relevant. DeepSeek-V3.2-Speciale gained up to 12 percentage points with relevant retrievals, while irrelevant retrievals degraded performance roughly 22 percent of the time.
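In code, that setup might look something like the following sketch, where `embed` and `call_llm` are hypothetical stand-ins for an embedding model and a chat API, and the similarity threshold is an illustrative choice rather than a value from the paper.

```python
# Minimal sketch of retrieval-augmented problem-solving. `embed` should
# return a normalized vector and `call_llm` send a prompt to any chat
# model; both are hypothetical stand-ins, and min_sim is illustrative.

def build_rag_prompt(new_problem: str, retrieved: dict) -> str:
    """Prepend a retrieved, solved problem to the prompt."""
    return (
        "Here is a solved competition problem that may use a related technique:\n\n"
        f"Problem: {retrieved['problem']}\n"
        f"Solution: {retrieved['solution']}\n\n"
        "Now write a complete proof for the following problem:\n\n"
        f"Problem: {new_problem}"
    )

def solve_with_retrieval(new_problem, corpus, embed, call_llm, min_sim=0.7):
    """Retrieve the most similar solved problem from `corpus` (a list of
    {"problem": ..., "solution": ...} dicts) and add it to the prompt."""
    q = embed(new_problem)
    sim, best = max(
        ((float(q @ embed(item["problem"])), item) for item in corpus),
        key=lambda pair: pair[0],
    )
    if sim >= min_sim:
        prompt = build_rag_prompt(new_problem, best)
    else:
        # Fall back to a plain prompt when nothing is similar enough.
        prompt = f"Write a complete proof for the following problem:\n\n{new_problem}"
    return call_llm(prompt)
```

The fallback to a plain prompt reflects the benchmark’s finding that irrelevant retrieved context can hurt performance rather than help it.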
Alshammari wrote the paper with Safaei, HUMAIN AI engineer Abrar Zainal, KAUST Academy Director Sultan Albarakati, and MIT CSAIL colleagues: graduate student Kevin Wen SB ’25; Microsoft Principal Engineering Manager Mark Hamilton SM ’22, PhD ’25; and Professors William Freeman and Antonio Torralba. Their work was funded in part by a Schwarzman College of Computing Fellowship and the National Science Foundation.
MathNet is publicly available at mathnet.csail.mit.edu.
