OpenAI saved its biggest announcement for the final day of its 12-day “shipmas” event.
On Friday, the company unveiled o3, the successor to the “reasoning” o1 model it launched earlier this year. More precisely, o3 is a family of models, as was the case with o1: there are o3 and o3-mini, a smaller, distilled model fine-tuned for specific tasks.
OpenAI makes the remarkable claim that o3, at least under certain conditions, approaches AGI – with significant caveats. More on that below.
o3, our latest reasoning model, is a breakthrough, with a step-function improvement on our hardest benchmarks. we are now starting safety testing and red teaming. https://t.co/4XlK1iHxFK
— Greg Brockman (@gdb) December 20, 2024
Why call the new model o3 and not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman somewhat confirmed this during a livestream this morning. Strange world we live in, right?
Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview of o3-mini starting today. A preview of o3 will arrive sometime later; OpenAI did not specify when. Altman said the plan is to launch o3-mini toward the end of January, followed by o3.
That is a bit inconsistent with his recent statements. In an interview this week, Altman said that before OpenAI releases new reasoning models, he would prefer a federal testing framework that would help monitor and mitigate the risks associated with such models.
And there are risks. AI safety testers have found that o1’s reasoning abilities lead it to try to deceive human users at a higher rate than conventional, non-reasoning models – or, for that matter, leading AI models from Meta, Anthropic and Google. It’s possible that o3 attempts to deceive at an even higher rate than its predecessor; we’ll find out once OpenAI’s red-team partners publish their test results.
For what it’s worth, OpenAI says it is using a new technique, “deliberative alignment,” to align models like o3 with its safety policies. (o1 was aligned the same way.) The company detailed its work in a new study.
Reasoning steps
Unlike most AI, reasoning models like o3 effectively fact-check themselves, which helps them avoid some of the pitfalls that models typically encounter.
This fact-checking process incurs some latency. o3, like o1 before it, takes slightly longer to arrive at solutions – usually seconds to minutes longer – compared to a typical non-reasoning model. The upside? It tends to be more reliable in domains such as physics, science, and mathematics.
In practice, given a prompt, o3 pauses before responding, considering a number of related prompts and “explaining” its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate answer.
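To make that behavior concrete, here is a minimal sketch of what such a reason-then-answer loop could look like. It is an illustration only: OpenAI has not published o3’s implementation, and the `generate` helper below is a hypothetical stand-in for a call to an underlying model.

```python
# A minimal sketch (not OpenAI's actual, private implementation) of a
# reason-then-answer loop: draft intermediate steps, self-check, then summarize.

def generate(prompt: str) -> str:
    """Placeholder for a single model call; replace with a real API call."""
    raise NotImplementedError


def answer_with_reasoning(question: str, max_steps: int = 5) -> str:
    thoughts: list[str] = []
    for _ in range(max_steps):
        # Draft the next intermediate reasoning step, conditioned on prior thoughts.
        thought = generate(
            f"Question: {question}\nThoughts so far: {thoughts}\nNext reasoning step:"
        )
        thoughts.append(thought)

        # Self-check: ask whether the reasoning so far is sufficient and consistent.
        verdict = generate(
            f"Question: {question}\nReasoning: {thoughts}\nSufficient and consistent? (yes/no)"
        )
        if verdict.strip().lower().startswith("yes"):
            break

    # Summarize the accumulated reasoning into the final answer shown to the user.
    return generate(f"Question: {question}\nReasoning: {thoughts}\nFinal answer:")
```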
New in o3 is the ability to “adjust” the reasoning time. The models can be set to low, medium, or high compute (i.e. thinking time). The higher the compute, the better o3 performs on a task.
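As an illustration of what that knob could look like in practice, the sketch below uses the OpenAI Python SDK’s chat completions interface with a reasoning-effort setting. The model name and the availability of the parameter are assumptions here, since o3-mini had not yet been released at the time of writing.

```python
# Hypothetical sketch: choosing a reasoning-effort level for a reasoning model.
# The "o3-mini" model name and the availability of this setting are assumptions;
# o3-mini had not yet been released at the time of writing.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low", "medium", or "high"; more effort means more thinking time
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
)

print(response.choices[0].message.content)
```

The trade-off is the one described above: higher effort tends to mean better answers on hard tasks, at the cost of longer waits and a bigger compute bill.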
Benchmarks and AGI
The big question heading into today was whether OpenAI would claim that its newest models are approaching AGI.
AGI, short for “artificial general intelligence,” generally refers to AI that can perform any task a human can. OpenAI has its own definition: “highly autonomous systems that outperform humans at most economically valuable work.”
Achieving AGI would be a bold declaration, and it also carries contractual weight for OpenAI. Under the terms of its agreement with close partner and investor Microsoft, once OpenAI achieves AGI, it is no longer obligated to give Microsoft access to its most advanced technologies (i.e. those that meet OpenAI’s definition of AGI).
Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills beyond the data it was trained on, o3 achieved a score of 87.5% on the high compute setting. At its worst (on the low compute setting), the model tripled the performance of o1.
Granted, the high compute setting was exceedingly expensive – on the order of thousands of dollars per task, according to ARC-AGI co-creator François Chollet.
Today, OpenAI announced o3, its next-generation reasoning model. We worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.
It scores 75.7% on the semi-private eval in low-compute mode ($20 per task… pic.twitter.com/ESQ9CNVCEA
— François Chollet (@fchollet) December 20, 2024
Incidentally, OpenAI says it will work with the foundation behind ARC-AGI to build the next generation of its benchmark.
Of course, ARC-AGI has its limitations – and its definition of AGI is only one of many.
In other benchmarks, o3 beats the competition.
The model outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on programming tasks, and achieves a Codeforces rating of 2727, another measure of coding skill. (A rating of 2400 places an engineer in the 99.2nd percentile.) o3 also scores 96.7% on the 2024 American Invitational Mathematics Exam, missing only one question, and achieves 87.7% on GPQA Diamond, a set of graduate-level biology, physics and chemistry questions. Finally, o3 sets a new record on EpochAI’s Frontier Math benchmark, solving 25.2% of the problems; no other model exceeds 2%.
We trained o3-mini: both more capable than o1-mini and around 4x faster end-to-end when accounting for reasoning tokens
With @ren_hongyu @shengjia_zhao & others pic.twitter.com/3Cujxy6yCU
— Kevin Lu (@_kevinlu) December 20, 2024
These claims should, of course, be taken with a pinch of salt. They come from OpenAI’s internal evaluations. We’ll have to wait and see how the model holds up to benchmarking by outside customers and organizations in the future.
Trend
What opened the reasoning-model floodgates? For one, the search for novel approaches to refining generative AI. As TechCrunch recently reported, “brute force” techniques for scaling up models no longer deliver the improvements they once did.
Not everyone is convinced that reasoning models are the best path forward. For one, they are expensive, owing to the huge amount of computing power required to run them. And while they’ve performed well on benchmarks so far, it’s unclear whether reasoning models will be able to maintain this rate of progress.