Tuesday, December 24, 2024

OpenAI’s o3 suggests AI models are scaling in new ways – but so are the costs


Last month, AI developers and investors told TechCrunch that we are now in a “second era of scaling laws,” noting that the established methods for improving AI models are showing diminishing returns. One promising new method they said could keep the gains coming is “test-time scaling,” which seems to be behind the performance of OpenAI’s o3 model – but it comes with drawbacks of its own.

Much of the AI world greeted the announcement of OpenAI’s o3 model as evidence that progress in AI scaling has not “hit a wall.” The o3 model performs well on benchmarks, significantly outperforming all other models on a test of general ability called ARC-AGI and scoring 25% on a difficult math test on which no other AI model scored higher than 2%.

Of course, we at TechCrunch are taking all of this with a grain of salt until we can test o3 ourselves (so far, very few outside OpenAI have tried it). But even before o3’s debut, the AI world was already convinced that something big had changed.

“We have every reason to believe this trajectory will continue,” OpenAI researcher Noam Brown said in a tweet.

Anthropic co-founder Jack Clark said in a blog post on Monday that o3 is evidence that AI “will progress faster in 2025 than in 2024.” (Keep in mind that Anthropic stands to benefit – particularly in its ability to raise capital – if AI scaling laws appear to hold, even as Clark compliments a competitor.)

Over the next year, Clark says, the AI world will combine test-time scaling with conventional pre-training scaling methods to wring even greater gains out of AI models. He may be suggesting that Anthropic and other AI model providers will release reasoning models of their own in 2025, just as Google did last week.

Test-time scaling means that OpenAI is using more compute during ChatGPT’s inference phase, the period after you press enter on a prompt. It’s unclear exactly what is happening behind the scenes: OpenAI is either using more computer chips to answer a user’s question, running more powerful inference chips, or running those chips for longer stretches – 10 to 15 minutes in some cases – before the AI produces an answer. We don’t know all the details of how o3 was built, but these benchmarks are early signs that test-time scaling may work to improve the performance of AI models.
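OpenAI hasn’t disclosed how o3 spends that extra inference compute, but a minimal sketch of one well-known test-time technique – best-of-n sampling, where a model generates many candidate answers and a scorer picks the best one – illustrates the basic trade-off. Everything below (`generate`, `score`, the sample budget) is a hypothetical stand-in, not OpenAI’s actual method:

```python
# A toy sketch of test-time scaling via best-of-n sampling. This is NOT how
# o3 works internally (OpenAI hasn't said); it only shows the trade-off:
# spending more inference compute per prompt can buy a better answer.

import random

def generate(prompt: str, seed: int) -> str:
    """Stand-in for one sampled reasoning chain from a model."""
    random.seed(hash((prompt, seed)))
    return f"candidate #{seed} (quality={random.random():.2f})"

def score(answer: str) -> float:
    """Stand-in for a verifier that estimates answer quality."""
    return float(answer.split("quality=")[1].rstrip(")"))

def answer_with_budget(prompt: str, n_samples: int) -> str:
    """More samples means more inference compute and, usually, a better pick."""
    candidates = [generate(prompt, seed) for seed in range(n_samples)]
    return max(candidates, key=score)

# A cheap query spends little compute; a hard one can spend 100x more.
print(answer_with_budget("easy question", n_samples=1))
print(answer_with_budget("hard question", n_samples=100))
```

The same per-query dial could instead be turned by letting the model reason for longer before answering; either way, the cost of a single response stops being fixed.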

While o3 may restore some people’s faith in the progress of AI scaling laws, OpenAI’s newest model also uses an unprecedented amount of compute, which means a higher price per answer.

“Perhaps the only important caveat here is understanding that one reason why o3 is so much better is that it costs more money to run at inference time – the ability to apply compute at test time means that on some problems you can turn compute into a better answer,” Clark writes in his blog post. “This is interesting because it has made the costs of running AI systems somewhat less predictable – previously, you could work out how much it cost to serve a generative model just by looking at the model and the cost to generate a given output.”
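Clark’s predictability point can be made concrete with a toy cost model. In the pre-reasoning era, serving cost was roughly a simple function of the visible input and output tokens; with a reasoning model, a hidden and highly variable number of reasoning tokens joins the bill. The per-token price below is invented purely for illustration:

```python
# A toy illustration of Clark's point, with a made-up flat token price.

PRICE_PER_1K_TOKENS = 0.01  # hypothetical rate, in dollars

def classic_cost(input_tokens: int, output_tokens: int) -> float:
    """Pre-reasoning era: cost is a predictable function of visible tokens."""
    return (input_tokens + output_tokens) / 1000 * PRICE_PER_1K_TOKENS

def reasoning_cost(input_tokens: int, output_tokens: int,
                   reasoning_tokens: int) -> float:
    """Reasoning era: hidden reasoning tokens vary per prompt, so two
    identical-looking answers can differ in cost by orders of magnitude."""
    return (input_tokens + output_tokens + reasoning_tokens) / 1000 * PRICE_PER_1K_TOKENS

# Two prompts with the same visible size...
print(classic_cost(500, 200))               # always $0.007
print(reasoning_cost(500, 200, 1_000))      # easy problem: $0.017
print(reasoning_cost(500, 200, 2_000_000))  # hard problem: ~$20, ~3,000x more
```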

Clark and others pointed to o3’s performance on the ARC-AGI benchmark – a challenging test designed to evaluate breakthroughs toward AGI – as an indicator of its progress. It’s worth noting that passing this test, according to its creators, does not mean an AI model has achieved AGI; rather, it’s one way of measuring progress toward that nebulous goal. That said, the o3 model blew past the scores of all previous AI models that attempted the test, scoring 88% in one of its attempts. OpenAI’s next-best AI model, o1, scored just 32%.

A chart showing the performance of OpenAI’s o-series models on the ARC-AGI test. Image credits: ARC Prize

However, the logarithmic x-axis on this chart may be alarming to some. The high-scoring version of o3 used more than $1,000 worth of compute for every task. The o1 models used around $5 of compute per task, and o1-mini used just a few cents.

François Chollet, the creator of the ARC-AGI benchmark, wrote in a blog post that OpenAI used roughly 170 times more compute to generate that 88% score, compared with a high-efficiency version of o3 that scored just 12 percentage points lower. The high-scoring version of o3 used more than $10,000 in resources to complete the test, which makes it too expensive to compete for the ARC Prize – an as-yet-unbeaten competition for AI models to beat the ARC test.
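A rough back-of-the-envelope calculation, using only the approximate figures reported above, shows how steep that cost curve is:

```python
# Back-of-the-envelope math using the approximate figures cited above.

high_score = 88.0        # percent, high-compute o3
low_score = 88.0 - 12.0  # the high-efficiency config scored ~12 points lower
compute_ratio = 170      # high-compute config used ~170x more compute
o1_cost_per_task = 5.0          # dollars, per the article
o3_high_cost_per_task = 1000.0  # "over $1,000" per task

points_gained = high_score - low_score
print(f"~{compute_ratio / points_gained:.0f}x more compute per extra point")
# -> roughly 14x more compute for each additional percentage point

print(f"~{o3_high_cost_per_task / o1_cost_per_task:.0f}x o1's per-task cost")
# -> o3's high-compute config ran at roughly 200x o1's per-task cost
```

By these reported numbers, each additional percentage point on ARC-AGI cost roughly 14 times more compute – exactly the kind of diminishing return the pricing discussion below is about.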

Chollet, however, argues that o3 was still a breakthrough for AI models.

“o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain,” Chollet said in the blog post. “Of course, such generality comes at a steep cost, and wouldn’t quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy.”

It’s too early to put exact prices on all of this – we’ve seen AI model prices plummet in the last year, and OpenAI has yet to announce how much o3 will actually cost. But these figures indicate just how much compute is needed to break, even slightly, past the performance barriers set by today’s leading AI models.

This raises some questions. What is o3 even for? And how much more compute will be necessary to eke out more gains around inference with o4, o5, or whatever else OpenAI names its next reasoning models?

It doesn’t seem like o3, or its successors, would be anyone’s “daily driver” the way GPT-4o or Google Search might be. These models simply use too much compute to answer small questions throughout your day, such as, “How can the Cleveland Browns still make the 2024 playoffs?”

Instead, it seems that AI models with scaled test-time compute may only be good for big-picture prompts such as, “How can the Cleveland Browns become a Super Bowl franchise in 2027?” Even then, it may only be worth the high compute costs if you’re the general manager of the Cleveland Browns and you’re using these tools to make big decisions.

Institutions with deep pockets may be the only ones that can afford o3, at least to start, as Wharton professor Ethan Mollick noted in a tweet.

We’ve already seen OpenAI release a $200-a-month tier to use a high-compute version of o1, but the startup has reportedly weighed creating subscription plans costing up to $2,000. When you see how much compute o3 uses, you can see why OpenAI would consider it.

However, there are drawbacks to using o3 for high-impact work. As Chollet notes, o3 is not AGI, and it still fails at some very easy tasks that a human would handle with ease.

This isn’t necessarily surprising, as large language models still have a huge hallucination problem, one that o3 and test-time compute don’t seem to have solved. That’s why ChatGPT and Gemini include disclaimers under every answer they produce, asking users not to take those answers at face value. Presumably AGI, if it’s ever achieved, would not need such a disclaimer.

One way to unlock more gains from test-time scaling could be better AI inference chips. There’s no shortage of startups tackling exactly this problem, such as Groq and Cerebras, while others are designing more cost-efficient AI chips, such as MatX. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch that he expects these startups to play a bigger role in test-time scaling going forward.

While o3 represents a notable improvement in the performance of AI models, it raises new questions around usage and cost. That said, o3’s performance does add credence to the claim that test-time compute is the tech industry’s next best way to scale AI models.
