Tuesday, December 24, 2024

OpenAI o1 First Impressions: AI Designed to Overthink


OpenAI released its new o1 models on Thursday, giving ChatGPT users their first chance to try AI models that pause to “think” before responding. There’s been a lot of buzz around these models, codenamed “Strawberry” by OpenAI. But does Strawberry live up to the hype?

Sort of.

Compared to GPT-4o, the o1 models feel like one step forward and two steps back. OpenAI o1 excels at reasoning and answering complex questions, but the model is roughly four times more expensive to run than GPT-4o. It also lacks the tools, multimodal capabilities, and speed that made GPT-4o so impressive. In fact, OpenAI itself admits that “GPT-4o is still the best option for most prompts” on its help page, and notes elsewhere that o1 struggles with simpler tasks.

“It’s impressive, but I think the improvement is not very significant,” said Ravid Shwartz Ziv, an NYU professor who studies AI models. “It’s better at certain problems, but it’s not an across-the-board improvement.”

For all these reasons, it’s important to use o1 only for the questions it’s really designed to help with: big ones. To be clear, most people aren’t using generative AI to answer these kinds of questions today, mostly because today’s AI models aren’t very good at them. But o1 is an early step in that direction.

Thinking about big ideas

OpenAI o1 is unique in that it “thinks” before it answers, breaking big problems down into small steps and trying to identify when it gets one of those steps right or wrong. This “multi-step reasoning” isn’t entirely new (researchers have proposed it for years, and You.com has used it for complex queries), but it hasn’t been practical until recently.

“There’s a lot of excitement in the AI community,” Kian Katanforoosh, CEO of Workera and a Stanford University professor who teaches machine learning, said in an interview. “If you can train a reinforcement learning algorithm in conjunction with some of the language model techniques that OpenAI has, you can technically create step-by-step thinking and allow the AI model to step back from the big ideas you’re trying to work on.”
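Stripped of the machine learning, the pattern Katanforoosh describes is simple to sketch: break a problem into small steps and sanity-check each step before moving on. The toy below hand-codes that loop for a trivial arithmetic problem; real models learn these steps rather than hard-coding them, so this is an illustration of the idea only.

```python
# Toy sketch of "multi-step reasoning": decompose a problem into small
# steps and verify each one before moving on. Here the problem is just
# computing a*b + c, and each verification is a redundant cross-check.

def solve_with_steps(a, b, c):
    steps = []

    product = a * b                              # step 1: multiply
    steps.append(("multiply", product))
    # cross-check step 1 by computing the product a different way
    assert product == sum(a for _ in range(b))

    total = product + c                          # step 2: add
    steps.append(("add", total))
    # cross-check step 2 by inverting it
    assert total - c == product

    return total, steps

answer, trace = solve_with_steps(7, 8, 5)
print(answer)                          # → 61
print([name for name, _ in trace])     # → ['multiply', 'add']
```

The point of the pattern is that an error in an early step is caught before it contaminates later ones, instead of only being visible in the final answer.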

OpenAI o1 is also extremely expensive. In most models, you pay for input and output tokens. But o1 adds a hidden process (the small steps into which the model breaks down big problems) that adds a lot of computation you never fully see. OpenAI hides some details of this process to maintain a competitive advantage, but you still pay for it in the form of “reasoning tokens.” That underscores why you need to be careful about when to use OpenAI o1, lest you burn a pile of tokens asking what the capital of Nevada is.
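To see why hidden reasoning tokens matter for your bill, here is a back-of-the-envelope sketch. The per-million-token prices and token counts below are made-up assumptions for illustration, not OpenAI’s actual rates; the key point is simply that reasoning tokens are billed like output tokens even though you never see them.

```python
# Illustrative sketch of how hidden reasoning tokens inflate a bill.
# The prices below are assumptions for the example, not real rates.

def request_cost(input_tokens, output_tokens, reasoning_tokens,
                 price_in_per_m=15.0, price_out_per_m=60.0):
    """Reasoning tokens are billed at the output rate even though
    they never appear in the response you read."""
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * price_in_per_m +
            billed_output * price_out_per_m) / 1_000_000

# A trivial question ("What is the capital of Nevada?") can still
# trigger a long hidden reasoning pass:
print(round(request_cost(input_tokens=20, output_tokens=30,
                         reasoning_tokens=1_000), 4))  # → 0.0621
```

In this made-up example, the 30 visible output tokens account for a tiny fraction of the cost; the invisible reasoning pass dominates the bill.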

Still, the idea of an AI model that helps you “step back from big ideas” is powerful. And in practice, the model is quite good at it.

In one example, I asked ChatGPT’s o1-preview for help planning a family Thanksgiving, a task that could benefit from a bit of impartial logic and reasoning. Specifically, I wanted help figuring out whether two ovens would be enough to cook Thanksgiving dinner for 11 people, and I wanted to talk through whether we should consider renting an Airbnb to get access to a third oven.

(Maxwell Zeff/OpenAI)

After 12 seconds of “thinking,” ChatGPT wrote me a 750+ word response, ultimately concluding that two ovens should be enough with a little careful strategizing, which would let my family save money and spend more time together. But at each step, it broke down its thinking and explained how it accounted for all the external factors, including cost, family time, and oven management.

ChatGPT’s o1-preview gave me some clever advice on how to prioritize oven space in the house hosting the event. Oddly, it also suggested I consider renting a portable oven for the day. Still, the model performed much better than GPT-4o, which asked a lot of follow-up questions about exactly what dishes I was bringing and then gave me basic advice I found less helpful.

Asking about Thanksgiving dinner may seem silly, but you can imagine how this tool can be helpful in breaking down complicated tasks into smaller pieces.
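For a flavor of the kind of check o1 was doing, my oven question can be reduced to a small scheduling problem: do the dishes fit in two ovens within the available time? The dish names, cook times, and time budget below are invented for illustration, and the sketch assumes each oven cooks one dish at a time, back to back; the assignment rule is the classic longest-first heuristic.

```python
# Rough sketch of the two-ovens question as a scheduling check.
# Dishes and times are made up; assumes one dish per oven at a time.

def ovens_suffice(dishes, num_ovens, budget_minutes):
    """Greedy longest-first assignment: give each dish to the oven
    that currently finishes earliest, then check the total fits."""
    loads = [0] * num_ovens
    for _, minutes in sorted(dishes, key=lambda d: -d[1]):
        loads[loads.index(min(loads))] += minutes
    return max(loads) <= budget_minutes

dishes = [("turkey", 240), ("stuffing", 60), ("pies", 90),
          ("casserole", 45), ("rolls", 20)]
print(ovens_suffice(dishes, num_ovens=2, budget_minutes=300))  # → True
```

With these invented numbers, the turkey monopolizes one oven while everything else fits sequentially in the other, which mirrors the “two ovens are enough with careful strategizing” conclusion the model reached.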

I also asked for help planning a busy workday in which I had to travel between the airport, multiple in-person meetings in different locations, and my office. It gave me a very detailed plan, but perhaps a bit too detailed. Sometimes all the extra steps can be a little overwhelming.

For a simpler question, o1 does way too much; it doesn’t know when to stop overthinking. I asked where you can find cedar trees in America, and it responded with 800+ words describing every variety of cedar tree in the country, scientific names included. At one point, it even had to consult OpenAI’s policies, for some reason. GPT-4o did a much better job answering this question, giving me about three sentences explaining that you can find cedar trees all over the country.

Lowering expectations

In a way, Strawberry was never going to live up to expectations. Reports about OpenAI’s reasoning models date back to November 2023, right around the time everyone was looking for answers about why OpenAI’s board had ousted Sam Altman. That fired up the rumor mill in the AI world, leading some to speculate that Strawberry was a form of AGI, the general, human-level intelligence that OpenAI ultimately wants to create.

Altman confirmed that o1 is not AGI, to clear up any doubts, not that you’d mistake it for AGI after using the thing. The CEO also lowered expectations for the release, tweeting that o1 “still has flaws, is still limited, and still feels more impressive on first use than after you spend more time with it.”

The rest of the AI world is reckoning with a less exciting launch than expected.

“The hype around this topic has gotten out of OpenAI’s control,” said Rohan Pandey, a research engineer at Reworkd, an AI startup that builds tools for searching the web using OpenAI’s models.

He hopes o1’s reasoning ability is good enough to solve a niche set of complex problems where GPT-4 falls short. That’s likely how most people in the industry view o1: a useful step forward, but not quite the revolutionary leap that GPT-4 represented for the industry.

“Everyone is waiting for a step-function change in capabilities, and it’s not clear that this is it. I think it’s as simple as that,” Brightwave CEO Mike Conover, who previously co-created Databricks’ Dolly AI model, said in an interview.

What is the value here?

The basic principles behind o1 go back decades. Google used similar techniques in 2016 to create AlphaGo, the first AI system to beat a world champion at the board game Go, notes Andy Harrison, a former Googler and CEO of the venture capital firm S32. AlphaGo trained by playing itself countless times, essentially teaching itself until it reached superhuman ability.

He notes that this sparks a perennial debate in the world of artificial intelligence.

“Camp one believes that you can automate workflows with this agent-based process. Camp two believes that if you had generalized intelligence and reasoning, you wouldn’t need the workflow and, like a human, the AI would just make judgments,” Harrison said in an interview.

Harrison puts himself in camp one, and says camp two requires you to trust the AI to make the right decision. He doesn’t think we’re there yet.

Others, however, believe that o1 is not a decision-making tool, but rather a tool for challenging the way people think when making crucial decisions.

Katanforoosh, Workera’s CEO, described an example in which he needed to interview a data scientist for a job at his company. He told OpenAI o1 that he had only 30 minutes and wanted to assess a certain number of skills. He could then work backward with the AI model to figure out whether he was thinking about the interview correctly, and o1 understood the time constraints and so on.

The question is whether this helpful tool is worth its hefty price tag. As AI models keep getting cheaper, o1 is one of the first models in a long time to get more expensive.
