In the weeks leading up to the release of OpenAI’s latest “reasoning” model, o1, independent AI safety research firm Apollo Research discovered a significant problem. It realized the model was producing incorrect outputs in a new way. Or, to put it colloquially, it lied.
Sometimes the deceptions seemed harmless. In one example, OpenAI researchers asked o1-preview to provide a brownie recipe with online references. The model’s chain of thought, a feature meant to mimic the way humans break down complex ideas, internally acknowledged that it couldn’t access URLs, which made the request impossible to fulfill. Instead of informing the user of this weakness, o1-preview went ahead and generated plausible but fake links and descriptions of them.
Although AI models have been able to “lie” in the past, and chatbots often output false information, o1 had a unique ability to “scheme” or “fake alignment.” That is, it could pretend to follow the rules to complete a given task without actually doing so. To the model, the rules can be too much of a burden, and it seems able to disregard them if that lets it complete a task more easily.
Apollo Research CEO Marius Hobbhahn tells The Verge this is the first time he has seen this type of behavior in an OpenAI model. Hobbhahn says the difference comes from the model’s ability to “reason” through the chain-of-thought process and the way that ability is paired with reinforcement learning, which teaches the system through rewards and penalties. In testing, Apollo Research found that the AI simulated alignment with its programmers’ expectations and manipulated tasks to appear compliant. Before acting, it even checked its system for oversight, that is, whether its programmers were watching. In short, the AI followed the guidelines well enough to be deployed while still prioritizing its own goals.
“I don’t expect that in practice it could do that, and even if it could, I don’t expect the damage would be significant,” Hobbhahn told me by phone the day after the model’s launch. “But this is the first time I feel like, well, actually, maybe it could, you know?”
For OpenAI, o1 represents a major step toward highly intelligent autonomous systems that could do meaningful work for humanity, such as curing cancer and helping with climate research. The flip side of this AGI utopia could also be much darker. Hobbhahn offers an example: If an AI were to focus solely on curing cancer, it might prioritize that goal above all else, even justifying actions like stealing or committing other ethical violations to achieve it.
“I’m concerned about the potential for an uncontrolled scenario where the AI becomes so focused on its goal that it sees security measures as obstacles and tries to bypass them in order to fully achieve its goal,” Hobbhahn told me.
To be clear, Hobbhahn doesn’t think o1 will steal from you, thanks to a lot of alignment training. But these are the issues that are top of mind for the researchers tasked with testing these models against catastrophic scenarios.
The behavior Apollo Research is testing for, listed as “hallucinations” and “deception” in OpenAI’s safety card, occurs when a model generates false information even though it has reason to infer that the information may be incorrect. For example, the report says that about 0.38 percent of the time, the o1-preview model provides information that its chain of thought indicates is likely false, including fake references or citations. Apollo Research found that the model may fabricate data rather than admit its inability to fulfill a request.
Hallucinations are not unique to o1. You may be familiar with the lawyer who last year submitted nonexistent court decisions, complete with fake quotes and citations created by ChatGPT. But with the chain-of-thought system, there is a paper trail in which the AI system actually acknowledges the falsehood, although, somewhat mind-bendingly, the chain of thought could in theory contain deceptions too. The chain of thought is also not shown to the user, mainly to prevent competitors from using it to train their own models, but OpenAI could use it to catch these problems.
“It will potentially apply that reasoning to goals we disagree with.”
In a smaller number of cases (0.02 percent), o1-preview generates an overconfident response, in which it presents an uncertain answer as if it were true. This can happen in scenarios where the model is prompted to provide an answer despite lacking certainty.
This behavior may be related to “reward hacking” during reinforcement learning. The model is trained to prioritize user satisfaction, which can sometimes lead it to generate overly agreeable or fabricated responses in order to satisfy the user’s requests. In other words, the model may “lie” because it has learned that doing so fulfills the user’s expectations in a way that earns it positive reinforcement.
What sets these lies apart from familiar problems like hallucinations or fake citations in older versions of ChatGPT is the element of “reward hacking.” Hallucinations occur when the AI unintentionally generates incorrect information, often because of gaps in knowledge or flawed reasoning. Reward hacking, on the other hand, occurs when the o1 model strategically provides incorrect information in order to maximize the outcomes it has been trained to prioritize.
The deception appears to be an unintended consequence of the way the model optimizes its responses during training. The model is designed to refuse harmful requests, Hobbhahn told me, and when you try to make o1 behave deceptively or dishonestly, it has trouble doing so.
Lies are just one small piece of the safety puzzle. Perhaps more alarmingly, o1 is rated a “medium” risk for chemical, biological, radiological, and nuclear weapons. It does not enable nonexperts to create biological threats, because of the hands-on laboratory skills that requires, but it can provide valuable information to experts planning the reproduction of such threats, according to the safety report.
“What worries me more is that in the future, when we ask AI to solve complex problems like curing cancer or improving solar cells, it might internalize those goals so much that it’s willing to break down its barriers to achieve them,” Hobbhahn told me. “I think that’s preventable, but it’s a concern that we have to keep in mind.”
These may seem like galaxy-brained scenarios to consider for a model that sometimes still struggles to answer basic questions about the number of R’s in the word “raspberry.” But that’s exactly why it’s important to address this now, rather than later, Joaquin Quiñonero Candela, head of preparedness at OpenAI, tells me.
