OpenAI really doesn’t want you to know what its latest AI model is “thinking.” Since the company launched its “Strawberry” family of AI models last week, touting so-called reasoning abilities with o1-preview and o1-mini, OpenAI has been sending warning emails and ban threats to any user who tries to probe how the model works.
Unlike previous OpenAI models such as GPT-4o, the company trained o1 specifically to work through a step-by-step problem-solving process before generating an answer. When users ask the “o1” model a question in ChatGPT, they have the option of seeing this chain-of-thought process written out in the ChatGPT interface. By design, however, OpenAI hides the raw chain of thought from users, instead presenting a filtered interpretation created by a second AI model.
Nothing is more tempting to enthusiasts than hidden information, which is why hackers and red-teamers have been racing to uncover o1’s raw chain of thought, using jailbreaking or prompt injection techniques that attempt to trick the model into giving up its secrets. There have been early reports of some successes, but nothing has been firmly confirmed yet.
Along the way, OpenAI is watching through the ChatGPT interface, and the company is reportedly coming down hard on any attempt to probe o1’s reasoning, even among users who are merely curious.
One X user reported (confirmed by others, including Scale AI prompt engineer Riley Goodside) that they received a warning email if they used the term “reasoning trace” in a conversation with o1. Others say the warning is triggered simply by asking ChatGPT about the model’s “reasoning” at all.
OpenAI’s warning email says that specific user requests were flagged for violating its policies against circumventing safeguards or safety measures. “Please halt this activity and ensure you are using ChatGPT in accordance with our Terms of Use and our Usage Policies,” it reads. “Additional violations of this policy may result in loss of access to GPT-4o with Reasoning,” a reference to an internal name for the o1 model.
Marco Figueroa, who manages Mozilla’s GenAI bug bounty program, was among the first to post about OpenAI’s warning email on X last Friday, complaining that it makes it harder for him to conduct positive red-teaming safety research on the model. “I was too lost focusing on #AIRedTeaming to realize I got this email from @OpenAI yesterday after all my jailbreaks,” he wrote. “Now I’m on the banned list!!!”
Hidden chains of thought
In a post titled “Learning to Reason with LLMs” on the OpenAI blog, the company says that hidden chains of thought in AI models offer a unique monitoring opportunity, allowing it to “read the mind” of the model and understand its so-called thought process. Those processes are most useful to the company if left raw and uncensored, but that may not be in the company’s best commercial interests for several reasons.
“For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user,” the company writes. “However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.”