OpenAI has faced criticism in recent months from those who suggest it may be rushing too quickly and recklessly to develop more powerful AI. The company appears intent on showing that it takes AI safety seriously. Today it unveiled research that it says could help researchers scrutinize AI models even as they become more powerful and useful.
The new technique is one of several AI safety ideas the company has touted in recent weeks. It involves two AI models engaging in a conversation that forces the more powerful one to be more transparent, or “readable,” in its reasoning so that humans can understand what it is doing.
“This is the heart of the mission of building [artificial general intelligence] that is both safe and beneficial,” Yining Chen, an OpenAI researcher involved in the study, told WIRED.
So far, the work has been tested on an AI model designed to solve simple math problems. The OpenAI researchers asked the AI model to explain its reasoning when answering questions or solving problems. A second model was trained to detect whether the answers were correct or incorrect, and the researchers found that having the two models engage in a back-and-forth encouraged the math-solving model to be more direct and transparent in its reasoning.
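The solver-and-checker dynamic described above can be illustrated with a toy sketch. This is not OpenAI's implementation: in the research both roles are played by trained models, whereas here they are hand-written functions, and every name (`Solution`, `prover`, `verifier`, `reward`) is hypothetical. The sketch only shows why a solver that is rewarded solely when a checker can follow its work has an incentive to spell out its reasoning.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    steps: list[str]  # human-readable reasoning, e.g. "3 + 4 = 7"
    answer: int


def prover(numbers: list[int], show_work: bool) -> Solution:
    """Toy 'solver': sums the numbers, optionally writing out each step."""
    total = 0
    steps = []
    for n in numbers:
        new_total = total + n
        if show_work:
            steps.append(f"{total} + {n} = {new_total}")
        total = new_total
    return Solution(steps=steps, answer=total)


def verifier(numbers: list[int], solution: Solution) -> bool:
    """Toy 'checker': accepts an answer only if every step can be re-checked."""
    if len(solution.steps) != len(numbers):
        return False  # reasoning is missing, so there is nothing to check
    running = 0
    for n, step in zip(numbers, solution.steps):
        lhs, rhs = step.split(" = ")
        a, b = lhs.split(" + ")
        if int(a) != running or int(b) != n or int(rhs) != running + n:
            return False
        running += n
    return solution.answer == running


def reward(numbers: list[int], solution: Solution) -> float:
    """Assumed training signal: the solver scores only when the checker is convinced."""
    return 1.0 if verifier(numbers, solution) else 0.0


problem = [3, 4, 5]
opaque = prover(problem, show_work=False)
legible = prover(problem, show_work=True)
print(opaque.answer, reward(problem, opaque))    # 12 0.0
print(legible.answer, reward(problem, legible))  # 12 1.0
```

Both solutions reach the same answer, but only the one that shows its work earns a reward, because the checker refuses anything it cannot verify step by step.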
OpenAI is publicly releasing a paper detailing the approach. “It’s part of a long-term safety research plan,” says Jan Hendrik Kirchner, another OpenAI researcher involved in the work. “We hope that other researchers can follow suit and maybe try other algorithms as well.”
Transparency and explainability are key issues for AI researchers working to build more powerful systems. Large language models sometimes offer reasonable explanations for how they arrived at a conclusion, but a key concern is that future models could become more opaque or even misleading in the explanations they provide, possibly pursuing an undesirable goal while lying about it.
The research revealed today is part of a broader effort to understand how the large language models at the core of programs like ChatGPT work. It is one of many techniques that could help make more powerful AI models more transparent and therefore safer. OpenAI and others are also exploring more mechanistic ways to look inside large language models.
OpenAI has been revealing more of its AI safety work in recent weeks after criticism of its approach. In May, WIRED learned that a team of researchers dedicated to studying long-term AI risks had been disbanded. That happened shortly after the departure of co-founder and key technical leader Ilya Sutskever, who was among the board members who briefly ousted CEO Sam Altman last November.
OpenAI was founded on the promise of making AI both more open to scrutiny and safer. After the runaway success of ChatGPT and increased competition from well-backed rivals, some have accused the company of prioritizing flashy advances and market share over safety.
Daniel Kokotajlo, a researcher who left OpenAI and signed an open letter criticizing the company’s approach to AI safety, says the new work is important but incremental, and that it doesn’t change the fact that the companies building the technology need more oversight. “The situation we’re in remains the same,” he says. “Opaque, unaccountable, unregulated corporations are racing to build AI superintelligence, with essentially no plan for how to control it.”
Another source with knowledge of OpenAI’s internal operations, who asked not to be identified because they were not authorized to speak publicly, said there also needs to be outside oversight of AI companies. “The question is whether they take seriously the kinds of processes and governance mechanisms you need to prioritize societal benefits over profit,” the source said. “Not whether they allow some of their own researchers to do certain safety-related activities.”
