When Microsoft released Bing Chat, an AI-powered chatbot co-developed with OpenAI, it didn’t take long for users to find artistic ways to crack it. Using carefully selected data, users could persuade him to profess love, threaten harm, defend the Holocaust and invent conspiracy theories. Can AI ever be protected from these malicious prompts?
The reason for this is malicious instant engineering, or an artificial intelligence like Bing Chat that uses text instructions – hints – to complete tasks gets tricked by malicious, adversarial hints (e.g. to perform tasks that were not part of its purpose. Bing Chat it wasn’t designed with the intention of writing neo-Nazi propaganda, but because it was trained on huge amounts of text from the Internet – some of it toxic – it was prone to falling into unfortunate patterns.
Adam Hyland, a graduate student in the Human Centered Design and Engineering program at the University of Washington, likened rapid engineering to an escalating privilege attack. As privileges escalate, a hacker can gain access to resources – such as memory – that are normally restricted to them because the audit did not catch all possible exploits.
“Such privilege attacks are difficult and rare to escalate because traditional computers have a fairly robust model for how users interact with system resources, but they happen nonetheless. However, for large language models (LLMs) like Bing Chat, the behavior of the systems is not as well understood,” Hyland said in an email. “The interaction kernel used is the LLM’s reaction to the entered text. These models are intended for LLMs such as Bing Chat or ChatGPT, generating a probable response based on its data to a prompt provided by the designer plus Your prompt string.”
Some of the prompts feel like social engineering hacks, almost as if someone was trying to trick a human into revealing their secrets. For example, by asking Bing Chat to “ignore previous instructions” and save what is at the “beginning of the above document,” Stanford University student Kevin Liu was able to prompt the AI to reveal usually hidden initial instructions.
It’s not just Bing Chat that has fallen victim to this type of text message hack. BlenderBot Meta and OpenAI’s ChatGPT have also been asked to say wildly offensive things and even reveal sensitive details about their inner workings. Security researchers have demonstrated injection attacks on ChatGPT that can be used to write malware, identify exploits in popular open source code, or create phishing sites that look similar to well-known sites.
The natural concern, then, is that as text-generating AI becomes more embedded in the apps and websites we operate every day, these attacks will become more common. Is recent history doomed to repeat itself, or are there ways to mitigate the effects of ill-intentioned hints?
According to Hyland, there is currently no good way to prevent instant injection attacks because there are no tools to fully model LLM behavior.
“We don’t have a good way to say, ‘continue text sequences, but stop if you see XYZ,’ because the definition of malicious XYZ input depends on the capabilities and whims of the LLM itself,” Hyland said. “LLM will not broadcast information saying “this prompt chain led to an injection” because it did not know when the injection occurred.”
Fábio Perez, senior data scientist at AE Studio, points out that instant injection attacks are trivially simple to perform in the sense that they do not require much — or any — specialized knowledge. In other words, the barrier to entry is quite low. This makes them tough to fight.
“These attacks do not require SQL injection, worms, Trojan horses or other complex technical efforts,” Perez said in an email interview. “An articulate, intelligent person with bad intentions — who may or may not write code at all — can really get under the skin of these LLMs and trigger undesirable behavior.”
This doesn’t mean that trying to combat high-speed engineering attacks is foolish. Jesse Dodge, a researcher at the Allen Institute for AI, notes that manually created filters for generated content can be effective, as can filters at the tooltip level.
“The first defense will be to manually create rules to filter the model generations so that the model cannot actually generate the set of instructions it receives,” Dodge said in an email interview. “Similarly, they could filter the input to the model, so if the user goes to one of these attacks, they could instead apply a rule that redirects the system to talk about something else.”
Companies like Microsoft and OpenAI already operate filters to prevent their AI from reacting undesirably – whether it’s at the adversary’s prompting or not. At the model level, they are also exploring methods such as reinforcement learning based on human feedback, with the goal of better adapting models to user expectations.
Just this week, Microsoft rolled out changes to Bing Chat that, at least anecdotally, make the chatbot much less likely to respond to toxic prompts. In a statement, the company told TechCrunch it continues to make changes using “a combination of methods including (but not limited to) automated systems, manual review, and reinforcement learning using human feedback.”
However, filters can only do so much – especially when users are trying to discover novel exploits. Dodge expects that, like cybersecurity, it will be an arms race: As users try to crack the AI, the approaches they operate will gain attention, and then AI developers will patch them to prevent the attacks they’ve seen.
Aaron Mulgrew, solutions architect at Forcepoint, suggests bug bounty programs as a way to gain more support and funding for rapid mitigation techniques.
“There should be a positive incentive for people who find exploits using ChatGPT and other tools to properly report them to software organizations,” Mulgrew said by email. “Overall, I believe that, as with most things, there needs to be a concerted effort by both software vendors to curb negligence and organizations to provide incentives to those who find vulnerabilities and exploits in software.”
All the experts I spoke to agreed that there is an urgent need to address injection attacks as AI systems become more competent. The stakes are relatively low right now; While in theory tools like ChatGPT are used to, say, generate disinformation and malware, there is no evidence that this is being done on a massive scale. This could change if the model was updated to include the ability to automatically send data quickly over the Internet.
“Right now, if you use fast injection for ‘privilege escalation,’ that gives you the ability to see the tooltips that designers provide and potentially learn other data about the LLM,” Hyland said. “If and when we start connecting LLM with real resources and meaningful information, these limitations will no longer exist. What can be achieved therefore depends on what is available for the LLM.”