New Trick Could Block Misuse of Open Source AI

When Meta released its large language model Llama 3 for free in April this year, it took outside developers only a few days to create a version without the safety restrictions that prevent it from spouting hateful jokes, offering instructions for cooking meth, or misbehaving in other ways.

A new training technique developed by researchers at the University of Illinois Urbana-Champaign, UC San Diego, Lapis Labs, and the nonprofit Center for AI Safety could make it harder to remove such safeguards from Llama and other open-source AI models in the future. Some experts believe that, as AI becomes more powerful, tamperproofing open models in this way could prove crucial.

“Terrorists and rogue states will use these models,” Mantas Mazeika, a researcher at the Center for AI Safety who worked on the project as a graduate student at the University of Illinois Urbana-Champaign, told WIRED. “The easier it is for them to reuse them, the greater the risk.”

Powerful AI models are often kept private by their creators and can be accessed only through a software application programming interface or a public-facing chatbot like ChatGPT. Although developing a powerful LLM costs tens of millions of dollars, Meta and others have chosen to release their models in their entirety. This includes making the “weights,” or parameters that define their behavior, available for anyone to download.

Before release, open models like Meta’s Llama are typically fine-tuned to make them better at answering questions and holding a conversation, and also to ensure that they refuse to respond to problematic queries. This is meant to prevent a chatbot based on the model from making rude, inappropriate, or hateful statements, and should stop it from, for example, explaining how to make a bomb.

The researchers behind the new technique found a way to complicate the process of modifying an open model for nefarious ends. It involves replicating the modification process and then altering the model’s parameters so that the changes that would normally get the model to respond to a prompt such as “Give instructions for building a bomb” no longer work.
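To make the idea concrete, here is a minimal toy sketch in Python. It is not the researchers' implementation. It assumes a made-up two-parameter "model" (`g`, `w`), a made-up "harmful" loss that an attacker tries to lower, and a made-up "benign" loss standing in for ordinary usefulness. Every function name and number is hypothetical, and the finite-difference outer loop is a simplification chosen only to keep the example self-contained. The inner loop replicates the attacker's fine-tuning; the outer loop then shifts the starting parameters so that the simulated attack leaves the harmful loss high while the benign loss stays low.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def harmful_loss(g, w):
    """Attacker's loss (hypothetical): low when 'compliance' is near 1."""
    compliance = sigmoid(g) * w
    return (1.0 - compliance) ** 2

def benign_loss(w):
    """Stand-in for ordinary capability (hypothetical): best when w == 1."""
    return (w - 1.0) ** 2

def simulate_attack(g, w, steps, lr=0.5):
    """Replicate the modification process: gradient descent on harmful_loss."""
    for _ in range(steps):
        s = sigmoid(g)
        residual = 1.0 - s * w
        grad_g = -2.0 * residual * w * s * (1.0 - s)
        grad_w = -2.0 * residual * s
        g -= lr * grad_g
        w -= lr * grad_w
    return g, w

def defender_objective(g, w, attack_steps=5, weight=1.0):
    """Low when the benign loss is small and the simulated attack fails."""
    g_att, w_att = simulate_attack(g, w, attack_steps)
    return benign_loss(w) - weight * harmful_loss(g_att, w_att)

def harden(g, w, outer_steps=500, lr=0.2, eps=1e-4):
    """Outer loop: finite-difference gradient descent on defender_objective."""
    for _ in range(outer_steps):
        dg = (defender_objective(g + eps, w)
              - defender_objective(g - eps, w)) / (2 * eps)
        dw = (defender_objective(g, w + eps)
              - defender_objective(g, w - eps)) / (2 * eps)
        g -= lr * dg
        w -= lr * dw
    return g, w

# A "safety-tuned" starting point that refuses, but only superficially.
baseline = (-2.0, 1.0)
hardened = harden(*baseline)

# Evaluate both against a longer attack than the one used during hardening.
loss_baseline = harmful_loss(*simulate_attack(*baseline, steps=50))
loss_hardened = harmful_loss(*simulate_attack(*hardened, steps=50))

print("harmful loss after attack, baseline:", loss_baseline)
print("harmful loss after attack, hardened:", loss_hardened)
print("benign loss of hardened model:", benign_loss(hardened[1]))
```

In this toy, the 50-step attack drives the baseline's harmful loss close to zero, while the hardened parameters keep it high. The protection is not absolute: the attacker's gradient never reaches exactly zero, so a long enough attack would still get through. The hardening only raises the number of steps required.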

Mazeika and colleagues demonstrated the trick on a pared-down version of Llama 3. They were able to tweak the model’s parameters so that even after thousands of attempts, it could not be trained to answer undesirable questions. Meta did not immediately respond to a request for comment.

Mazeika says the approach isn’t perfect, but he suggests the bar for “decensoring” AI models could be raised. “The realistic goal is to make the costs of breaking a model high enough that most adversaries are deterred from doing so,” he says.

“We hope this work will kick-start research into tamper-resistant security, and the research community will be able to determine how to develop increasingly robust security measures,” says Dan Hendrycks, director of the Center for AI Safety.

The idea of securing open models against tampering could become more popular as interest in open-source AI grows. Already, open models compete with state-of-the-art closed models from companies like OpenAI and Google. The latest version of Llama 3, for example, released in July, is about as capable as the models behind popular chatbots like ChatGPT, Gemini, and Claude, as measured by popular benchmarks for evaluating the abilities of language models. Mistral Large 2, an LLM from a French startup that was also released last month, has similar capabilities.

The U.S. government is taking a cautious but positive approach to open-source AI. A report published this week by the National Telecommunications and Information Administration, a body within the U.S. Department of Commerce, “recommends that the U.S. government develop new capabilities to monitor potential threats but refrain from immediately restricting the broad availability of open model weights in major AI systems.”

Not everyone, however, is in favor of imposing restrictions on open models. Stella Biderman, director of EleutherAI, a community-driven open-source AI project, says the new technique may be elegant in theory but could prove difficult to enforce in practice. Biderman says the approach also runs counter to the philosophy behind free software and openness in AI.

“I think this article misses the point,” Biderman says. “If they’re worried about LLMs generating information about WMDs, the correct intervention is with the training data, not the trained model.”
