Monday, December 23, 2024

Hey Alexa! Sorry I fooled you…


A person can probably tell the difference between a turtle and a rifle. Two years ago, Google’s AI wasn’t this bright. For some time now, part of computer science research has been devoted to better understanding how machine learning models handle “adversarial” attacks – inputs intentionally crafted to fool or deceive machine learning algorithms.

Although most of this work has focused on speech and images, a team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) recently tested the limits of text. They developed “TextFooler,” a general framework that can effectively attack natural language processing (NLP) systems – the kinds of systems that let us interact with voice assistants like Siri and Alexa – and “trick” them into making incorrect predictions.

You can imagine using TextFooler for many internet security applications, such as filtering email spam, flagging hate speech, or detecting “sensitive” political speech in text – all of which rely on text classification models.

“If these tools are vulnerable to an intentional adversarial attack, the consequences could be catastrophic,” says Di Jin, an MIT graduate student and lead author of a new paper on TextFooler. “These tools need an effective defense approach to protect themselves, and to create such a secure defense system, we must first examine the adversarial methods.”

TextFooler consists of two parts: altering a given text, and then using that text to test two different language tasks to see whether the system can successfully fool machine learning models.

The system first identifies the words that most influence the target model’s prediction, and then selects synonyms that fit the context. It preserves the grammar and the original meaning, so the text still reads as sufficiently “human,” and keeps substituting words until the prediction changes.
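To make that procedure concrete, here is a minimal, hypothetical Python sketch of the greedy loop, not the authors’ released implementation. It assumes a black-box predict_proba(text) classifier and a get_synonyms(word) lookup (both stand-ins introduced here), and it omits the part-of-speech and sentence-similarity checks the full method uses to keep substitutions grammatical and meaning-preserving.

```python
# Minimal sketch of a TextFooler-style greedy attack (illustrative only).
# predict_proba(text) -> {label: probability} and get_synonyms(word) are
# hypothetical stand-ins for a black-box classifier and a synonym source.

def word_importance(words, label, predict_proba):
    """Rank word positions by how much deleting each word lowers the
    probability of the currently predicted label."""
    base = predict_proba(" ".join(words))[label]
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        drop = base - predict_proba(" ".join(reduced))[label]
        scores.append((drop, i))
    return [i for _, i in sorted(scores, reverse=True)]

def textfooler_attack(text, predict_proba, get_synonyms):
    """Swap important words for synonyms until the prediction flips."""
    words = text.split()
    probs = predict_proba(text)
    label = max(probs, key=probs.get)
    for i in word_importance(words, label, predict_proba):
        best, best_prob = None, predict_proba(" ".join(words))[label]
        for syn in get_synonyms(words[i]):
            candidate = words[:i] + [syn] + words[i + 1:]
            prob = predict_proba(" ".join(candidate))[label]
            if prob < best_prob:          # keep the swap that hurts the label most
                best, best_prob = candidate, prob
        if best is not None:
            words = best
            new_probs = predict_proba(" ".join(words))
            if max(new_probs, key=new_probs.get) != label:
                return " ".join(words)    # prediction changed: attack succeeded
    return None                           # no adversarial example found
```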

The framework is then applied to two different tasks – text classification and entailment (the relationship between text fragments in a sentence) – with the goal of changing the classification or invalidating the entailment judgment of the original models.
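For the entailment task the input is a premise–hypothesis pair rather than a single text. One natural way to reuse the same attack loop, assumed here rather than taken from the paper, is to hold the premise fixed and perturb only the hypothesis, using a small adapter with illustrative names:

```python
# Hypothetical adapter for an entailment model that scores (premise, hypothesis)
# pairs with labels such as "entailment", "neutral", "contradiction".
# It fixes the premise and exposes the single-text interface the attack expects.

def make_pair_classifier(premise, pair_predict_proba):
    def predict_proba(hypothesis):
        return pair_predict_proba(premise, hypothesis)
    return predict_proba

# Example use (all names are illustrative):
# adv_hypothesis = textfooler_attack(
#     hypothesis, make_pair_classifier(premise, nli_predict_proba), get_synonyms)
```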

In one example, the TextFooler input and output were as follows:

“The characters, cast in impossibly contrived situations, are totally estranged from reality.”

“The characters, cast in impossibly engineered circumstances, are fully estranged from reality.”

In this case, when testing the NLP model, the original input is classified correctly, but the modified input is classified incorrectly.

In total, TextFooler successfully attacked three target models, including “BERT,” a popular open-source NLP model. It fooled the target models, cutting their accuracy from over 90 percent to below 20 percent while changing only 10 percent of the words in a given text. The team judged success by three criteria: whether the model’s classification or entailment prediction changed; whether the adversarial text looked similar in meaning to the original example to a human reader; and whether the text looked natural enough.
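As a rough illustration of how such headline numbers can be computed, the sketch below measures after-attack accuracy and the fraction of words changed per example. The function names and interfaces are illustrative, not the paper’s evaluation code, and the perturbation-rate helper assumes word-for-word synonym swaps that preserve length.

```python
# Illustrative attack metrics: accuracy after the attack and perturbation rate.

def after_attack_accuracy(texts, gold_labels, predict, attack):
    """Accuracy of `predict` on adversarial versions of `texts`."""
    correct = 0
    for text, gold in zip(texts, gold_labels):
        adv = attack(text) or text        # fall back to the original if the attack fails
        correct += int(predict(adv) == gold)
    return correct / len(texts)

def perturbation_rate(original, adversarial):
    """Fraction of words changed, assuming one-for-one word substitutions."""
    orig, adv = original.split(), adversarial.split()
    changed = sum(o != a for o, a in zip(orig, adv))
    return changed / len(orig)
```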

The researchers note that while attacking existing models is not the end goal, they hope this work will help more abstract models generalize to new, unseen data.

“The system can be used or extended to attack any classification-based NLP models to check their robustness,” Jin says. “On the other hand, the generated adversaries can be used to improve the robustness and generalization of deep learning models through adversarial training, which is a key direction of this work.”

Jin wrote the paper with MIT professor Peter Szolovits, Zhijing Jin of the University of Hong Kong, and Joey Tianyi Zhou of A*STAR in Singapore. They will present the paper at the AAAI Conference on Artificial Intelligence in New York.
