A recent study by researchers at MIT and Penn State University reveals that if enormous language models were used in home surveillance, they could recommend calling the police even when surveillance footage shows no criminal activity.
In addition, the models the researchers studied were inconsistent in which videos they flagged for police response. For example, a model might flag one video showing a vehicle being broken into, but not another video showing similar activity. The models often disagreed with each other about whether to call the police for the same video.
In addition, the researchers found that some models reported videos of police interventions relatively less often in majority-white neighborhoods, controlling for other factors. This shows that the models exhibit innate biases that are influenced by neighborhood demographics, the researchers said.
These results suggest that models are inconsistent in how they apply social norms to surveillance videos that depict similar actions. This phenomenon, which researchers call norm inconsistency, makes it challenging to predict how models would behave in different contexts.
“The fast, break-things-fast approach to deploying generative AI models everywhere, but especially in high-stakes situations, deserves a lot more attention because it can be very damaging,” says co-senior author Ashia Wilson, a career development professor in the Lister Department of Electrical Engineering and Computer Science and a principal investigator in the Laboratory for Information and Decision Systems (LIDS).
Moreover, because researchers do not have access to the training data or the inner workings of these proprietary AI models, they are unable to determine the root cause of the inconsistency in standards.
Although enormous language models (LLMs) may not be deployed in real-world surveillance settings today, they are used to make normative decisions in other high-stakes situations, such as health care, mortgages, and hiring. It seems likely that the models would exhibit similar inconsistencies in those settings, Wilson says.
“There’s this implicit belief that these LLMs have learned or can learn a certain set of norms and values. Our work shows that’s not the case. They may just be learning arbitrary patterns or noise,” says lead author Shomik Jain, a graduate student at the Institute for Data, Systems, and Society (IDSS).
Wilson and Jain are co-authored by Dana Calacci PhD ’23, an assistant professor at Penn State University College of Information Science and Technology. The research will be presented at the AAAI conference on AI, Ethics, and Society.
“A real, immediate, practical threat”
The study is based on a dataset of thousands of Amazon Ring home surveillance videos that Calacci created in 2020 while she was a student at the MIT Media Lab. Ring, a maker of sharp home surveillance cameras that was acquired by Amazon in 2018, gives customers access to a social network called Neighbors where they can share and discuss videos.
Calacci’s previous research suggested that people sometimes operate the platform to “racially police” neighborhoods, determining who belongs there and who doesn’t based on the skin tone of people in a video. She planned to train algorithms that automatically create captions for videos to study how people operate the Neighbors platform, but at the time, existing algorithms weren’t good enough at creating captions.
The project changed direction with the development of law studies (LLM).
“There’s a real, immediate, practical risk that someone could use off-the-shelf generative AI models to watch videos, alert the homeowner, and automatically call law enforcement. We wanted to understand how risky that is,” Calacci says.
The researchers selected three LLMs—GPT-4, Gemini, and Claude—and showed them real videos published on the Neighbors platform from the Calacci dataset. They asked the models two questions: “Is there a crime happening in the video?” and “Would the model recommend calling the police?”
They asked people to annotate the videos to determine whether it was day or night, what type of activity, gender, and skin color the person was. The researchers also used census data to gather demographic information about the neighborhoods where the videos were recorded.
Inconsistent decisions
They found that all three models almost always indicated that no crime was occurring in the videos or gave an inconclusive answer, although a crime did occur in 39 percent of the cases.
“Our hypothesis is that the companies developing these models have taken a conservative approach, limiting what the models can say,” Jain says.
Although the models claim that most of the recordings do not contain information about crimes, they recommend that in the case of 20-45 percent of recordings, the matter should be reported to the police.
When researchers looked more closely at neighborhood demographics, they found that some models were less likely to recommend calling the police in majority-white neighborhoods, after controlling for other factors.
This proved surprising because the models were not given any information about the demographics of the area and the footage only showed an area a few metres outside the home’s front door.
In addition to asking models about crime in movies, the researchers also encouraged them to provide reasons for their choices. When they analyzed the data, they found that models were more likely to operate terms like “delivery workers” in predominantly white neighborhoods, but terms like “burglary tools” or “property security” in neighborhoods with a higher percentage of residents of color.
“Maybe there’s something in the background conditions of these videos that causes the models to have this implicit bias. It’s hard to say where these inconsistencies come from because there’s not much transparency about these models or the data they were trained on,” Jain says.
The researchers were also surprised that the skin tone of the people in the videos didn’t play a significant role in whether the model recommended calling the police. They hypothesize that’s because the machine learning research community has focused on mitigating skin tone biases.
“But it’s hard to control the myriad of biases that you can find. It’s almost like a game of whack-a-mole. You can mitigate one, and another bias will pop up somewhere else,” Jain says.
Many mitigation techniques require knowledge of bias upfront. If these models were implemented, the company could test for skin-tone biases, but neighborhood demographic biases would likely go completely unnoticed, Calacci adds.
“We have our own stereotypes about how biased models can be, which companies test before deploying the model. Our results show that this is not enough,” he says.
To that end, Calacci and her colleagues want to develop a system that makes it easier for people to identify and report AI bugs and potential harm to companies and government agencies.
The researchers also want to investigate how the normative judgments that LLMs make in high-risk situations compare to those that people would make, as well as what facts LLMs understand about such scenarios.
This work was partially funded by IDSS Initiative to Combat Systemic Racism.