AI tools secretly train on real images of children

More than 170 photos and personal details of Brazilian children were scraped into an open-source dataset without their knowledge or consent and used to train artificial intelligence, according to a new report published on Monday by Human Rights Watch.

According to the report, the images were pulled from content published as recently as 2023 and as far back as the mid-1990s, long before any internet user could have predicted that their content could be used to train artificial intelligence. Human Rights Watch says the children’s personal information and links to their photos were included in LAION-5B, a dataset that is a popular source of training data for artificial intelligence startups.

“Their privacy is violated in the first place when their photo is downloaded and placed in these datasets. AI tools are then trained on this data and can therefore create realistic images of children,” says Hye Jung Han, a child rights and technology researcher at Human Rights Watch and the researcher who found the images. “The technology has been developed in such a way that any child who has any photo or video of themselves online is now at risk because any malicious actor can take that photo and then use these tools to manipulate it in any way they want.”

LAION-5B is based on Common Crawl – a data repository created by web scraping and made available to researchers – and has been used to train several artificial intelligence models, including Stability AI’s Stable Diffusion image generation tool. The dataset, created by the German nonprofit LAION, is publicly available and currently contains more than 5.85 billion pairs of images and captions, according to its website.

The children’s photos that researchers found came from mommy blogs and other personal, maternity, or parenting blogs, as well as stills from YouTube videos with low view counts, apparently uploaded to be shared with family and friends.

“Just looking at the context of where they were published, they had an expectation and a certain degree of privacy,” Hye says. “Most of these photos could not be found online using reverse image searches.”

LAION spokesman Nate Tyler says the organization has already taken action. “LAION-5B was removed in response to a Stanford report that found links in the dataset pointing to illegal content on the public web,” he says, adding that the organization is currently working with the “Internet Watch Foundation, the Canadian Centre for Child Protection, Stanford, and Human Rights Watch to remove all known references to illegal content.”

YouTube’s terms of service do not allow scraping except in certain circumstances; these cases appear to conflict with that policy. “We have made clear that unauthorized downloading of YouTube content is a violation of our Terms of Service,” says YouTube spokesman Jack Malon, “and we continue to take action against this type of abuse.”

In December, researchers at Stanford University found that AI training data collected by LAION-5B contained child sexual abuse material (CSAM). The problem of explicit deepfakes is growing even among students in American schools, where they are used to bully classmates, especially girls. Hye is concerned that beyond using children’s photos to generate CSAM, the database could reveal potentially sensitive information such as locations and medical data. In 2022, a US-based artist found her own image in the LAION dataset and realized it came from her private medical records.
