More than 170,000 YouTube videos are part of a massive dataset that has been used to train AI systems at some of the biggest tech companies, according to an investigation conducted by Proof News and co-published with Wired. Apple, Anthropic, Nvidia, and Salesforce are among the tech companies that have used the “YouTube Subtitles” data, which was ripped from the video platform without permission. The training dataset is a collection of subtitles pulled from YouTube videos from more than 48,000 channels; it does not include images from the videos.
The dataset features videos from popular creators like MrBeast and Marques Brownlee, as well as clips from news outlets like ABC News, the BBC, and The New York Times. More than 100 videos from The Verge appear in the dataset, along with many other videos from Vox.
“Apple has acquired data for its AI from several companies,” Brownlee, known by his pseudonym MKBHD, wrote in a post on X. “One of them pulled tons of data/transcripts from YouTube videos, including mine.” He added: “This will be an ongoing problem for a long time.”
YouTube did not immediately respond to The Verge’s request for comment.
As part of the ongoing investigation, Proof News also released an interactive search tool. You can use the search feature to see if your content, or that of your favorite YouTuber, appears in the dataset.
The caption dataset is part of a larger collection of material from the nonprofit EleutherAI called The Pile, an open-source collection that also includes datasets of books, Wikipedia articles, and more. Last year, an analysis of one dataset, called Books3, revealed which authors’ work was used to train AI systems, and the dataset has been cited in authors’ lawsuits against companies that used it to train artificial intelligence.
AI companies are rarely forthcoming about the data that goes into their AI systems; exactly how YouTube content is used has been a key question in recent months. In March, when OpenAI unveiled its powerful video-generating tool Sora, CTO Mira Murati repeatedly dodged questions about whether the system was trained on YouTube videos.
“I won’t go into details about the data used, but it was publicly available or licensed data,” she told The Wall Street Journal at the time. When the Journal pressed her specifically about YouTube content, Murati said she “wasn’t sure about that.”
In previous interviews, YouTube CEO Neal Mohan has said that using video content, including transcripts, to train AI would violate the platform’s terms. And in a May episode of Decoder, Google CEO Sundar Pichai agreed with Mohan’s assessment that if OpenAI had actually trained Sora on content available on YouTube, it would have violated YouTube’s terms.
“We have a policy and we expect people to follow it when we build a product, so that’s how I saw it,” Pichai said.
