Apple, Anthropic and Other Companies Used YouTube Videos to Train AI

More than 170,000 YouTube videos are part of a massive dataset that has been used to train AI systems at some of the biggest tech companies, according to an investigation conducted by Proof News and co-published with Wired. Apple, Anthropic, Nvidia, and Salesforce are among the tech companies that have used the “YouTube Subtitles” data, which was scraped from the video platform without permission. The training dataset is a collection of subtitles pulled from YouTube videos from over 48,000 channels; it does not include images from the videos.

The dataset features videos from popular creators like MrBeast and Marques Brownlee, as well as clips from news outlets like ABC News, the BBC, and The New York Times. Over 100 videos from The Verge appear in the dataset, along with many other videos from Vox.

“Apple has acquired data for its AI from several companies,” Brownlee, who goes by MKBHD, wrote in a post on X. “One of them pulled tons of data/transcripts from YouTube videos, including mine.” He added: “This will be an ongoing problem for a long time.”

YouTube did not immediately respond to The Verge’s request for comment.

As part of its ongoing investigation, Proof News also released an interactive search tool. You can use the search feature to see if your content, or that of your favorite YouTuber, appears in the dataset.

The subtitles dataset is part of a larger collection of material from the nonprofit EleutherAI called The Pile, an open-source collection that also includes datasets of books, Wikipedia articles, and more. Last year, an analysis of one dataset called Books3 revealed which authors’ work was used to train AI systems, and the dataset was cited in authors’ lawsuits against companies that used it to train artificial intelligence.

AI companies are rarely forthcoming about the data that goes into their AI systems; exactly how YouTube content is used has been a key question in recent months. In March, when OpenAI unveiled its powerful video-generating tool Sora, CTO Mira Murati repeatedly dodged questions about whether the system was trained on YouTube videos.

“I won’t go into details about the data used, but it was publicly available or licensed data,” she told The Wall Street Journal at the time. After the Journal pressed her specifically about YouTube content, Murati said she “wasn’t sure about that.”

In previous interviews, YouTube CEO Neal Mohan has said that using video content to train AI, including transcripts, would violate the platform’s terms. And in a May episode of Decoder, Google CEO Sundar Pichai agreed with Mohan’s assessment that if OpenAI had actually trained Sora on content available on YouTube, it would have violated YouTube’s terms.

“We have a policy and we expect people to follow it when we build a product, so that’s how I saw it,” Pichai said.
