Danish media outlets have demanded that the nonprofit web archive Common Crawl remove copies of their articles from past datasets and stop crawling their websites immediately. The request comes amid growing outrage over how artificial intelligence companies such as OpenAI use copyrighted material.
Common Crawl plans to comply with the request, which was first issued on Monday. Executive director Rich Skrenta says the organization is “not prepared” to fight media companies and publishers in court.
The campaign was initiated by the Danish Rights Alliance (DRA), an association representing copyright owners in Denmark. It submitted the request on behalf of four media outlets, including Berlingske Media and the daily Jyllands-Posten. The New York Times made a similar request of Common Crawl last year, before filing a lawsuit against OpenAI for using its work without permission. In its complaint, The New York Times highlighted that the Common Crawl data was the “most weighted dataset” in GPT-3.
Thomas Heldrup, DRA’s head of content protection and enforcement, says the Times was the inspiration for this new effort. “Common Crawl is unique in the sense that we see many large AI companies using their data,” Heldrup says. He sees its corpus as a threat to media companies trying to negotiate with the titans of artificial intelligence.
While Common Crawl has played a key role in the development of many text-based generative AI tools, it was not designed with AI in mind. Founded in 2007, the San Francisco-based organization was best known before the AI boom for its value as a research tool. “Common Crawl is embroiled in a conflict over copyright and generative AI,” says Stefan Baack, a data scientist at the Mozilla Foundation, who recently published a report on the role of Common Crawl in AI training. “For many years it was a small, niche project that almost no one knew about.”
Before 2023, Common Crawl did not receive a single request for data redaction. Now, in addition to the requests from The New York Times and this group of Danish publishers, it is also seeing an increase in requests that have not been made public.
In addition to this surge in data redaction requests, CCBot, the Common Crawl web crawler, is increasingly being blocked from collecting new data from publishers. According to the AI detection startup Originality AI, which frequently tracks the use of web crawlers, more than 44 percent of the world’s leading news and media sites block CCBot. Apart from BuzzFeed, which began blocking it in 2018, most of the leading media outlets it analyzes, including Reuters, the Washington Post and the CBC, began blocking the crawler only in the past year. “They are getting blocked more and more,” Baack says.
Common Crawl’s quick response to these types of requests is driven by the realities of keeping a small nonprofit afloat. However, compliance does not equal ideological agreement. Skrenta sees this push to remove archival material from data repositories like Common Crawl as nothing less than an affront to the internet as we know it. “It’s an existential threat,” he says. “They will kill the open web.”
