Friday, January 10, 2025

Meta secretly trained its artificial intelligence on a notorious piracy database, recently unredacted court documents reveal

“Meta treated the so-called ‘public availability’ of shadow datasets as a chance to get out of jail, even though Meta’s internal records show that every relevant decision maker at Meta, including its CEO, Mark Zuckerberg, knew that LibGen was a ‘dataset, which we know is pirated,’” the plaintiffs claim in their motion. (The motion, originally filed in late 2024, is a request to file a third amended complaint.)

In addition to the plaintiffs’ letters, another filing was unredacted in response to Chhabria’s order: Meta’s opposition to the motion to file an amended complaint. It argues that the authors’ attempts to add claims to the case constitute an “eleventh-hour gambit based on false and inflammatory premises” and denies that Meta waited to reveal key information during discovery. Instead, Meta claims that it first disclosed to the plaintiffs that it was using the LibGen dataset in July 2024. (Since much of the discovery material remains confidential, it is difficult for WIRED to corroborate this claim.)

Meta’s argument rests on the claim that the plaintiffs already knew about its use of LibGen and should not be given additional time to file a third amended complaint when they had ample time to do so before discovery concluded in December 2024. “Plaintiffs knew of Meta’s downloading and use from LibGen and other alleged ‘shadow libraries’ from at least mid-July 2024,” lawyers for the tech giant argue.

In November 2023, Chhabria granted Meta’s motion to dismiss certain claims in the lawsuit, including the claim that Meta’s alleged use of the authors’ works to train artificial intelligence violates the U.S. Digital Millennium Copyright Act (DMCA), enacted in 1998 to stop people from selling or reproducing copyrighted works on the Internet. At the time, the judge agreed with Meta’s position that the plaintiffs had not presented sufficient evidence that the company removed so-called “copyright management information,” such as the author’s name and the title of the work.

The unredacted documents argue that the plaintiffs should be allowed to amend their complaint, maintaining that Meta’s disclosures are evidence that the DMCA claim was valid. They also claim that the discovery process revealed reasons to add new claims. “Meta, through a company representative who testified on November 20, 2024, has now admitted under oath to uploading (so-called ‘seeding’) pirated files containing Plaintiffs’ works on ‘torrent’ websites,” the motion states. (Seeding is when torrented files are shared with other users once they have finished downloading.)

“The torrenting activity turned Meta itself into a distributor of the same pirated copyrighted material that it also downloaded for use in its commercially available artificial intelligence models,” states one of the newly unredacted documents. In other words, it alleges that Meta not only used copyrighted material without permission but also distributed it.

LibGen, an archive of books uploaded to the Internet that was established in Russia around 2008, is one of the largest and most controversial “shadow libraries” in the world. In 2015, a New York judge ordered a preliminary injunction against the site, which in theory was intended to temporarily shut down the archive, but its anonymous administrators simply changed its domain. In September 2024, another New York judge ordered LibGen to pay $30 million to rights holders for violating their copyrights, even though it is not known who actually runs the piracy hub.

Meta’s discovery problems in this case are not over, either. In the same order, Chhabria warned the tech giant against any overly sweeping redaction requests in the future: “If Meta again submits an unreasonable, sweeping sealing request, all materials will simply be unsealed,” he wrote.
