Wednesday, January 15, 2025

Inside Meta’s Race to Beat OpenAI: “We Must Learn to Build Limits and Win This Race”


A major copyright lawsuit against Meta has revealed a slew of internal communications about the company’s plans to develop its open-source artificial intelligence models, Llama, including discussions about avoiding “media reports suggesting that we have used a data set we know is pirated.”

In an October 2023 email to Meta AI researcher Hugo Touvron, Ahmad Al-Dahle, Meta’s vice president of generative AI, wrote that the company’s goal “has to be GPT4,” referring to OpenAI’s large language model announced in March 2023. Meta had to “learn how to build boundaries and win this race,” Al-Dahle added. These plans apparently included using the book piracy site Library Genesis (LibGen) to train the company’s artificial intelligence systems.

An undated email from Meta director of product Sony Theakanath to VP of AI Research Joelle Pineau weighed whether to use LibGen internally only, for the benchmarks included in a blog post, or to create a model trained on the site. In the email, Theakanath writes that “GenAI has been approved to use LibGen for Llama3… with a number of agreed-upon mitigations” after the matter was escalated to “MZ” – presumably Meta CEO Mark Zuckerberg. As noted in the email, Theakanath believed that “Libgen is essential to meet SOTA [state-of-the-art] numbers,” adding: “It is known that OpenAI and Mistral are using the library for their models (through word of mouth).” Mistral and OpenAI have not said whether they use LibGen. (The Verge contacted both for more information.)

Screenshot: The Verge

The court documents come from a class-action lawsuit that author Richard Kadrey, comedian Sarah Silverman, and others filed against Meta, accusing it of using illegally obtained copyrighted content to train its artificial intelligence models in violation of their intellectual property rights. Meta, like other artificial intelligence companies, has argued that the use of copyrighted material in training data should constitute legal fair use. The Verge reached out to Meta for comment but did not immediately receive a response.

Some of the “mitigations” for using LibGen included stipulations that Meta must “remove data clearly marked as pirated/stolen” while avoiding externally citing “the use of any training data” from the site. Theakanath’s email also said the company would need to “put together a team” to test its models for “bioweapons and CBRNE [Chemical, Biological, Radiological, Nuclear, and Explosives]” risks.

The email also discussed some of the “policy risks” of using LibGen, including how regulators might respond to media reports of Meta’s use of pirated content. “This may undermine our negotiating position with regulators on these issues,” the email said. An April 2023 conversation between Meta researcher Nikolay Bashlykov and AI team member David Esiobu also showed Bashlykov admitting he was “not sure we can use Meta IPs to load through torrents pirated content.”

Other internal documents show the steps Meta took to strip copyright information from LibGen training data. A document titled “Observations on LibGen-SciMag” highlights comments left by staff on how to improve the dataset. One suggestion is to “remove more copyright headers and document identifiers,” which includes any lines containing “ISBN,” “Copyright,” “All Rights Reserved,” or the copyright symbol. Other notes mention removing more metadata “to avoid potential legal complications,” as well as considering whether to remove the list of an article’s authors “to reduce liability.”

Screenshot: The Verge

Last June, The New York Times reported on Meta’s scramble in the wake of ChatGPT’s debut, revealing that the company had hit a wall: it had used up almost every available English-language book, article, and poem it could find on the internet. Desperate for more data, executives reportedly discussed purchasing Simon & Schuster outright and considered hiring contractors in Africa to summarize books without permission.

In the report, some executives justified their approach by pointing to OpenAI’s “market precedent” of using copyrighted works, while others argued that Google’s 2015 court victory establishing its right to scan books could provide legal cover. “The only thing that’s holding us back from being as good as ChatGPT is literally just data volume,” one executive said in a meeting, per The New York Times.

It has been reported that frontier labs like OpenAI and Anthropic have hit a data wall, meaning they don’t have enough new data to train their large language models. Many leaders have denied this; OpenAI CEO Sam Altman put it plainly: “There is no wall.” OpenAI co-founder Ilya Sutskever, who left the company last May to start a new frontier lab, was more forthcoming about the potential of a data wall. At a major AI conference last month, Sutskever said: “We’ve achieved peak data and there’ll be no more. We have to deal with the data that we have. There’s only one internet.”

This data scarcity has led to some strange new ways of obtaining unique data. Bloomberg reported that frontier labs like OpenAI and Google have been paying digital content creators between $1 and $4 per minute for their unused video footage through a third party in order to train LLMs (both companies have competing AI video-generation products).

As companies like Meta and OpenAI race to develop their AI systems as quickly as possible, things are bound to get murky. Although a judge partially dismissed Kadrey and Silverman’s class-action lawsuit last year, the evidence revealed here could strengthen parts of their case as it proceeds in court.
