Harvard University announced Thursday that it is making available a high-quality dataset of nearly one million public domain books that anyone can use to train large language models and other artificial intelligence tools. The dataset was created by Harvard’s newly established Institutional Data Initiative with financial support from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.
About five times the size of the notorious Books3 dataset that was used to train AI models such as Meta’s Llama, the Institutional Data Initiative database spans genres, decades, and languages, including classics from Shakespeare, Charles Dickens, and Dante alongside obscure Czech mathematics textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small AI industry players and individual researchers, access to the kinds of highly refined and curated content repositories that normally only established technology giants can assemble. “It’s gone through rigorous review,” he says.
Leppert believes the new public domain database could be used in conjunction with other licensed materials to build AI models. “I think of it a bit like Linux becoming the primary operating system for much of the world,” he says, noting that companies would still need to use additional training data to differentiate their models from those of competitors.
Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project is in line with its broader beliefs about the value of creating “pools of accessible data” that AI startups can tap into and that are “managed in the public interest.” In other words, Microsoft doesn’t necessarily plan to replace the AI training data it has used in its own models with public domain alternatives like the books in Harvard’s new database. “We use publicly available data to train our models,” Davis says.
As dozens of lawsuits over the use of copyrighted data to train AI wind their way through the courts, the future of how AI tools are built hangs in the balance. If AI companies win their cases, they will be able to continue scraping the web without having to enter into licensing agreements with copyright holders. But if they lose, AI companies may be forced to change the way they build their models. The wave of projects like Harvard’s database is proceeding on the assumption that, no matter what happens, there will be an appetite for public domain datasets.
In addition to the book collection, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of public domain newspaper articles, and says it is open to similar collaborations in the future. Exactly how the book dataset will be made available has not yet been determined. The Institutional Data Initiative has asked Google to collaborate on public distribution, but the search giant has not yet publicly committed to hosting it, although Harvard says it expects Google to do so. (Google did not respond to WIRED’s requests for comment.)