Harvard and Google Collaborate to Release 1 Million Public Domain Books for AI Training

December 13, 2024 – Harvard University and Google have announced the joint release of a dataset of 1 million public domain books for AI training, TechCrunch reported on December 12.

The data required for AI training is often prohibitively expensive, which favors well-funded technology companies. In response, Harvard University plans to release a dataset of approximately 1 million public domain books. The collection spans a wide range of genres, languages, and authors, including Dickens, Dante, and Shakespeare, whose works are no longer protected by copyright.

The exact method and timing of the release remain unknown, and the dataset is not yet publicly available. It has been revealed, however, that the material originates from Google Books, a long-running project of the tech giant, so Google will be involved in distributing this “valuable asset.”

In March of this year, Harvard was reported to have hinted at its Institutional Data Initiative (IDI), which aims to provide AI with a “trusted channel for legitimate data.” Only after its official launch did the program acknowledge financial backing from Microsoft and OpenAI.

Greg Leppert, Executive Director of IDI, explained that the dataset is intended to “level the playing field” by giving a variety of organizations, including research institutes and AI startups, access to this vast repository of books to help them train large language models.
