January 10, 2024 – OpenAI, the developer of ChatGPT, has acknowledged that AI tools like ChatGPT could not have been built without copyright-protected material. According to a report by The Telegraph, the company made the statement in a document submitted to the UK House of Lords Communications and Digital Select Committee, which is currently investigating large language models.
The capabilities of AI models such as ChatGPT and the image generator DALL-E stem largely from training on vast amounts of content, some of which is scraped from publicly accessible online sources, not always with the permission of copyright holders (although OpenAI does license certain training content). This “free-spirited” approach to content scraping has long been part of academic machine learning research, but as deep learning models move toward commercialization, the practice has come under increasing scrutiny.
In its submission to the House of Lords, OpenAI wrote, “Since copyright currently covers almost all forms of human expression, including blog posts, photos, forum posts, software code snippets, and government documents, it would be impossible to train today’s leading AI models without using copyright-protected content.”
Furthermore, OpenAI stated that limiting training data to public domain books and images from “a century ago” would not yield AI systems that are “capable of meeting the needs of contemporary citizens.”
In December 2023, The New York Times filed a lawsuit against OpenAI and its major investor Microsoft, alleging that they had unlawfully used the newspaper’s content in their products without permission. On Monday, OpenAI responded to the lawsuit on its website, calling it meritless and reiterating its support for journalism and its collaborative relationships with news organizations.
OpenAI’s defense primarily relies on the legal principle of “fair use,” which allows for the limited use of copyright-protected content without the owner’s permission in certain circumstances. The company maintains that copyright law does not prohibit the use of such material for training AI models.
“The use of publicly available internet material to train AI models is fair use, supported by long-standing and widely accepted precedent,” OpenAI wrote in a blog post published on Monday. “We believe this principle is fair to creators, necessary for innovators, and crucial for America’s competitiveness.”
This is not the first time OpenAI has invoked fair use to defend its AI training data. In August 2023, responding to a copyright lawsuit filed by comedian Sarah Silverman, OpenAI defended its use of publicly available material on similar grounds. The company claimed that Silverman had a “misunderstanding” of the scope of copyright and had failed to account for limitations and exceptions such as “fair use,” which leave the necessary room for cutting-edge AI innovations like large language models.