**November 10, 2023 –** OpenAI has announced an initiative to collaborate with organizations on producing public and private datasets for training AI models. The data partnership aims to “enable more organizations to contribute to shaping the future of AI” and to “benefit from more impactful models.”
In a blog post, OpenAI emphasizes that training AI models on extensive datasets is important to making them safer and more beneficial to humanity as a whole. OpenAI envisions AI models with a deep understanding of a wide range of topics, industries, cultures, and languages, which requires diverse and comprehensive training datasets.
As part of its Data Partnerships program, OpenAI plans to curate “large-scale” datasets that reflect various aspects of human society and are not currently easy to access online. The initiative covers multiple modalities, including text, images, audio, and video, but OpenAI is particularly seeking data that expresses human intent across different languages, subjects, and formats, such as long-form writing and dialogues.
OpenAI says that, where necessary, it will work with organizations to digitize training data using optical character recognition (OCR) and automatic speech recognition (ASR) tools, and to remove sensitive or personal information as needed.
The organization aims to develop two types of datasets: a publicly available open-source dataset that anyone can use for AI model training and a set of private datasets intended for training proprietary AI models.
OpenAI states that private datasets are suited to organizations that want to keep their data confidential while having OpenAI's models gain a better understanding of their specific domains. To date, OpenAI has collaborated with the Icelandic government and Miðeind ehf to improve GPT-4’s proficiency in the Icelandic language. A partnership with the Free Law Project has likewise contributed to improving the model’s comprehension of legal documents. Together, these collaborations mark a step toward a more inclusive and knowledgeable landscape for AI development.