Harvard Is Releasing a Free AI Training Dataset
Harvard University is releasing a dataset of nearly 1 million public-domain books for AI training, funded by Microsoft and OpenAI, to promote equitable access amid ongoing legal challenges regarding copyrighted materials.
Harvard University has announced the release of a substantial dataset comprising nearly 1 million public-domain books, aimed at facilitating the training of AI models. This initiative, led by Harvard's Institutional Data Initiative and funded by Microsoft and OpenAI, seeks to democratize access to high-quality training data, which is typically dominated by large tech companies. The dataset, significantly larger than the controversial Books3 dataset, includes a diverse range of literature from various genres and languages, featuring works from renowned authors like Shakespeare and Dickens. Greg Leppert, the project's director, emphasizes the importance of this resource in leveling the playing field for smaller AI developers and researchers. Microsoft and OpenAI have expressed their support for the project, highlighting the need for accessible data pools for AI startups. As legal challenges regarding the use of copyrighted materials for AI training continue, the Harvard dataset represents a proactive step towards ensuring that public domain resources are available for AI development. The Institutional Data Initiative is also collaborating with the Boston Public Library to digitize additional public domain articles. The dataset's release method is still under discussion, with Google being approached for assistance in distribution. This initiative aligns with a growing trend of creating public domain datasets, which could reshape the landscape of AI training and reduce reliance on copyrighted materials.
- Harvard is releasing a dataset of nearly 1 million public-domain books for AI training.
- The project is funded by Microsoft and OpenAI to promote equitable access to training data.
- The dataset is significantly larger than the controversial Books3 dataset.
- Legal challenges regarding AI training data usage are ongoing, making this release timely.
- The initiative includes collaboration with the Boston Public Library for additional public domain content.
Related
OpenAI pleads it can't make money without using copyrighted material for free
OpenAI has asked the British Parliament to permit the use of copyrighted material for AI training while facing legal challenges from the NYT and the Authors Guild over alleged copyright infringement. The debate affects both AI development and copyright protection, raising concerns for content creators.
The Data That Powers A.I. Is Disappearing Fast
A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and companies like OpenAI, Google, and Meta. Challenges prompt exploration of new data access tools and alternative training methods.
NYT: The Data That Powers AI Is Disappearing Fast
A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and researchers. Companies explore partnerships and new tools amid data challenges.
Has your paper been used to train an AI model? Almost certainly
Academic publishers are selling research papers to AI firms, raising copyright concerns. Major deals include Taylor & Francis with Microsoft and Wiley with another company, prompting legal disputes and researcher frustrations.
OpenAI Pleads It Can't Make Money Without Using Copyrighted Materials for Free
OpenAI has requested permission from the British Parliament to use copyrighted materials for AI training, arguing it's essential for developing effective models, despite facing legal challenges and industry skepticism.
^ this is pretty cool and interesting. The collaboration they're doing with Boston Public Library to make articles similarly accessible also sounds pretty exciting.
More color from Harvard