The Data That Powers A.I. Is Disappearing Fast
A study highlights a decline in the data available for training A.I. models as web sources restrict its use, affecting A.I. developers and companies like OpenAI, Google, and Meta. The restrictions are prompting exploration of new data access tools and alternative training methods.
A recent study by the Data Provenance Initiative reveals a significant decline in the availability of data used to train artificial intelligence (A.I.) models. The study found that key web sources used for A.I. training have restricted the use of their data, leading to an "emerging crisis in consent." Approximately 5% of the data in commonly used A.I. training sets has been restricted, and in some sets up to 45% of the data is limited by websites' terms of service. This trend poses challenges for A.I. developers, researchers, and noncommercial entities that rely on public data sets. Companies like OpenAI, Google, and Meta have faced obstacles in gathering high-quality data, prompting some to seek partnerships with publishers for ongoing data access. The study underscores the need for new tools that give website owners more control over how their data is used. As A.I. companies work around data restrictions and turn to alternative training methods such as synthetic data, the industry faces uncertainty about the future availability and quality of training data.