October 20th, 2024

Crap Data Everywhere

Gerry McGovern discusses the issue of "crap data," highlighting its environmental impact, organizational inefficiency, and the compromised quality of AI training data due to the accumulation of unnecessary information.

Gerry McGovern highlights the pervasive issue of "crap data," emphasizing its detrimental impact on the environment and organizational efficiency. He argues that the digital age has led to an explosion of unnecessary data, with trillions of photos and videos being stored, most of which will never be accessed again. McGovern cites statistics showing that a significant portion of organizational data is never utilized, with examples such as Kyndryl deleting 90% of its data after a cleanup and various organizations having vast amounts of web pages that receive little to no traffic. He points out that many organizations lack awareness of their data inventory, with a substantial amount of data stored on servers that management does not even know exists. The rise of cloud storage has exacerbated the problem, as the low cost of storage encourages the accumulation of unnecessary data rather than its management. McGovern warns that this "crap data" is what artificial intelligence is being trained on, raising concerns about the quality and reliability of AI outputs. He calls for a reevaluation of data management practices to mitigate environmental harm and improve organizational effectiveness.

- The digital age has led to an explosion of unnecessary data, harming the environment.

- A significant portion of organizational data is never accessed or utilized.

- Many organizations are unaware of the extent and location of their data.

- Cloud storage has made the accumulation of "crap data" more prevalent.

- The quality of AI training data is compromised by the prevalence of low-quality data.

3 comments
By @jmatthews - 6 months
I've read the couple of critical responses but on merit, the message is true. Anecdotally, my wife takes hundreds of photos every month that essentially no one will ever see again.

Android has a memories feature that serves them back up to us on occasion. This is a pattern writ large for huge swaths of data.

Differences in governance or allowable access lead to mass duplication and data rot on anything remotely dynamic.

By @chmaynard - 6 months
McGovern appears to be a pundit trying to make a nice living selling his critique of stupid data retention practices.

By @reneberlin - 6 months
From a business perspective, that calls for a deduper at cloud scale (or a self-deduping fs).