Archivists work to save disappearing Data.gov datasets
Since Donald Trump's inauguration, over 2,000 datasets have been removed from Data.gov, raising concerns about data loss, particularly related to climate change, and prompting calls for better preservation systems.
Since Donald Trump's inauguration, over 2,000 datasets have been removed from Data.gov, the primary repository for U.S. government open data. This decline has raised concerns among archivists and researchers about the potential loss of critical information, particularly related to climate change and diversity. The deletions appear to have occurred shortly after the inauguration, with many datasets linked to agencies like the Department of Energy and NOAA. However, it remains unclear whether the data has been permanently deleted or simply relocated to other government websites. Researchers, including Jack Cushman from Harvard, are working to track these changes and determine the status of the missing datasets. The complexity arises from Data.gov's role as an aggregator, which means it does not always host the data itself, complicating archiving efforts. Some datasets have been found on agency websites, while others seem to have vanished entirely. The situation is further complicated by the Trump administration's policies, which have been criticized for targeting climate and equity-related data. Archivists emphasize the need for a more robust system to preserve government data, as the current reliance on a single aggregator like Data.gov poses risks for data accessibility and preservation.
- Over 2,000 datasets have disappeared from Data.gov since Trump's inauguration.
- Many deletions are linked to agencies focused on climate and environmental data.
- The status of missing datasets is unclear; some may have been relocated rather than deleted.
- The Trump administration's policies have been criticized for targeting specific types of data.
- Archivists call for improved systems to preserve government data beyond a single aggregator.
Related
PSA: Internet Archive "glitch" deletes years of user data and accounts
A glitch at the Internet Archive deleted numerous user accounts and data, affecting many users. The organization has not addressed the issue, leading to frustration and concerns about data reliability.
List of Government Data Sites
Governments globally publish open data through official websites, promoting transparency. The article lists various data portals, including those from supranational organizations, and encourages contributions for completeness.
Crap Data Everywhere
Gerry McGovern discusses the issue of "crap data," highlighting its environmental impact, organizational inefficiency, and the compromised quality of AI training data due to the accumulation of unnecessary information.
NASA moves swiftly to end DEI programs, ask employees to "report" violations
NASA is terminating DEIA programs following Trump's executive orders, urging employees to report related activities. Other federal agencies are also dismantling similar initiatives, potentially disrupting critical research and public health efforts.
- Many commenters emphasize the importance of preserving government data, particularly related to climate change and transparency.
- There are calls for better archiving methods and tools to track what datasets are being removed or altered.
- Some users express skepticism about the motivations behind the data removal, linking it to broader political agendas.
- Several participants discuss the role of volunteers and community efforts in archiving data to ensure its availability.
- Concerns are raised about the integrity and reliability of archived data, especially in the context of potential future manipulation.
There are lots of people making copies of things right now, which is great -- Lots Of Copies Keeps Stuff Safe. It's your data, why not have a copy?
One thing I think we can contribute here as an institution is timestamping and provenance. Our copy of data.gov is made with https://github.com/harvard-lil/bag-nabit , which extends BagIt format to sign archives with email/domain/document certificates. That way (once we have a public endpoint) you can make your own copy with rclone, pass it around, but still verify it hasn't been modified since we made it.
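For context, a plain BagIt bag records a per-file checksum manifest; bag-nabit's certificate-based signatures sit on top of that base layer. A minimal sketch of verifying just the base manifest (a hypothetical helper, not bag-nabit's actual API) might look like:

```python
import hashlib
from pathlib import Path

def verify_bag_payload(bag_dir):
    """Check every entry in a bag's manifest-sha256.txt against the files
    on disk. Returns a list of paths that are missing or modified.

    This covers only the base BagIt payload manifest; the email/domain/
    document certificate signatures that bag-nabit adds are a separate
    verification layer on top.
    """
    bag = Path(bag_dir)
    mismatches = []
    for line in (bag / "manifest-sha256.txt").read_text().splitlines():
        if not line.strip():
            continue
        expected, relpath = line.split(maxsplit=1)
        try:
            actual = hashlib.sha256((bag / relpath).read_bytes()).hexdigest()
        except FileNotFoundError:
            mismatches.append(relpath)
            continue
        if actual != expected:
            mismatches.append(relpath)
    return mismatches
```

The point of the signature layer on top is that anyone holding a copy can re-run this kind of check and then verify the manifest itself against the archiver's certificate, so the copy can be passed around freely without losing provenance.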
Some open questions we'd love help on --
* One is that it's hard to tell what's disappearing and what's just moving. If you do a raw comparison of snapshots, there's things like 2011-glass-buttes-exploration-and-drilling-535cf being replaced by 2011-glass-buttes-exploration-and-drilling-236cf, but it's still exactly the same data; it's a rename rather than a delete and add. We need some data munging to work out what's actually changing.
* Another is how to find the most valuable things to preserve that aren't directly linked from the catalog. If a data.gov entry links to a csv, we have it. If it links to an html landing page, we have the landing page. It would be great to do some analysis to figure out the most valuable stuff behind the landing pages.
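The rename-vs-delete problem in the first question can be sketched as a content-fingerprint diff. This is a hypothetical approach, assuming each snapshot has already been reduced to a mapping from dataset identifier (the data.gov slug) to a checksum of its underlying files:

```python
def diff_snapshots(before, after):
    """Classify catalog changes between two snapshots.

    `before` and `after` map dataset identifiers to a content
    fingerprint (e.g. a checksum of the dataset's files). An identifier
    that vanished but whose fingerprint reappears under a new identifier
    is a rename, not a true deletion.
    """
    removed_ids = before.keys() - after.keys()
    added_ids = after.keys() - before.keys()
    added_by_hash = {after[i]: i for i in added_ids}
    renamed, deleted = {}, set()
    for i in removed_ids:
        if before[i] in added_by_hash:
            renamed[i] = added_by_hash[before[i]]  # same bytes, new slug
        else:
            deleted.add(i)
    truly_added = added_ids - set(renamed.values())
    return renamed, deleted, truly_added
```

Under this scheme the 535cf-to-236cf case above falls out as a rename, and only the identifiers whose content fingerprints genuinely disappear get flagged for urgent preservation.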
Below are some sites of contractors that work with USAID:
- https://www.edu-links.org/ (taken down)
- https://www.genderlinks.org/ (taken down)
- https://usaidlearninglab.org/ (taken down)
- https://agrilinks.org/ (presumably at risk)
- https://www.climatelinks.org/ (presumably at risk)
- https://biodiversitylinks.org/ (presumably at risk)
Was there an EO targeting these areas?
CDC data are disappearing - https://news.ycombinator.com/item?id=42897696 - Feb 2025 (216 comments)
Say you have a few TBs of disk space, and you're willing to capture some public datasets (or parts of them) that interest you, and publish them in a friendly jurisdiction - keyed by their MD5/SHA1 - or make them available upon request. I.e. be part of a large open-source storage network, but only for objects/datasets you're willing to store (so there are no illegal shenanigans).
Is this a use case for Torrents? What's the most suitable architecture available today for this?
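The keying scheme described above is just content addressing: derive the storage key from the bytes themselves, so anyone can verify what they fetched. A minimal sketch (assuming you key files individually):

```python
import hashlib

def content_key(path, algo="sha256", chunk=1 << 20):
    """Stream a file and return '<algo>:<hexdigest>' to use as its
    storage key. MD5/SHA1 work too, but SHA-256 is the safer default
    since collisions have been demonstrated for both older algorithms.
    """
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return f"{algo}:{h.hexdigest()}"
```

Torrents fit this model naturally: a torrent's infohash is itself a hash over the content metadata, so peers can verify every piece they receive, and seeding only the datasets you choose gives exactly the opt-in storage network described here. IPFS is the other commonly cited architecture for the same use case.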
Not to say the data isn't incredibly valuable and shouldn't be preserved for many other reasons, of course. All the best to anyone archiving this; it's important work.
The government information crisis is bigger than you think it is - https://news.ycombinator.com/item?id=42895331
As one of the reddit comments (in the thread linked by the article) pointed out,
> During the start of Biden’s term, On 6th feb data.gov had “218,384 DATASETS” but on 7th feb it only had “192,180 DATASETS”
How much of this sort of effort results in that data being used? Are there success stories for these datasets being discoverable enough and useful to others?
Removing and altering information and data is one of the fundamental threats in our digital world.
It probably makes the most sense to do this on a daily basis. If something new is published, grab it as soon as possible.
Data can also be redacted or altered for a variety of reasons, being able to see the before and after states can be illuminating.
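Comparing two archived snapshots of the same file is the standard unified-diff problem; a small sketch using the standard library (hypothetical helper names):

```python
import difflib

def show_changes(before_text, after_text, name="dataset.csv"):
    """Render a unified diff between two archived snapshots of the same
    file, making redactions or silent edits visible line by line."""
    diff = difflib.unified_diff(
        before_text.splitlines(keepends=True),
        after_text.splitlines(keepends=True),
        fromfile=f"{name} (earlier snapshot)",
        tofile=f"{name} (later snapshot)",
    )
    return "".join(diff)
```

With daily captures, running this over consecutive snapshots surfaces exactly which rows or passages were redacted and when, which is the "before and after" visibility the comment describes.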
Something I feel is missing here is statistics for each administration.
Does this only happen under a Trump administration, or does it happen to a smaller or larger extent under other administrations?
I don't know how far back this federal data goes, so it might not be easy.
- Borges
> For example, in the days after Joe Biden was inaugurated, data.gov showed about 1,000 datasets being deleted as compared to a day before his inauguration
It's almost like this stuff happens regularly. If <insert Dem savior> wins in 2028, tons of government websites will also change in the first couple weeks of their presidency. Is it because they're a fascist dictator? Or is it because those websites reflect the administration's viewpoints on issues?
Wish people would take a deep breath and step back and think a little more. I despise Trump, but there's crying wolf and then there's the current state of media and online discourse. Trump thrives in this type of environment. He purposefully fosters it. Playing gotcha with him doesn't work because he doesn't care.