January 30th, 2025

Archivists work to save disappearing Data.gov datasets

Since Donald Trump's inauguration, over 2,000 datasets have been removed from Data.gov, raising concerns about data loss, particularly related to climate change, and prompting calls for better preservation systems.

Since Donald Trump's inauguration, over 2,000 datasets have been removed from Data.gov, the primary repository for U.S. government open data. The drop has raised concerns among archivists and researchers about the potential loss of critical information, particularly related to climate change and diversity. The deletions appear to have occurred shortly after the inauguration, with many datasets linked to agencies like the Department of Energy and NOAA. However, it remains unclear whether the data has been permanently deleted or simply relocated to other government websites. Researchers, including Jack Cushman from Harvard, are working to track these changes and determine the status of the missing datasets. Part of the difficulty is that Data.gov is an aggregator that does not always host the data itself, which complicates archiving efforts. Some datasets have been found on agency websites, while others seem to have vanished entirely. The situation is further complicated by the Trump administration's policies, which have been criticized for targeting climate and equity-related data. Archivists emphasize the need for a more robust system to preserve government data, as the current reliance on a single aggregator like Data.gov poses risks for data accessibility and preservation.

- Over 2,000 datasets have disappeared from Data.gov since Trump's inauguration.

- Many deletions are linked to agencies focused on climate and environmental data.

- The status of missing datasets is unclear; some may have been relocated rather than deleted.

- The Trump administration's policies have been criticized for targeting specific types of data.

- Archivists call for improved systems to preserve government data beyond a single aggregator.

AI: What people are saying
The discussion surrounding the removal of datasets from Data.gov raises several important concerns and themes.
  • Many commenters emphasize the importance of preserving government data, particularly related to climate change and transparency.
  • There are calls for better archiving methods and tools to track what datasets are being removed or altered.
  • Some users express skepticism about the motivations behind the data removal, linking it to broader political agendas.
  • Several participants discuss the role of volunteers and community efforts in archiving data to ensure its availability.
  • Concerns are raised about the integrity and reliability of archived data, especially in the context of potential future manipulation.
29 comments
By @JackC - 2 months
I'm quoted in this article. Happy to discuss what we're working on at the Library Innovation Lab if anyone has questions.

There's lots of people making copies of things right now, which is great -- Lots Of Copies Keeps Stuff Safe. It's your data, why not have a copy?

One thing I think we can contribute here as an institution is timestamping and provenance. Our copy of data.gov is made with https://github.com/harvard-lil/bag-nabit , which extends BagIt format to sign archives with email/domain/document certificates. That way (once we have a public endpoint) you can make your own copy with rclone, pass it around, but still verify it hasn't been modified since we made it.

Some open questions we'd love help on --

* One is that it's hard to tell what's disappearing and what's just moving. If you do a raw comparison of snapshots, there's things like 2011-glass-buttes-exploration-and-drilling-535cf being replaced by 2011-glass-buttes-exploration-and-drilling-236cf, but it's still exactly the same data; it's a rename rather than a delete and add. We need some data munging to work out what's actually changing.

* Another is how to find the most valuable things to preserve that aren't directly linked from the catalog. If a data.gov entry links to a csv, we have it. If it links to an html landing page, we have the landing page. It would be great to do some analysis to figure out the most valuable stuff behind the landing pages.
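The rename-versus-delete problem in the first question can be sketched roughly as follows. This is only an illustration, assuming that each snapshot is available as a set of catalog slugs and that a rename changes nothing but a trailing five-character hex suffix, as in the glass-buttes example. The names `base_slug` and `classify_changes` are made up for this sketch, and a real check would also compare the underlying files before calling something a rename.

```python
import re

# Assumption: a rename only changes a trailing "-xxxxx" hex suffix on the slug,
# as in "...-535cf" becoming "...-236cf".
SUFFIX = re.compile(r"-[0-9a-f]{5}$")


def base_slug(slug: str) -> str:
    """Strip the trailing five-character hex suffix, if there is one."""
    return SUFFIX.sub("", slug)


def classify_changes(old: set[str], new: set[str]) -> dict[str, list]:
    """Split the difference between two snapshots into renamed, deleted, added."""
    removed = old - new
    added = new - old

    # Index newly appearing slugs by their suffix-free form.
    added_by_base: dict[str, list[str]] = {}
    for slug in sorted(added):
        added_by_base.setdefault(base_slug(slug), []).append(slug)

    renamed, deleted = [], []
    for slug in sorted(removed):
        candidates = added_by_base.get(base_slug(slug))
        if candidates:
            # Pair each removed slug with at most one new slug.
            renamed.append((slug, candidates.pop(0)))
        else:
            deleted.append(slug)

    renamed_targets = {new_slug for _, new_slug in renamed}
    return {
        "renamed": renamed,
        "deleted": deleted,
        "added": sorted(added - renamed_targets),
    }
```

Matching on the slug alone will produce false positives when two different datasets share a base name, so the output is better treated as a list of candidates to confirm by content hash than as a final answer.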

By @0n0n0m0uz - 2 months
One of the USA's greatest strengths is the almost unprecedented degree of transparency of government records going back decades. We can actually see the true facts, including when our government has lied to us or covered things up. Many other nations do not have this luxury, and it has provided the evidentiary basis for both legal cases and "progress" in general. Not surprising that authoritarians would target and destroy data, as it makes their objective of a post-truth society that much easier.
By @chrishoyle - 2 months
Beyond federal websites (.gov, .mil) there are a lot of government contractor websites being taken down (presumably at the demand of agencies) that contain a wealth of information and years of project research.

Some examples below, from contractors that work with USAID:

- https://www.edu-links.org/ (taken down)

- https://www.genderlinks.org/ (taken down)

- https://usaidlearninglab.org/ (taken down)

- https://agrilinks.org/ (presumably at risk)

- https://www.climatelinks.org/ (presumably at risk)

- https://biodiversitylinks.org/ (presumably at risk)

By @cle - 2 months
I’ve been archiving data.gov for over a year now and it’s not unusual to see large fluctuations on the order of hundreds or thousands of datasets. I’ve never bothered trying to figure out what exactly is changing, maybe I should build a tool for that…
By @jl6 - 2 months
> The outlet reports that deleted datasets "disproportionately" come from environmental science agencies like the Department of Energy, National Oceanic and Atmospheric Administration (NOAA), and the Environmental Protection Agency (EPA).

Was there an EO targeting these areas?

By @dang - 2 months
Related ongoing thread:

CDC data are disappearing - https://news.ycombinator.com/item?id=42897696 - Feb 2025 (216 comments)

By @eh_why_not - 2 months
What's a good way to be an "Archivist" on a low budget these days?

Say you have a few TBs of disk space, and you're willing to capture some public datasets (or parts of them) that interest you, and publish them in a friendly jurisdiction - keyed by their MD5/SHA1 - or make them available upon request. I.e. be part of a large open-source storage network, but only for objects/datasets you're willing to store (so there are no illegal shenanigans).

Is this a use case for Torrents? What's the most suitable architecture available today for this?
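One way to picture the "keyed by their MD5/SHA1" idea is a small content-addressed store on local disk. The sketch below is a minimal illustration and not a recommendation of any particular network or tool. It swaps in SHA-256 for MD5/SHA-1, since those two are no longer collision-resistant, and the names `sha256_of`, `store` and the two-character directory fan-out are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(path: Path, store_dir: Path) -> Path:
    """Copy a file into store_dir under a name derived from its content hash."""
    key = sha256_of(path)
    # Hypothetical layout: the first two hex characters form a subdirectory.
    dest = store_dir / key[:2] / key
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():  # identical content is only stored once
        shutil.copyfile(path, dest)
    return dest
```

Because the key is derived from the bytes themselves, anyone who receives a copy can recompute the hash and confirm they got the object they asked for, whichever mirror served it.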

By @crowcroft - 2 months
Still, even with best efforts this is such a shame. There is always going to be a question around governance over the data, integrity, and potentially chain of custody as well. If the goal is to muddy the waters and create a narrative that whatever might be in this data isn't reliable or accurate then mission accomplished. I don't see how anything can stop that.

Not to say the data isn't incredibly valuable and should be preserved for many other reasons of course. All the best to anyone archiving this, this is important work.

By @chrishoyle - 2 months
Related ongoing discussion

The government information crisis is bigger than you think it is - https://news.ycombinator.com/item?id=42895331

By @debeloo - 2 months
Is this normal when there's a change in presidency?
By @smrtinsert - 2 months
Are datasets mirrored anywhere where the govt doesn't automatically have a take down authority? If not there should be a mirroring effort.
By @sunk1st - 2 months
I don’t see a list of the datasets that have gone missing. Is there a list?
By @derektank - 2 months
Does anyone know if the St Louis Federal Reserve (and I guess the federal reserve banks generally) is subject to presidential executive orders or is it entirely responsible to the Federal Reserve Board and the St. Louis Bank president? FRED is the only dataset I access regularly
By @generalizations - 2 months
Do we know what datasets these are? Do we actually have a diff here so we know what's been removed? There's a lot of assumptions being thrown around here, but we don't even know if this is some kind of malicious compliance. An actual list of what's been removed would probably clear the air a lot.

As one of the reddit comments (in the thread linked by the article) pointed out,

> During the start of Biden’s term, On 6th feb data.gov had “218,384 DATASETS” but on 7th feb it only had “192,180 DATASETS”

By @choobacker - 2 months
It's impressive that volunteers are stepping up to archive this. I understand the desire to keep this open data available.

How much of this sort of effort results in that data being used? Are there success stories for these datasets being discoverable enough and useful to others?

By @andyjohnson0 - 2 months
If the intention is to restore these data sets at some future date, when sanity has possibly been restored, then there needs to be a way to demonstrate that the archived data hasn't itself been modified. Without that, malign actors (e.g. oil/gas lobby) could very easily poison the future.
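A basic form of that check is a fixity test against a checksum manifest. The sketch below assumes a BagIt-style layout, with a `manifest-sha256.txt` file whose lines each hold a hex digest followed by a relative path. The function name `verify_manifest` is made up for the example. It only shows whether files still match the manifest; proving that the manifest itself is the original one needs a signature or trusted timestamp on top, which this sketch does not attempt.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir: Path, manifest_name: str = "manifest-sha256.txt") -> list[str]:
    """Return a list of problems found; an empty list means every file matched."""
    problems = []
    manifest = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        # Each line is "<hex digest> <relative path>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        # Reads the whole file into memory; fine for a sketch, not for huge files.
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected.lower():
            problems.append(f"modified: {rel_path}")
    return problems
```

Running this against a copy obtained from a third party shows whether any payload file has changed or gone missing relative to the manifest that came with it.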
By @liontwist - 2 months
I think people are interested in archiving and the political image associated with that but I don’t think anybody cares about the content. Who is going to go back and read Biden era agency publications?
By @downrightmike - 2 months
Already seeing: 404 Not Found: Requested route ('ed-public-download.app.cloud.gov') does not exist.
By @pluto_modadic - 2 months
Don't they have to have done this /before/ it gets deleted?
By @ThinkBeat - 2 months
I hope volunteers and others are able to save as much as possible of the data.

Removing and altering of the information and data is one of the fundamental threats in our digital world.

It probably makes the most sense to do this on a daily basis. If something new is published, grab it as soon as possible.

Data can also be redacted or altered for a variety of reasons; being able to see the before and after states can be illuminating.

Something I feel is missing here are statistics for each administration.

Does this only happen under a Trump administration, or does it happen to smaller or larger extent under other administrations?

I don't know how far back this federal data goes, so it might not be easy.

By @bawolff - 2 months
Tbh, I'm kind of surprised these things weren't being archived as they were being published. Trump is an extreme case, but it's not the first time a change in administration resulted in removing old websites.
By @notavalleyman - 2 months
I read, in past days, that the man who ordered the construction of the nearly infinite Wall of China was that First Emperor, Shih Huang Ti, who likewise ordered the burning of all the books before him. That the two gigantic operations - the five or six hundred leagues of stone to oppose the barbarians, the rigorous abolition of history, that is of the past - issued from one person and were in a certain sense his attributes, inexplicably satisfied me and, at the same time, disturbed me.

- Borges

By @honestSysAdmin - 2 months
Let's make torrents and seed them.
By @exe34 - 2 months
First week we had mass deportation, second week we've heard of the building of concentration camps for undesirables, and now the modern version of book burning. There's something different about this republican government.
By @strictnein - 2 months
99.9% of commenters here seem to have missed this:

> For example, in the days after Joe Biden was inaugurated, data.gov showed about 1,000 datasets being deleted as compared to a day before his inauguration

It's almost like this stuff happens regularly. If <insert Dem savior> wins in 2028, tons of government websites will also change in the first couple weeks of their presidency. Is it because they're a fascist dictator? Or is it because those websites reflect the administration's viewpoints on issues?

Wish people would take a deep breath and step back and think a little more. I despise Trump, but there's crying wolf and then there's the current state of media and online discourse. Trump thrives in this type of environment. He purposefully fosters it. Playing gotcha with him doesn't work because he doesn't care.