February 7th, 2025

Announcing the data.gov archive

The Library Innovation Lab launched the 16TB Data.gov Archive with over 311,000 datasets, aiming to preserve public data for research and policymaking, supported by the Filecoin Foundation for the Decentralized Web and the Rockefeller Brothers Fund.


The Library Innovation Lab has announced the launch of the Data.gov Archive on Source Cooperative. The 16TB collection comprises over 311,000 datasets harvested from data.gov during 2024 and 2025. The archive aims to preserve and authenticate essential public datasets for academic research, policymaking, and public use, and it will be updated daily as new datasets are added to data.gov. The initiative reflects the Lab's commitment to safeguarding government records and ensuring public access to information. The project includes detailed metadata and digital signatures that document the integrity and provenance of the datasets, making them easier for researchers and the public to cite and access. The Lab is also providing open-source software and documentation so that others can replicate its work and build similar repositories. The initiative builds on previous projects such as the Perma.cc web archiving tool and the Caselaw Access Project, and it is supported by the Filecoin Foundation for the Decentralized Web and the Rockefeller Brothers Fund. The Lab invites suggestions and collaboration for future releases via its contact email.

- The Data.gov Archive contains over 311,000 datasets and is 16TB in size.

- The archive will be updated daily with new datasets from data.gov.

- The project aims to preserve public datasets for research and policymaking.

- Open-source software and documentation will be provided for replication of the project.

- The initiative is supported by the Filecoin Foundation for the Decentralized Web and the Rockefeller Brothers Fund.

Internet Archive and Library and Archives Canada Launches Digitization Project

Internet Archive Canada and Library and Archives Canada collaborate to digitize 100,000 historical publications from the 1200s to 1920. The project aims to enhance access to information, benefiting researchers and the public.

Internet Archive starts backing up digital books on paper

The Internet Archive is backing up digital books by storing physical copies of 10 million works in climate-controlled conditions, addressing digital storage reliability concerns and seeking contributions from libraries and collectors.

We're losing our digital history. Can the Internet Archive save it?

The Internet Archive has preserved 866 billion web pages, but faces financial instability and legal challenges. Its Wayback Machine is crucial for accessing historical content, despite ongoing risks to its operations.

Archive Team

Archive Team, active since 2009, preserves digital heritage by archiving content from at-risk platforms like Telegram and YouTube, while encouraging public participation and providing resources for data management.

Harvard Is Releasing a Free AI Training Dataset

Harvard University is releasing a dataset of nearly 1 million public-domain books for AI training, funded by Microsoft and OpenAI, to promote equitable access amid ongoing legal challenges regarding copyrighted materials.

11 comments

By @black_puppydog - 2 months

Great to see there's some resistance. What I'm missing from this announcement though is any mention of how they intend to secure this "vault" against the current government. I'm assuming good intentions on the part of Harvard, but keeping this data online against the express will of the government is gonna cost (political) capital. And from what I can see, the archive is hosted by US entities on US-controlled servers on US soil?

This is the same thing that's been bothering me with archive.org lately, by the way. I haven't found a good way to simply (for some reasonable definition of "simple") contribute 10 TiB or so of redundant storage on my (European) home server either. That kind of thing might (have to) serve to ensure tamper-resistance for that data, given the current political climate on both sides of the pond. Any pointers welcome.

By @cyberlimerence - 2 months

Is anyone out there archiving USGS/NOAA datasets? It sounds ridiculous, but this appears to be where we are now. There is a submission about NOAA on the front page now: "Scientists on alert as NOAA restricts contact with foreign nationals" [1]

[1] https://news.ycombinator.com/item?id=42970814

By @Rebuff5007 - 2 months

I find it amusing that the might of the American government -- in trying to take a bunch of data offline -- is being resisted by a digital "militia" of hobbyist archivers and nonprofits.

There's something about this that just rings Second Amendment. Personally I think the concept of civilians having weapons to be a check on a nation state is absurd, but in this case it feels pretty empowering.

By @fnands - 2 months

Does this include all USGS data?

This is a topic that came up at work today as we rely on this data and are considering backing up most of the Lidar data from there ourselves (100s of TB probably)

EDIT: no, looks like it is only the footprints

By @mindcrime - 2 months

Very happy this is happening. There's a ridiculous amount of incredibly valuable data, scientific documents, etc. "out there" that are at risk.

I haven't had much time to look at this yet and see what all is there, but whether currently included or not, a couple of things I really hope get archived are the contents of the DTIC (Defense Technical Information Center) document repository (lots of really interesting older scientific publications) and the NASA TRS (Technical Report Server).

I'm working on my own archive of at least some portion of the DTIC stuff just to be on the safe side. So far everything I've tried to access is still there, but who knows how long that will last.

By @fredoliveira - 2 months

Honestly a shame it has to come to this. Sure, people elected this administration and I guess with that comes a bunch of things I disagree with. But the removal of years of scientific research and data from the web (paid for by citizens with their taxes) is absolutely unacceptable. Ravaging CDC data, climate data, etc. is horrendous and unforgivable.

By @frontalier - 2 months

this archive is going to disappear before the summer comes

https://youtu.be/5RpPTRcz1no?t=1511

By @govideo - 2 months

From the post: Today we released our archive of data.gov on Source Cooperative. The 16TB collection includes over 311,000 datasets harvested during 2024 and 2025, a complete archive of federal public datasets linked by data.gov. It will be updated daily as new datasets are added to data.gov. This is the first release in our new data vault project to preserve and authenticate vital public datasets for academic research, policymaking, and public use.
