Announcing the data.gov archive
The Library Innovation Lab launched the 16TB Data.gov Archive with over 311,000 datasets, aiming to preserve public data for research and policymaking, supported by the Filecoin Foundation and Rockefeller Brothers Fund.
The Library Innovation Lab has announced the launch of the Data.gov Archive on Source Cooperative, a 16TB collection of over 311,000 datasets harvested from data.gov during 2024 and 2025. The archive aims to preserve and authenticate essential public datasets for academic research, policymaking, and public use, and it is updated daily as new datasets are added. The initiative reflects the Lab's commitment to safeguarding government records and ensuring public access to information. The project includes detailed metadata and digital signatures to support the integrity and provenance of the datasets, making them easier for researchers and the public to cite and access. The Lab is also providing open-source software and documentation so that others can replicate its work and create similar repositories. The initiative builds on previous projects such as the Perma.cc web archiving tool and the Caselaw Access Project, and it is supported by the Filecoin Foundation for the Decentralized Web and the Rockefeller Brothers Fund. The Lab encourages suggestions and collaboration for future releases via its contact email.
- The Data.gov Archive contains over 311,000 datasets and is 16TB in size.
- The archive will be updated daily with new datasets from data.gov.
- The project aims to preserve public datasets for research and policymaking.
- Open-source software and documentation will be provided for replication of the project.
- The initiative is supported by the Filecoin Foundation and the Rockefeller Brothers Fund.
Related
Internet Archive and Library and Archives Canada Launches Digitization Project
Internet Archive Canada and Library and Archives Canada collaborate to digitize 100,000 historical publications from the 1200s to 1920. The project aims to enhance access to information, benefiting researchers and the public.
Internet Archive starts backing up digital books on paper
The Internet Archive is backing up digital books by storing physical copies of 10 million works in climate-controlled conditions, addressing digital storage reliability concerns and seeking contributions from libraries and collectors.
We're losing our digital history. Can the Internet Archive save it?
The Internet Archive has preserved 866 billion web pages, but faces financial instability and legal challenges. Its Wayback Machine is crucial for accessing historical content, despite ongoing risks to its operations.
Archive Team
Archive Team, active since 2009, preserves digital heritage by archiving content from at-risk platforms like Telegram and YouTube, while encouraging public participation and providing resources for data management.
Harvard Is Releasing a Free AI Training Dataset
Harvard University is releasing a dataset of nearly 1 million public-domain books for AI training, funded by Microsoft and OpenAI, to promote equitable access amid ongoing legal challenges regarding copyrighted materials.
This is the same thing that's been bothering me with archive.org lately, by the way. I haven't found a good way to simply (for some reasonable definition of "simple") contribute 10 TiB or so of redundant storage on my (European) home server either. That kind of thing might (have to) serve to ensure tamper-resistance for that data, given the current political climate on both sides of the pond. Any pointers welcome.
There's something about this that just rings of the Second Amendment. Personally, I think the concept of civilians having weapons to be a check on a nation state is absurd, but in this case it feels pretty empowering.
This is a topic that came up at work today, as we rely on this data and are considering backing up most of the Lidar data from there ourselves (probably 100s of TB).
EDIT: no, looks like it is only the footprints
I haven't had much time to look at this yet and see what all is there, but whether currently included or not, a couple of things I really hope get archived are the contents of the DTIC (Defense Technical Information Center) document repository (lots of really interesting older scientific publications) and the NASA TRS (Technical Report Server).
I'm working on my own archive of at least some portion of the DTIC stuff just to be on the safe side. So far everything I've tried to access is still there, but who knows how long that will last.