September 18th, 2024

We're losing our digital history. Can the Internet Archive save it?

The Internet Archive has preserved 866 billion web pages, but financial instability and legal challenges threaten its operations, despite the vital role its Wayback Machine plays in preserving digital history.


The Internet Archive, a non-profit organization founded in 1996, is at the forefront of efforts to preserve digital history as a significant portion of web content is disappearing. Research indicates that 25% of web pages created between 2013 and 2023 have vanished, with older pages being more susceptible to loss. The Internet Archive has amassed an extensive collection, including 866 billion web pages and millions of books and videos, serving as a crucial resource for future historians. However, the organization faces numerous challenges, including financial instability, legal battles over copyright issues, and cyber threats. Recent court rulings have restricted its ability to lend digital copies of books, and ongoing lawsuits could jeopardize its operations. Despite these hurdles, the Internet Archive's Wayback Machine continues to provide access to archived web pages, helping to mitigate the loss of digital content. Other organizations, like the Library of Congress and the UK Web Archive, also contribute to digital preservation, but their efforts are limited compared to the comprehensive approach of the Internet Archive. As reliance on this resource grows, so do the risks associated with its sustainability, highlighting the fragility of our digital heritage.

- The Internet Archive has preserved 866 billion web pages, but 25% of web content from 2013-2023 has disappeared.

- Legal challenges and financial instability threaten the Internet Archive's operations and its ability to lend digital copies.

- The Wayback Machine is a vital tool for accessing archived web pages, helping to preserve digital history (a short sketch of how to check it for a snapshot follows this list).

- Other organizations contribute to digital preservation, but their efforts are not as extensive as those of the Internet Archive.

- Cyber threats and technical challenges pose ongoing risks to the preservation of digital content.
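
For anyone curious whether a particular page already has a snapshot, the Wayback Machine exposes a public availability endpoint. The sketch below queries it for a placeholder URL; the target address is illustrative, not something from the article.

```python
# Minimal sketch: ask the Wayback Machine whether a snapshot of a URL exists.
# The target URL below is a placeholder; swap in the page you care about.
import requests

target = "https://example.com/some-vanished-page"
resp = requests.get("https://archive.org/wayback/available",
                    params={"url": target}, timeout=30)
snapshot = resp.json().get("archived_snapshots", {}).get("closest")

if snapshot and snapshot.get("available"):
    print("Archived copy:", snapshot["url"], "captured", snapshot["timestamp"])
else:
    print("No archived copy found for", target)
```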

4 comments
By @daniel31x13 - 7 months
> Research shows 25% of web pages posted between 2013 and 2023 have vanished.

I’ve personally been working on a project over the past year that addresses this exact issue: https://linkwarden.app

An open-source [1] bookmarking tool to collect, organize, and preserve content on the internet.

[1]: https://github.com/linkwarden/linkwarden

By @geye1234 - 7 months
I've been trying to download various blogs on blogspot.com and wordpress.com, as well as a couple that now exist only on archive.org, using Linux CLI tools. I cannot make it work. Everything either misses CSS, follows links to the wrong depth, stops arbitrarily, or has some other problem.

If I had a couple of days to devote to it entirely, I think I could make it work, but I've had to be sporadic, although it's cost me a ton of time cumulatively. I've tried wget, httrack, and a couple of other more obscure tools -- all with various options and parameters of course.

One issue is that blog info is duplicated -- you might get domainname.com/article/article.html; domainname.com/page/1; and domainname.com/2015/10/01; all of which contain the same links. Could there be some vicious circularity taking place, causing the downloader to be confused about what it's done and what it has yet to do? I wouldn't think so, but static, non-blog pages are obviously much simpler than blogs.

Anyway, is there a known, standardized way to download blogs? I haven't yet found one. But it seems such a common use case! Does anybody have any advice?
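
There is no single standard tool for this, but one workaround is a small crawler that skips the archive views that re-list the same posts. The sketch below is a minimal illustration only: the blog URL is a placeholder and the skip patterns are assumptions about typical Blogger/WordPress layouts, so they will need adjusting per site.

```python
# Minimal sketch of a same-host crawl that avoids the duplicate archive views
# (/page/N pagination and date-only paths like /2015/10/ or /2015/10/01/).
import os
import re
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "https://example-blog.wordpress.com/"  # hypothetical blog root
SKIP = re.compile(r"/page/\d+/?$|/\d{4}/\d{2}(/\d{2})?/?$")  # assumed layout

os.makedirs("dump", exist_ok=True)
seen, queue = set(), [START]

while queue:
    url = queue.pop()
    if url in seen or SKIP.search(urlparse(url).path):
        continue
    seen.add(url)
    resp = requests.get(url, timeout=30)
    if "text/html" not in resp.headers.get("Content-Type", ""):
        continue
    # Flat dump keyed by crawl order; a real mirror would keep a URL -> file map.
    with open(f"dump/{len(seen):05d}.html", "wb") as f:
        f.write(resp.content)
    for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == urlparse(START).netloc:
            queue.append(link)
```

Because permalinks like /2015/10/01/post-slug/ or /2015/10/title.html do not match the date-only pattern, actual posts still get fetched; only the re-listing archive pages are skipped, which is one way to avoid the circularity described above.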

By @HocusLocus - 7 months
I've been trying to extract historycommons.org from the Wayback Machine and it is an uphill battle, even just to grab the ~198 pages it says it collected. Even back in the days after 9/11, when the site rose to prominence, I shuddered at its dynamically served implementation. Those were the days of Java, and they loaded down the server side with CPU time when it would rather have been serving static items... from REAL directories, with REAL If-Modified-Since support: file attributes set from the combined database update times... a practice that seems to have gone by the wayside on the Internet completely.

Everything everywhere is now Last-Modified today, now, just for YOU! Even if it hasn't changed. Doesn't that make you happy? Do you have a PROBLEM with that??
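
For readers unfamiliar with the header being mourned here, a conditional GET makes the difference easy to see. In this minimal sketch (the URL is a placeholder), a well-behaved static server answers 304 Not Modified on the second request, while a dynamic site that stamps Last-Modified with the current time answers 200 every time.

```python
# Minimal sketch of a conditional GET. The URL is a placeholder.
import requests

url = "https://example.org/archive/page.html"

first = requests.get(url, timeout=30)
last_modified = first.headers.get("Last-Modified")

if last_modified:
    again = requests.get(url, headers={"If-Modified-Since": last_modified},
                         timeout=30)
    # 304 -> the server really tracks modification times;
    # 200 -> it regenerates and re-serves the page even though nothing changed.
    print(again.status_code)
else:
    print("Server does not advertise Last-Modified at all.")
```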

Everything unique on the site lived after the '?', and there was more than one way to get 'there', 'there' being anywhere.

I suspect that many tried to whack the site and then finally gave up. I got a near-successful whack once after lots of experimenting, but said to myself at the time, "This thing will go away, and it's sad."

That treasure is not reliably archived.

Suggestion: even if the whole site is generated from a database, choose a view that presents everything once and only once, and expose it to the world as a group of pages that completely divulge the content using slash separators only (/x/y/z/xxx.html, .jpg, etc.), with no duplicative tangents if the crawler ignores everything after the '?'... and place the actual static items in a hierarchy. The most satisfying crawl is one where you can do this, knowing the archive will be complete and relevant, with no need to 'attack' the server side with process-spawning.
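
As a rough illustration of that suggestion, the sketch below walks a data source once and writes each record to a single canonical, slash-separated path with a real extension, so a crawler sees every item exactly once. The fetch_records() helper and the path scheme are hypothetical stand-ins, not anything historycommons.org actually used.

```python
# Minimal sketch: emit every database record once, at one static path.
import os

def fetch_records():
    # Hypothetical stand-in for a query that walks the database exactly once.
    yield {"section": "timelines", "slug": "example_entry",
           "html": "<html><body>example</body></html>"}

for rec in fetch_records():
    # One canonical, slash-separated path per record, with a real extension.
    path = os.path.join("static", rec["section"], rec["slug"] + ".html")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(rec["html"])
```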

By @alganet - 7 months
One question seems obvious:

With AIs and stuff, are we saving humanity's digital history, or are we saving a swarm of potentially biased auto-generated content published by the few who can afford large-scale deployment of LLMs?