August 3rd, 2024

To preserve their work journalists take archiving into their own hands

As news organizations shut down, journalists are increasingly archiving their work to preserve history. Tools like the Wayback Machine and personal records are vital for safeguarding their contributions.

As news websites increasingly shut down, journalists are taking the initiative to preserve their work and the historical context of their reporting. Many news organizations do not prioritize archiving their content, leading to significant losses when sites go dark. Recent examples include the closure of MTV News and the temporary disappearance of Deadspin's archives. A 2021 report indicated that only 7 out of 24 newsrooms were fully preserving their content. Journalists face personal and professional challenges when their work is lost, prompting them to seek creative solutions for archiving. Some utilize tools like the Wayback Machine, while others maintain meticulous records using platforms like AirTable. Freelance reporter Andrea Gutierrez emphasizes the importance of personal archiving, noting that any outlet could close unexpectedly. Matthew Gault, a former reporter for Vice, recounts how his wife developed a scraper to save his articles as the company faced closure. This highlights the urgency and necessity of self-archiving in the current media landscape. While larger legacy outlets may have better resources for preservation, they too must adapt to evolving technologies and potential threats like ransomware. The responsibility for maintaining a record of journalistic work increasingly falls on individual journalists, who must navigate the complexities of digital content preservation to ensure their contributions are not lost to history.
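
The Wayback Machine mentioned above can also be driven programmatically: its public "Save Page Now" endpoint at https://web.archive.org/save/ triggers a capture with a plain HTTP request. Below is a minimal sketch of self-archiving a byline, assuming a hypothetical list of article URLs and the requests library; the endpoint throttles rapid requests, so real use should pace itself.

  import time
  import requests

  # Hypothetical list of your own bylined article URLs.
  ARTICLE_URLS = [
      "https://example-outlet.com/2023/05/my-feature-story",
      "https://example-outlet.com/2024/01/my-investigation",
  ]

  def save_to_wayback(url: str) -> None:
      """Ask the Wayback Machine's Save Page Now endpoint to capture `url`."""
      resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
      resp.raise_for_status()
      print(f"Requested capture of {url} (HTTP {resp.status_code})")

  if __name__ == "__main__":
      for article in ARTICLE_URLS:
          save_to_wayback(article)
          time.sleep(10)  # be polite; anonymous captures are rate-limited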

AI: What people are saying
The comments reflect a strong interest in the preservation of digital content and the challenges associated with archiving.
  • Many users emphasize the importance of personal archiving, sharing experiences of saving their own work to prevent loss.
  • Concerns are raised about the reliability of URLs and the need for permanent identifiers for digital documents.
  • Some commenters suggest collaboration with established archiving organizations like Archive.org to ensure content remains accessible.
  • Legal issues regarding ownership of archived work are discussed, particularly in relation to journalists and their employers.
  • There are calls for the creation of new platforms or repositories dedicated to immutable archiving of articles and digital content.
16 comments
By @walterbell - 4 months
> “Thank goodness she did that because [otherwise] we would have no records of the early years of the first Women’s Hockey League in Canada,” Azzi said.

A few years ago, Canada digitized many older television shows, https://news.ycombinator.com/item?id=35716982

  With the help of many industry partners, the [Canada Media Fund] CMF team unearthed Canadian gems buried in analog catalogues. Once discovered, we worked to secure permissions and required rights and collaborate with third parties to digitize the works, including an invaluable partnership with Deluxe Canada that covered 40 per cent of the digitization costs. The new, high-quality digital masters were made available to the rights holders and released to the public on the Encore+ YouTube channel in English and French.
In late 2022, the channel deleted the entire YouTube Encore archive of Canadian television with two weeks' notice. A few months later, half of the archive resurfaced on https://archive.org/search?query=creator%3A%22Encore%20%2B%2.... If anyone independently archived the missing Encore videos from YouTube, please mirror them to Archive.org.
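
For anyone who does hold copies, the Internet Archive's own internetarchive Python package can push them up. A minimal sketch, with the item identifier, filename, and metadata as placeholder assumptions; it expects credentials already set up via the package's ia configure command.

  from internetarchive import upload

  # Placeholder identifier and metadata; pick an identifier that does not
  # collide with an existing archive.org item.
  item_identifier = "encore-plus-mirror-example"
  metadata = {
      "title": "Encore+ mirror (example)",
      "mediatype": "movies",
  }

  # Uploads the local file into the item, creating the item if needed;
  # assumes credentials were configured beforehand with `ia configure`.
  responses = upload(item_identifier, files=["encore-episode.mp4"], metadata=metadata)
  print([r.status_code for r in responses])
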
By @tamimio - 4 months
Good thing I’m a hoarder. If I like something, I archive it and back it up locally. For example, a couple of days ago, I needed some digital assets for an Adobe program that I had downloaded a few months ago because I liked them and thought I might need them in the future. When I went back to the company page a couple days ago, everything had vanished! I'm glad I had downloaded them before and checked my backup to retrieve them.
By @vasco - 4 months
A nice social attack is to create an Internet Archive-looking website, call it archive.newtld, and use it to create social proof of things you didn't actually do. "Oh yeah, the Washington Post did a redesign, but here are my past 10 posts, which I saved in the archive: link"

In the post-truth internet, proving the authenticity of archives is going to be tough, and unless there's some other form of verification, they're going to become useless for "proving" purposes fast.

You can think bigger and do this to forge stories about anything you want on any website. Nobody checks the authenticity of archive URLs, there are already several such sites, and a lot of these services do URL rewriting, so it's hard unless there's some authoritative source.
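
One possible building block for that authoritative source: publish a content fingerprint somewhere independent at capture time. The sketch below (hashlib and requests, with a placeholder URL) only pins down what bytes existed when the record was made; it proves nothing about who wrote them unless the record itself is timestamped and held by a party people trust.

  import hashlib
  from datetime import datetime, timezone

  import requests

  def fingerprint_capture(url: str) -> dict:
      """Fetch a page and produce a record you can publish somewhere independent."""
      body = requests.get(url, timeout=30).content
      return {
          "url": url,
          "sha256": hashlib.sha256(body).hexdigest(),
          "captured_at": datetime.now(timezone.utc).isoformat(),
      }

  if __name__ == "__main__":
      # Placeholder URL; in practice this would be the page being archived.
      print(fingerprint_capture("https://example.com"))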

By @vertis - 4 months
I'm not a particularly good writer, but I've written about how I use the SingleFile extension to capture a perma web version of everything interesting that I read[0]. It's a great open source tool that aids in archiving (even if only at the personal level).

I've been taking notes and blogging since the early 2000s, and I keep coming back to find that content I'd linked to has disappeared.

Archive.org and Archive Team do amazing work, but it's a mistake to put all your archiving eggs in one basket.

[0]: https://vertis.io/2024/01/26/how-singlefile-transformed-my-o...
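
For anyone without the extension, the same habit can be approximated in a few lines of Python. This is not SingleFile (which inlines images, CSS, and scripts via the browser); it's just a bare-bones sketch, with hypothetical paths, that saves the raw HTML of each page and keeps a small index so dead links can be resolved later.

  import csv
  import hashlib
  from datetime import date
  from pathlib import Path

  import requests

  ARCHIVE_DIR = Path("web-archive")        # hypothetical local archive folder
  INDEX_FILE = ARCHIVE_DIR / "index.csv"   # maps URL -> saved file, date

  def snapshot(url: str) -> Path:
      """Save the raw HTML of url and record it in the CSV index."""
      ARCHIVE_DIR.mkdir(exist_ok=True)
      html = requests.get(url, timeout=30).text
      name = hashlib.sha1(url.encode()).hexdigest()[:12] + ".html"
      path = ARCHIVE_DIR / name
      path.write_text(html, encoding="utf-8")
      with INDEX_FILE.open("a", newline="") as fh:
          csv.writer(fh).writerow([url, name, date.today().isoformat()])
      return path

  if __name__ == "__main__":
      print(snapshot("https://example.com"))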

By @Triphibian - 4 months
One of the great ironies of this situation is that many of the now-defunct websites had contracts and writing agreements that were absolutely egregious. Often the boilerplate would say that they owned the article (which they paid you a pittance for) until the end of all time.

Prior to that, in the print era, the standard agreement was that they'd have the rights to your story upon publication, and after a reasonable amount of time the rights would revert to the author.

By @8bitsrule - 4 months
Too bad there's not something like DOI or ARK [0] available for anyone to use to give documents a searchable, permanent ID so that a location can be maintained. IME, the half-life of many URLs (5-10 years?) makes them unreliable. I recently was unable to find (by URL) an entire historical collection at a major southern US university until I discovered that it had been moved to a new server.

[0] https://arks.org/about/
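
For documents that do have a DOI, the identifier-to-location mapping is exactly what the doi.org resolver maintains; the current URL can be recovered by following its redirect. A minimal sketch using requests, with the DOI Handbook's own DOI as the example:

  import requests

  def resolve_doi(doi: str) -> str:
      """Follow the doi.org resolver to wherever the document currently lives."""
      resp = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
      resp.raise_for_status()
      return resp.url

  if __name__ == "__main__":
      # "10.1000/182" is the DOI assigned to the DOI Handbook itself;
      # any valid DOI will do here.
      print(resolve_doi("10.1000/182"))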

By @ThinkBeat - 4 months
I fully support the efforts, but are there not legal problems with this? (No, I don't think legal issues should prevent this.)

If I worked for CorporateMediaNews as a columnist and reporter for 10 years and they decide to remove all of it, doesn't CMN own the work and can't they (unfortunately) dispose of it if they so wish? Would I not have any rights to the work?

Thinking about my own career: I have written a hell of a lot of code, and at least 80% of it is closed-source systems for various companies. I don't retain any copies of that code.

It would be interesting if I heard that System X, which I wrote 15 years ago, was being shut down, and I tried to obtain the source code in order to preserve it. I have never heard of anyone doing that, but it probably happens more often with games and the like.

By @whartung - 4 months
My understanding is that some photographers are archiving their digital pictures by basically printing them using the 4-color process, which gives them the 4 "negatives" ("positives"? or whatever they're called) for each color (CMYK I guess).

Those sheets are archival quality and should last for quite some time, given a proper storage environment.

They can always use those later to have them scanned back in should they lose their master digital files.
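
A rough digital version of that separation step can be sketched with Pillow; its built-in RGB-to-CMYK conversion is a naive approximation (no ICC profile), so this only illustrates the idea, with a hypothetical master filename.

  from PIL import Image  # pip install Pillow

  # Hypothetical master file; real prepress separation would apply an ICC
  # profile, while Pillow's built-in conversion is a simple approximation.
  master = Image.open("master-photo.tif").convert("CMYK")

  # Split into the four single-channel "plates" and save each one.
  for name, plate in zip("CMYK", master.split()):
      plate.save(f"separation-{name}.tif")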

By @JohnFen - 4 months
This is the only way, in my opinion. Not just for journalists, but for all professions. If you haven't archived it yourself, on machines and/or media that you are in possession of, then you can't rely on it to continue to persist.
By @ghaff - 4 months
This has been true for a long time. Had I not archived a fair bit of my own work, some of it in the CMSs of dead organizations, some of it inaccessible behind paywalls, much would no longer exist. Journalists are probably in better shape than many because they're more likely to have work they've created on a relatively open web.
By @jfil - 4 months
It's great to see more non-programmers realize how ephemeral Web content is and take on bare-bones archiving efforts of their own.

If you or someone you know are looking to archive content from the Web, but don't know how, I'll be happy to help. My email is in my profile.

By @bzmrgonz - 4 months
Someone should create an immutable article hub repository. They can call it pubark.
By @gerdesj - 4 months
Journos discover backups!

Yay.

By @wkat4242 - 4 months
They should really collaborate with archive.org. They won't shut things down or paywall it.
By @kelsey98765431 - 4 months
Ever since the NYT legal case against OpenAI (pronounce: ClosedASI, not FossAGI; free as in your data for them, not free as in beer) there seems to be an underground current pulling into a riptide of closed information access on the web. Humorously enough, the zimmit project has been quietly updating the living heck out of itself, awakening from a nearly 6-8 year slumber. The once-simple format for making a MediaWiki offline archive is now able to mirror any website, complete with content such as video, PDFs, or other files.

It feels a lot like the end of Usenet or GeoCities, but this time without the incentive for the archivists to share their collections as openly. I am certain full scrapes of Reddit and Twitter exist, even after the API closure changes, but we will likely never see these leave large AI companies' internal data holdings.

I have taken it upon myself to begin using the updated zimmit docker container to start archiving swaths of the 'useful web', meaning not just high quality language tokens, but high quality citations and knowledge built with sources that are not just links to other places online.

I started saving all my starred GitHub repos into a folder, and it came out to just around 125 GB of code.
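
For anyone wanting to do the same, the GitHub API lists a user's starred repositories, and cloning them is a short script away. A minimal sketch with the username as a placeholder, unauthenticated (so subject to GitHub's anonymous rate limit):

  import subprocess
  from pathlib import Path

  import requests

  USERNAME = "your-github-username"   # placeholder
  DEST = Path("starred-repos")

  def starred_repos(user: str):
      """Yield (full_name, clone_url) for every repo the user has starred."""
      page = 1
      while True:
          resp = requests.get(
              f"https://api.github.com/users/{user}/starred",
              params={"per_page": 100, "page": page},
              timeout=30,
          )
          resp.raise_for_status()
          batch = resp.json()
          if not batch:
              return
          for repo in batch:
              yield repo["full_name"], repo["clone_url"]
          page += 1

  if __name__ == "__main__":
      DEST.mkdir(exist_ok=True)
      for full_name, clone_url in starred_repos(USERNAME):
          target = DEST / full_name.replace("/", "__")
          if not target.exists():
              subprocess.run(["git", "clone", clone_url, str(target)], check=False)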

I am terrified that in the very near future a lot of this content will either become paywalled, or the financial incentives of hosting large information repositories will grow past what current ad-revenue-based models can support, as more powerful, larger scraping operations seek to fill their petabytes while I try to keep the few small TB of content I don't want to lose from slipping through my fingers.

If anyone actually cares deeply about content preservation, go and buy yourself a few 10+ TB external disks, grab a copy of zimmit, and start pulling stuff. Put it on archive.org and tag it. So far the only ZIM files I see on archive.org are the ones publicly released by the Kiwix team, yet there is an entire wiki of wikis called WikiIndex that remains almost completely unscraped. Fandom and Wikia are gigantic repositories of information, and I fear they will close themselves up sooner rather than later, while many of the smaller info stores we have all come to take for granted as being "at our fingertips" will slowly slip away.

I first noticed the deep web deepening when things I used to be able to find on Google were no longer showing up, no matter how well I knew the content I was searching for, no matter the complex dorking I attempted using operators in the search bar; it was as if they had vanished. For a time Bing was excellent at finding these "scrubbed" sites. Then DuckDuckGo entered the chat, and Bing started to close itself down more. Bing was just a scrape of Google, and Google stopped being reliable, so downstream "search indexers" just became micro-Googles that were slightly out of date with slightly worse search accuracy, but those ghost pages were now being "anti-propagated" into these downstream indexers.

Yandex became and is still my preferred search engine when I actually need to find something online, especially when using operators to narrow wide pools.

I have found some rough edges with zimmit, and I am planning on investigating and even submitting some PRs upstream. But when an archive attempt takes 3 days to run before crashing and wiping out its progress, it has been hard to debug without the FOMO hitting that I should spend the time grabbing what I can now and come back later to work on the code and do everything properly.

If anyone has the time to commit to the project and help make it more stable, perhaps working on more fault recovery or failure continuation, it would make archivists like me, who are strapped for time, very, very happy.

Please go and make a dent in this; news is not the only part of the web I feel could be lost forever if we do not act to preserve it.

In 5 years' time I see generic web search being considered legacy software and eventually decommissioned in favor of AI-native conversational search (blow my brains out). I know for a fact all AI companies are doing massive data collection and structuring for GraphRAG-style operations; my fear is that when it's working well enough, search will just vanish until a group of hobbyists makes it available to us again.