June 30th, 2024

A Large-Scale Structured Database of a Century of Historical News

A large-scale database named Newswire was created, containing 2.7 million U.S. newswire articles from 1878 to 1977. Reconstructed using deep learning, it aids research in language modeling, computational linguistics, and social sciences.

The article discusses the creation of Newswire, a large-scale structured database of 2.7 million U.S. newswire articles published between 1878 and 1977. The articles were reconstructed by applying deep learning to raw image scans of local newspapers; the pipeline also georeferences where each article was written, tags topics, and disambiguates named individuals to Wikipedia. The dataset is linked to Library of Congress metadata about the newspapers that ran each article. It is valuable for language modeling, computational linguistics, social science, and digital humanities research: it captures the news content that shaped national identity and understanding, and it offers a way to expand training data beyond modern web text while supporting a range of interdisciplinary questions.
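As a minimal sketch of how such a dataset could be queried, assuming it is released on the Hugging Face Hub: the dataset identifier and the 'year' field below are assumptions for illustration, not details confirmed by the article.

```python
from collections import Counter
from datasets import load_dataset

# Assumed dataset identifier and schema; check the authors' release
# for the real repository name and field names.
ds = load_dataset("dell-research-harvard/newswire", split="train")

# Count articles per decade, assuming each record carries a 'year' field.
by_decade = Counter((int(ex["year"]) // 10) * 10 for ex in ds)
for decade, count in sorted(by_decade.items()):
    print(f"{decade}s: {count} articles")
```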

Related

GitHub – Karpathy/LLM101n: LLM101n: Let's Build a Storyteller

The GitHub repository "LLM101n: Let's build a Storyteller" offers a course on creating a Storyteller AI Large Language Model using Python, C, and CUDA. It caters to beginners, covering language modeling, deployment, programming, data types, deep learning, and neural nets. Additional chapters and appendices are available for further exploration.

Delving into ChatGPT usage in academic writing through excess vocabulary

A study by Dmitry Kobak et al. examines ChatGPT's impact on academic writing, finding increased usage in PubMed abstracts. Concerns arise over accuracy and bias despite advanced text generation capabilities.
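The underlying measurement is simple enough to sketch: compare word frequencies in abstracts before and after a cutoff date and flag words whose usage jumped. A toy version of that idea, with placeholder corpora and thresholds rather than the study's actual method:

```python
from collections import Counter

def word_freqs(texts):
    # Relative frequency of each lowercase word across a corpus.
    counts = Counter(w.lower() for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def excess_words(before, after, floor=1e-6, min_ratio=2.0):
    # Flag words whose relative frequency at least doubled after the cutoff.
    f_before, f_after = word_freqs(before), word_freqs(after)
    return sorted(w for w, f in f_after.items()
                  if f / max(f_before.get(w, 0.0), floor) >= min_ratio)

# Usage: excess_words(abstracts_2019, abstracts_2024) might surface words
# like "delve" that spiked once LLM-assisted writing spread.
```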

Surprise, your data warehouse can RAG

A blog post by Maciej Gryka explores "Retrieval-Augmented Generation" (RAG) to enhance AI systems. It discusses building RAG pipelines, using text embeddings for data retrieval, and optimizing data infrastructure for effective implementation.
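A minimal RAG sketch, with a toy hashing embedding standing in for a real embedding model; the function names here are illustrative, not Gryka's code or any particular warehouse's API.

```python
import numpy as np

def embed(text, dim=256):
    # Toy bag-of-words hashing embedding; in practice, call a real
    # embedding model here.
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def retrieve(query, docs, k=3):
    # Rank documents by cosine similarity (vectors are unit-normalized).
    doc_vecs = np.stack([embed(d) for d in docs])
    sims = doc_vecs @ embed(query)
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def rag_prompt(query, docs):
    # Retrieved passages become the context an LLM answers from.
    context = "\n\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```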

The End of an Era: MTV News Shuts Down… Or why archives are important

The shutdown of the MTV News website erases 20 years of content, underscoring the importance of archiving online data. The loss affects journalists, writers, and researchers, highlighting the need to preserve cultural and historical information.

Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs

The study presents a method to boost Large Language Models' retrieval and reasoning abilities for long-context inputs by fine-tuning on a synthetic dataset. Results show significant improvements in information retrieval and reasoning skills.
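A hedged sketch of what such synthetic fine-tuning data might look like, in the spirit of a key-value retrieval task: hide one "needle" fact among many distractors and train the model to answer a question about it. The prompt and answer formats are assumptions, not taken from the paper.

```python
import random
import uuid

def make_example(n_pairs=100):
    # Many distractor key-value pairs, one queried needle.
    pairs = {uuid.uuid4().hex[:8]: uuid.uuid4().hex[:8] for _ in range(n_pairs)}
    needle = random.choice(list(pairs))
    lines = [f"Key {k} maps to value {v}." for k, v in pairs.items()]
    random.shuffle(lines)
    prompt = "\n".join(lines) + f"\n\nWhat value does key {needle} map to?"
    return {"prompt": prompt, "answer": pairs[needle]}
```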

11 comments
By @rileyphone - 7 months
According to the paper, the dataset goes up to 1978 because that's when copyright law was updated to automatically apply to newswires. It's unfortunate that we got into a situation where academia has to play by the rules wrt copyright while big private labs flout them.
By @runningmike - 7 months
The text states: "All code used to create Newswire is available at our public Github repository." But a check showed "This repository is empty." So not yet open science.
By @dbglog - 7 months
This seems like a serious scholarly work and a contribution no matter how you cut it. Thanks to the team that put this out.
By @donohoe - 7 months
Just randomly searched for a term and the article data appears full of typos and OCR mistakes in the sample I used.

Makes me wonder if this is a bigger problem.
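One rough, scalable way to check, as a sketch: score each article by the share of its tokens found in a reference wordlist, and flag outliers. The wordlist path is a placeholder, and period spelling and proper nouns will depress the score, so treat it as a relative signal only.

```python
import re

def dictionary_hit_rate(text, vocab):
    # Share of alphabetic tokens found in a reference wordlist; a crude
    # but scalable proxy for OCR quality.
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in vocab for t in tokens) / len(tokens) if tokens else 0.0

# vocab = set(open("/usr/share/dict/words").read().lower().split())
# Articles far below the corpus median hit rate likely carry heavy OCR noise.
```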

By @tunesmith - 7 months
I've wondered for a while if a new kind of news could be fashioned from the events of the world. Basically, if I were to try to define which items are most newsworthy, it'd be along the lines of what (according to my pop-culture understanding) Claude Shannon described: the items that are most "surprising". But while we normally treat surprise as a subjective thing that affects only one person, like the item that most conflicts with my current understanding and forces me to update my own model, we'd instead apply it to very large groups of people, or whatever group the "news" is being tuned to.

So: which items of news cause the most "change" (or surprise) to that group of people? All our understandings and notions of truth rest on large chains of reasoning built from premises and values. When a new event changes a premise our understandings depend on, which events cause the largest changes in those DAGs?

We're regularly subjected to "news" that doesn't change anything very much, while subtle events of deep impact are regularly missed. Maybe it would be a way to surface those things. I wonder if someone smarter than me could analyze a data set such as this and come up with a revised set of headlines and articles over the years that do the best job of communicating the most important changes.
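A toy rendering of that Shannon-style idea: score items by surprisal, -log2 p, under some model of what the audience already expects. Here the "model" is just unigram frequencies over prior headlines; a real system would need a far richer model of the group's beliefs.

```python
import math
from collections import Counter

def surprisal_scorer(history):
    # Unigram frequencies estimated from previously seen headlines.
    counts = Counter(w.lower() for h in history for w in h.split())
    total = sum(counts.values())
    vocab = len(counts) + 1

    def score(headline):
        # Mean per-word surprisal, -log2 p, with add-one smoothing
        # so unseen words get a finite (high) surprisal.
        words = headline.lower().split()
        return sum(-math.log2((counts[w] + 1) / (total + vocab))
                   for w in words) / max(len(words), 1)

    return score

# score = surprisal_scorer(past_headlines)
# Rank candidate headlines by score(...) descending: higher = more surprising.
```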

By @1propionyl - 7 months
Really good stuff. On the ground, in the weeds data collection and collation like this is sadly often thankless (and poorly funded) work, but it's a huge force multiplier to downstream research and should always be commended.
By @zX41ZdbW - 7 months
First look at the data: https://pastila.nl/?05ee30a0/be7f1715c7de106b95cccd9385a6c2e...

TLDR: it makes sense :)

By @cs702 - 7 months
Looks like fantastic work.

It's so nice and refreshing to see something like this, instead of the common "we tweaked this and that thing and got better results on this and that benchmark."

Thank you and congratulations to the authors!

By @KaiserPro - 7 months
Wait, so AP didn't/doesn't have an archive?