A Large-Scale Structured Database of a Century of Historical News
Newswire is a large-scale structured database of 2.7 million U.S. newswire articles from 1878 to 1977. Reconstructed from newspaper image scans using deep learning, it supports research in language modeling, computational linguistics, and the social sciences.
The article discusses the creation of Newswire, a large-scale structured database containing 2.7 million U.S. newswire articles from 1878 to 1977. The texts were reconstructed by applying deep learning to raw image scans of local newspapers, with locations georeferenced, topics tagged, and individuals disambiguated against Wikipedia. The dataset also includes Library of Congress metadata about the newspapers that carried the stories. It is valuable for language modeling, computational linguistics, social science, and digital humanities research, offering insight into the historical news content that shaped national identity and shared understanding, and a unique resource for expanding training data beyond modern web texts.
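For readers who want to explore the corpus directly, here is a minimal sketch of loading it with the Hugging Face datasets library. The repository path and the column layout are assumptions for illustration, not details confirmed by this article; check the dataset card for the actual schema.

```python
# Minimal sketch: loading the Newswire corpus via Hugging Face datasets.
# ASSUMPTION: the repo path "dell-research-harvard/newswire" and the
# presence of article text plus metadata columns are illustrative only.
from datasets import load_dataset

newswire = load_dataset("dell-research-harvard/newswire", split="train")

# Inspect one reconstructed article and its available metadata fields
# (e.g. article text, dates, topic tags, georeferenced locations).
example = newswire[0]
print(example.keys())
```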
Related
GitHub – Karpathy/LLM101n: LLM101n: Let's Build a Storyteller
The GitHub repository "LLM101n: Let's build a Storyteller" offers a course on creating a Storyteller AI Large Language Model using Python, C, and CUDA. It caters to beginners, covering language modeling, deployment, programming, data types, deep learning, and neural nets. Additional chapters and appendices are available for further exploration.
Delving into ChatGPT usage in academic writing through excess vocabulary
A study by Dmitry Kobak et al. uses excess vocabulary to gauge ChatGPT's impact on academic writing, finding evidence of widespread use in PubMed abstracts. Concerns arise over accuracy and bias despite advanced text-generation capabilities.
Surprise, your data warehouse can RAG
A blog post by Maciej Gryka explores "Retrieval-Augmented Generation" (RAG) to enhance AI systems. It discusses building RAG pipelines, using text embeddings for data retrieval, and optimizing data infrastructure for effective implementation.
The End of an Era: MTV News Shuts Down… Or why archives are important
The shutdown of the MTV News website erased two decades of content, underscoring the importance of archiving online material. The loss affects journalists, writers, and researchers, highlighting the need to preserve cultural and historical information.
Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs
The study presents a method to boost Large Language Models' retrieval and reasoning abilities for long-context inputs by fine-tuning on a synthetic dataset. Results show significant improvements in information retrieval and reasoning skills.
Makes me wonder if this is a bigger problem.
So, which items of news cause the most "change" (or surprise) to that group of people? Our understandings and notions of truth rest on long chains of reasoning built from premises and values. When a new event invalidates a premise those understandings rest on, which events propagate the largest changes through those DAGs?
We're regularly subjected to "news" that doesn't change anything very much, while subtle events of deep impact go unnoticed. Maybe this would be a way to surface those things. I wonder if someone smarter than me could analyze a dataset such as this and come up with a revised set of headlines and articles over the years that best communicate the most important changes (a rough sketch of one scoring approach follows below).
TLDR: it makes sense :)
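One crude way to operationalize the "surprise" idea above is to embed each article and score it by its distance from recent coverage. This is a minimal sketch, assuming the sentence-transformers library; the model name, window size, and scoring recipe are my assumptions, not the commenter's proposal or the paper's method.

```python
# Hedged sketch: score each article's "surprise" as its cosine distance
# from the centroid of the previous `window` articles' embeddings.
# ASSUMPTION: model choice and windowed-centroid recipe are illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def surprise_scores(articles, window=500):
    """Return one novelty score per article (higher = more surprising),
    measured against a rolling centroid of recent coverage."""
    emb = model.encode(articles, normalize_embeddings=True)
    scores = []
    for i in range(len(emb)):
        lo = max(0, i - window)
        if i == lo:
            scores.append(0.0)  # no prior context for the first article
            continue
        centroid = emb[lo:i].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        scores.append(1.0 - float(emb[i] @ centroid))
    return scores
```

Articles with the highest scores would be candidates for the "headlines that changed the most premises" view, though distinguishing genuinely impactful events from merely unusual wording would take more than embedding distance.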
It's so nice and refreshing to see something like this, instead of the common "we tweaked this and that thing and got better results on this and that benchmark."
Thank you and congratulations to the authors!