July 9th, 2024

Storing Scraped Data in an SQLite Database on GitHub

The article explains Git scraping: saving scraped data to a Git repository with GitHub Actions. It covers the benefit of historical tracking, the use of SQLite for storage, the limitations of the approach, and Datasette for data visualization.

The article discusses the concept of Git scraping, where data is scraped and saved directly into a Git repository using GitHub Actions. It highlights the benefits of this approach, such as historical data tracking through commits. The author also explores using SQLite as a database format and proposes storing scraped data in an SQLite database within GitHub Artifacts. A detailed workflow is provided, showcasing how to automate web scraping using GitHub Actions and SQLite. The article mentions some limitations, like long-running jobs and GitHub Artifacts retention limits. Additionally, it suggests using Datasette to visualize and interact with the data stored in SQLite. The author concludes by acknowledging the scalability limitations of this approach but expresses enjoyment in experimenting with the setup. The article provides a comprehensive guide for setting up an automated web scraping system using GitHub Actions and SQLite within the GitHub ecosystem.
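The storage step described above can be sketched roughly as follows. This is a minimal illustration, not the article's actual workflow: the table name `snapshots`, its columns, and the helper names `init_db`, `save_snapshot`, and `history` are all assumptions made for this example. It appends each scrape as a timestamped row, so the database itself keeps a history in the way Git commits do for Git scraping.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn):
    """Create the (hypothetical) snapshots table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            scraped_at TEXT NOT NULL,
            url TEXT NOT NULL,
            payload TEXT NOT NULL
        )
        """
    )
    conn.commit()


def save_snapshot(conn, url, payload, scraped_at=None):
    """Append one scraped payload; earlier rows are never overwritten."""
    if scraped_at is None:
        scraped_at = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO snapshots (scraped_at, url, payload) VALUES (?, ?, ?)",
        (scraped_at, url, payload),
    )
    conn.commit()
    return cur.lastrowid


def history(conn, url):
    """Return all stored snapshots for a URL, oldest first."""
    rows = conn.execute(
        "SELECT scraped_at, payload FROM snapshots WHERE url = ? ORDER BY id",
        (url,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    # An in-memory database keeps the demo self-contained; a scheduled job
    # would presumably open a file such as "scrape.db" instead.
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    save_snapshot(conn, "https://example.com/data", '{"value": 1}')
    save_snapshot(conn, "https://example.com/data", '{"value": 2}')
    for scraped_at, payload in history(conn, "https://example.com/data"):
        print(scraped_at, payload)
```

In a scheduled GitHub Actions job, a script like this would presumably run between a step that restores the database file from the previous run and a step that uploads the updated file again. Because the result is an ordinary SQLite file, it could then be opened with Datasette for browsing.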

Related

Deep Dive into GitHub Actions Docker Builds with Docker Desktop

The beta release of Docker Desktop 4.31 introduces a feature enabling users to inspect GitHub Actions builds within the platform. It offers detailed performance metrics, cache utilization, and configuration details for enhanced visibility and collaboration.

How I scraped 6 years of Reddit posts in JSON

The article covers scraping 6 years of Reddit posts for self-promotion data, highlighting challenges like post limits and cutoffs. It suggests Pushshift for Reddit archives and explains how to extract URLs and check website status. The findings show that 40% of the sites are inactive. Trends in online startups are also discussed.

Getting Your DBA Teams Scripts into Git

Managing DBA scripts is crucial. Git aids in versioning and tracking changes, promoting organization and collaboration. Centralizing scripts on GitHub ensures control and enhances team efficiency, fostering better script management.

Simple GitHub Actions Techniques

Denis Palnitsky's Medium article explores advanced GitHub Actions techniques like caching, reusable workflows, self-hosted runners, third-party managed runners, GitHub Container Registry, and local workflow debugging with Act. These strategies aim to enhance workflow efficiency and cost-effectiveness.

Let's Treat Docs Like Code

Treating documentation like code involves using tools like GitHub, automation, and static site generators. The article discusses the importance of learning these tools, best practices for efficient writing, branch protection, case studies, and resources. It also offers insights on building documentation sites.

2 comments
By @kristianp - 6 months
It's fun to test the boundaries of GitHub's services, but if you're doing something useful I'd just hire a VPS; they can be had from $5 a month. You could still upload the SQLite file to GitHub via a check-in.
By @chatmasta - 6 months
Presumably you can bypass the artifact retention limit by uploading them as release artifacts (which are retained forever) rather than job artifacts.

(Not that I’d advocate for this in general, since ultimately you’re duplicating a bunch of data and will eventually catch the eye of some GitHub compliance script.)