June 30th, 2024

How I scraped 6 years of Reddit posts in JSON

The article covers scraping six years of Reddit posts for self-promotion data and highlights challenges such as post limits and cutoffs. Pushshift is suggested as a source of Reddit archives. The article explains how to extract URLs from posts and check whether each website is still up. The findings show that about 40% of the sites are inactive, and the article closes with trends in online startups.

The article walks through scraping six years of Reddit posts in JSON format, focusing on self-promotion posts from specific subreddits. The author details two obstacles: a limit on the number of posts that can be pulled, and a cutoff after 14 days. Pushshift, an alternative data provider, is mentioned as a source for Reddit archives. The author then explains how URLs were extracted from the posts and how to check whether a website is operational by sending a HEAD request. The results show that around 40% of the websites are abandoned or non-operational. The article concludes with trends observed in online business startups, based on the collected data. The author also shares resources for exploring the data further and lessons learned from the scraping process.
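The URL extraction and HEAD-request check described above could look roughly like the sketch below. It is not the author's code: the regular expression, the function names `extract_urls` and `is_operational`, the 10-second timeout, and the rule that any status below 400 counts as operational are all assumptions made for illustration.

```python
import http.client
import re
import urllib.error
import urllib.request

# Assumed pattern: an http(s) URL runs until whitespace or a closing bracket/quote.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def extract_urls(text):
    """Return the unique URLs in a post body, in order of first appearance."""
    urls = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


def is_operational(url, timeout=10.0):
    """Send a HEAD request; treat any response status below 400 as operational.

    This threshold is an assumption, not a rule taken from the article.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections,
        # timeouts, and strings that are not valid URLs.
        return False
```

Some servers reject HEAD requests even when the site is up, so a fallback GET for failed checks may be worth considering before counting a site as abandoned.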

The demise of the mildly dynamic website (2022)

Websites evolved from hand-crafted HTML to PHP, which enabled dynamic web apps with simple deployment. As PHP declined, static site generators replaced mildly dynamic sites, and features like comments shifted to JavaScript.

Simple ways to find exposed sensitive information

Various methods for finding exposed sensitive information are discussed, including search engine dorking, GitHub searches, and PublicWWW for hardcoded API keys. The risks of misconfigured AWS S3 buckets are highlighted, underscoring the importance of data confidentiality.

Surfing the (Human-Made) Internet

The internet's evolution prompts a return to its human side, advocating for personal sites, niche content, and self-hosted platforms. Strategies include exploring blogrolls, creating link directories, and using alternative search engines. Embrace decentralized social media and RSS feeds for enriched online experiences.

Serving a billion web requests with boring code

The author shares insights from redesigning the Medicare Plan Compare website for the US government, focusing on stability and simplicity using technologies like Postgres, Golang, and React. Collaboration and dedication were key to success.

Show HN: Linkgrabs.com the Simple and Fast API to Fetch JavaScript Web Pages

LinkGrabs.com offers an API service for fetching web pages at $0.0005 per grab. Grab bags cost $5.00 for 10,000 grabs. New accounts get 33 free grabs and can request 333 more. The API returns client-side rendered content.

2 comments

By @ssahoo - 10 months

No you didn't. You just downloaded the torrent archive.
