How I scraped 6 years of Reddit posts in JSON
The article covers scraping six years of Reddit self-promotion posts, highlighting challenges such as limits on how many posts can be pulled and the 14-day cutoff. Pushshift is suggested as a source for Reddit archives. The author explains how URLs were extracted from posts and how each website's status was checked. Findings reveal that around 40% of the sites are inactive, and trends in online startups are discussed.
The article discusses scraping 6 years of Reddit posts in JSON format, focusing on self-promotion posts from specific subreddits. The author details the challenges of scraping Reddit data: limits on the number of posts that can be pulled and a cutoff after 14 days. An alternative data provider, Pushshift, is mentioned as a source for Reddit archives. The author also explains how URLs were extracted from the posts and how each website was checked for liveness by sending a HEAD request. The results show that around 40% of the websites are abandoned or non-operational. The article concludes with observations on trends in online business startups drawn from the collected data, along with resources for further exploration and lessons learned from the scraping process.
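The pipeline described above (pull a subreddit's posts as JSON, extract outbound URLs, probe each with a HEAD request) can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the function names and filtering rules are assumptions, and Reddit's public `.json` listing only reaches recent posts, which is exactly why the article falls back to Pushshift for the historical archive.

```python
# Hedged sketch of the scrape-extract-probe pipeline described in the article.
# Assumptions: helper names are illustrative; Reddit's public JSON listing
# endpoint is used for recent posts only (historical data needs Pushshift).
import json
import urllib.request

USER_AGENT = "research-script/0.1"  # Reddit rejects the default Python UA

def fetch_listing(subreddit, limit=25):
    """Fetch one page of recent posts via Reddit's public .json endpoint."""
    url = f"https://www.reddit.com/r/{subreddit}/new.json?limit={limit}"
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.load(resp)
    return [child["data"] for child in data["data"]["children"]]

def extract_urls(posts):
    """Keep outbound links only, skipping self-posts and reddit.com links."""
    return [p["url"] for p in posts
            if p.get("url", "").startswith("http")
            and "reddit.com" not in p["url"]]

def is_alive(url):
    """HEAD request: any 2xx/3xx answer suggests the site is still up."""
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return 200 <= resp.status < 400
    except Exception:  # DNS failure, timeout, HTTP error, etc.
        return False
```

A HEAD request is the natural choice here because it returns only headers, so thousands of sites can be checked without downloading page bodies; the trade-off is that a few servers mishandle HEAD and may need a GET fallback.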
Related
The demise of the mildly dynamic website (2022)
Websites evolved from hand-crafted HTML to PHP, which enabled dynamic web apps with simple deployment. PHP's decline led to static site generators replacing mildly dynamic sites, shifting features like comments to JavaScript.
Simple ways to find exposed sensitive information
Various methods for finding exposed sensitive information are discussed, including search engine dorking, GitHub searches, and PublicWWW for hardcoded API keys. The risks of misconfigured AWS S3 buckets are highlighted, stressing data confidentiality.
Surfing the (Human-Made) Internet
The internet's evolution prompts a return to its human side, advocating for personal sites, niche content, and self-hosted platforms. Strategies include exploring blogrolls, creating link directories, and using alternative search engines. Embrace decentralized social media and RSS feeds for enriched online experiences.
Serving a billion web requests with boring code
The author shares insights from redesigning the Medicare Plan Compare website for the US government, focusing on stability and simplicity using technologies like Postgres, Golang, and React. Collaboration and dedication were key to success.
Show HN: Linkgrabs.com the Simple and Fast API to Fetch JavaScript Web Pages
LinkGrabs.com offers an API service for fetching web pages at $0.0005 per grab. Grab bags cost $5.00 for 10,000 grabs. New accounts get 33 free grabs and can request 333 more. The API returns client-side rendered content.