August 16th, 2024

A web scraping CLI made for AI that is idempotent

The "Scrape It Now" GitHub repository offers an efficient web scraping tool with Azure integration, supporting parallel operations, ad-blocking, dynamic content handling, and easy configuration for developers.

The GitHub repository "Scrape It Now" is a web scraping tool that emphasizes efficiency and scalability. It features a decoupled architecture utilizing Azure Queue Storage for task management, enabling idempotent operations that allow for parallel execution without re-scraping unchanged pages. Scraped data is stored in Azure Blob Storage, and the tool includes ad-blocking capabilities to minimize network costs. It handles dynamic content by using Playwright to render JavaScript-driven pages, and it randomizes user agents and viewport sizes to reduce the chance of being fingerprinted or blocked. Additionally, it automatically creates a searchable index of the scraped content using Azure AI Search. To use the tool, users run jobs that scrape websites or index scraped content after setting the necessary environment variables for the Azure services. Advanced users can simplify configuration by sourcing environment variables from a `.env` file. This tool is particularly beneficial for developers aiming to scrape and index web content efficiently while utilizing Azure's cloud infrastructure.
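
The queue-and-blob workflow described above can be sketched in a few lines of Python. This is not the project's own code: the queue name, container name, environment variable, and hash-based skip check are assumptions made for illustration, using the standard azure-storage-queue and azure-storage-blob SDKs to show how a worker might avoid re-scraping unchanged pages.

```python
# Minimal sketch of an idempotent scrape worker, assuming URLs to visit sit in
# Azure Queue Storage and scraped pages land in Azure Blob Storage. The queue
# name, container name, env var, and hash-based skip logic are illustrative and
# not taken from the Scrape It Now codebase.
import hashlib
import os

import httpx
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # assumed variable name
queue = QueueClient.from_connection_string(CONN_STR, "scrape-tasks")
container = BlobServiceClient.from_connection_string(CONN_STR).get_container_client(
    "scraped-pages"  # container assumed to already exist
)


def handle(url: str) -> None:
    """Fetch one URL and upload it only if its content actually changed."""
    html = httpx.get(url, follow_redirects=True).text
    digest = hashlib.sha256(html.encode()).hexdigest()
    blob = container.get_blob_client(hashlib.sha256(url.encode()).hexdigest() + ".html")

    # Idempotency check: skip the upload when the stored copy has the same hash.
    if blob.exists() and blob.get_blob_properties().metadata.get("sha256") == digest:
        return
    blob.upload_blob(html, overwrite=True, metadata={"sha256": digest})


# Several workers can run this loop in parallel; each page is rewritten only
# when its content changes.
for message in queue.receive_messages():
    handle(message.content)
    queue.delete_message(message)
```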

- "Scrape It Now" is designed for efficient and scalable web scraping.

- It features ad-blocking and dynamic content handling with Playwright.

- The tool supports idempotent operations for parallel scraping.

- Scraped data is stored in Azure Blob Storage and indexed using Azure AI Search.

- Users can configure the tool easily with environment variables or a `.env` file.
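
As a rough illustration of that last point, the snippet below loads a `.env` file and verifies that the Azure settings are present before starting a job. The variable names are placeholders rather than the tool's documented ones; the repository's README lists the exact names.

```python
# Pre-flight check for the environment-based configuration. The variable names
# below are placeholders; the names actually expected by Scrape It Now are
# documented in its README.
import os
import sys

from dotenv import load_dotenv  # pip install python-dotenv

REQUIRED = [
    "AZURE_STORAGE_CONNECTION_STRING",  # Queue + Blob Storage (assumed name)
    "AZURE_SEARCH_ENDPOINT",            # Azure AI Search endpoint (assumed name)
    "AZURE_SEARCH_API_KEY",             # Azure AI Search key (assumed name)
]

load_dotenv()  # merge values from a local .env file into os.environ

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required settings: {', '.join(missing)}")
print("Configuration looks complete; ready to run a scrape or index job.")
```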

7 comments
By @renegat0x0 - 4 months
I have a similar project. I scrape pages only to obtain page meta. It can use Selenium or Crawlee; I will also add Puppeteer later.

The project is quite big and has many features.

It is my internet command center. I use it to check what's new on the internet.

https://github.com/rumca-js/Django-link-archive

By @usernamed7 - 4 months
Not to discount any actual utility or innovation here, but I was wondering, "Why would you hard-code to all these Azure services?" Then I saw that the author is a solutions architect at Microsoft.

So this is likely part of Microsoft's AI strategy to lure developers in and create dependence. That doesn't mean it can't also be interesting or good, but it's important context for this project's purpose and goals.

By @cha-d - 4 months
Am I right in thinking that running this regularly from your computer at home will cause you to start receiving more CAPTCHAs over time? If so, what are some other options?

By @wcallahan - 4 months
Nice work. Would love a similar repository for Google Cloud's equivalent services!

Or a PR on this that accomplishes the same, as @clemlesne mentioned.

By @katella - 4 months
Does it just scrape all pages of a site?

By @mrdw - 4 months
Why so dependent on Azure?

"Decoupled architecture with Azure Queue Storage"

"Scraped content is stored in Azure Blob Storage"

"Indexed content is semantically searchable with Azure AI Search"

By @bbor - 4 months
lol I love the cheeky `[ ] respect robots.txt` mention. I was all worried about this for my own system, but shocked to find out there are a ton of projects openly built around breaking the law (/social protocol). Is the justification just the same as for pirating entertainment, i.e. “big companies are bad” and/or “IP is unjustified”?

I don’t think this one fits my exact use case, but I love the repo; it’s very clearly explained. Well done! I hadn’t even thought about ads until just now; that’s an interesting problem…