August 16th, 2024

A web scraping CLI made for AI that is idempotent

The "Scrape It Now" GitHub repository offers an efficient web scraping tool with Azure integration, supporting parallel operations, ad-blocking, dynamic content handling, and easy configuration for developers.

The GitHub repository "Scrape It Now" is a web scraping tool that emphasizes efficiency and scalability. It features a decoupled architecture utilizing Azure Queue Storage for task management, enabling idempotent operations that allow for parallel execution without re-scraping unchanged pages. Scraped data is stored in Azure Blob Storage, and the tool includes ad-blocking capabilities to minimize network costs. It handles dynamic content by using Playwright to render JavaScript-driven pages, and it randomizes user agents and viewport sizes to reduce the chance of being fingerprinted or blocked. Additionally, it automatically creates a searchable index of the scraped content using Azure AI Search. To use the tool, users run jobs that scrape websites or index scraped content after setting the necessary environment variables for the Azure services. Advanced users can simplify configuration by sourcing environment variables from a `.env` file. This tool is particularly beneficial for developers aiming to scrape and index web content efficiently while utilizing Azure's cloud infrastructure.
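
The queue-and-blob workflow described above can be sketched in a few lines of Python. This is not the project's own code: the queue name, container name, environment variable, and hash-based skip check are assumptions made for illustration, using the standard azure-storage-queue and azure-storage-blob SDKs to show how a worker might avoid re-scraping unchanged pages.

```python
# Minimal sketch of an idempotent scrape worker, assuming URLs to visit sit in
# Azure Queue Storage and scraped pages land in Azure Blob Storage. The queue
# name, container name, env var, and hash-based skip logic are illustrative and
# not taken from the Scrape It Now codebase.
import hashlib
import os

import httpx
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # assumed variable name
queue = QueueClient.from_connection_string(CONN_STR, "scrape-tasks")
container = BlobServiceClient.from_connection_string(CONN_STR).get_container_client(
    "scraped-pages"  # container assumed to already exist
)


def handle(url: str) -> None:
    """Fetch one URL and upload it only if its content actually changed."""
    html = httpx.get(url, follow_redirects=True).text
    digest = hashlib.sha256(html.encode()).hexdigest()
    blob = container.get_blob_client(hashlib.sha256(url.encode()).hexdigest() + ".html")

    # Idempotency check: skip the upload when the stored copy has the same hash.
    if blob.exists() and blob.get_blob_properties().metadata.get("sha256") == digest:
        return
    blob.upload_blob(html, overwrite=True, metadata={"sha256": digest})


# Several workers can run this loop in parallel; each page is rewritten only
# when its content changes.
for message in queue.receive_messages():
    handle(message.content)
    queue.delete_message(message)
```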

- "Scrape It Now" is designed for efficient and scalable web scraping.

- It features ad-blocking and dynamic content handling with Playwright.

- The tool supports idempotent operations for parallel scraping.

- Scraped data is stored in Azure Blob Storage and indexed using Azure AI Search.

- Users can configure the tool easily with environment variables or a `.env` file.
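
As a rough illustration of that last point, the snippet below loads a `.env` file and verifies that the Azure settings are present before starting a job. The variable names are placeholders rather than the tool's documented ones; the repository's README lists the exact names.

```python
# Pre-flight check for the environment-based configuration. The variable names
# below are placeholders; the names actually expected by Scrape It Now are
# documented in its README.
import os
import sys

from dotenv import load_dotenv  # pip install python-dotenv

REQUIRED = [
    "AZURE_STORAGE_CONNECTION_STRING",  # Queue + Blob Storage (assumed name)
    "AZURE_SEARCH_ENDPOINT",            # Azure AI Search endpoint (assumed name)
    "AZURE_SEARCH_API_KEY",             # Azure AI Search key (assumed name)
]

load_dotenv()  # merge values from a local .env file into os.environ

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required settings: {', '.join(missing)}")
print("Configuration looks complete; ready to run a scrape or index job.")
```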

7 comments
By @renegat0x0 - 4 months
I have a similar project. I scrape pages only to obtain page meta. It can use Selenium or Crawlee; I will also add Puppeteer later.

The project is quite big and has many features.

It is my internet command center. I use it to check what's new on the internet.

https://github.com/rumca-js/Django-link-archive

By @usernamed7 - 4 months
Not to discount any actual utility or innovation here, but I was wondering, "Why would you hard-code to all these Azure services?" Then I saw that the author is a solutions architect at Microsoft.

So this is likely part of Microsoft's AI strategy to lure developers in and create dependence. That doesn't mean it can't also be interesting or good, but it's important context for this project's purpose and goals.

By @cha-d - 4 months
Am I right in thinking that running this regularly from your computer at home will cause you to start receiving more CAPTCHAs over time? If so, what are some other options?

By @wcallahan - 4 months
Nice work. Would love a similar repository for Google Cloud's equivalent services!

Or a PR on this that accomplishes the same, as @clemlesne mentioned.

By @katella - 4 months
Does it just scrape all pages of a site?

By @mrdw - 4 months
Why so dependent on Azure?

"Decoupled architecture with Azure Queue Storage"

"Scraped content is stored in Azure Blob Storage"

"Indexed content is semantically searchable with Azure AI Search"

By @bbor - 4 months
lol I love the cheeky `[ ] respect robots.txt` mention. I was all worried about this for my own system, but shocked to find out there are a ton of projects openly built around breaking the law (/social protocol). Is the justification just the same as for pirating entertainment, i.e. “big companies are bad” and/or “IP is unjustified”?

I don’t think this one fits my exact use case, but I love the repo; it’s very clearly explained. Well done! I hadn’t even thought about ads until just now; that’s an interesting problem…