Show HN: Crawlee for Python – a web scraping and browser automation library
Crawlee for Python is a powerful web scraping and browser automation library with features like scaling, proxy management, and Playwright integration. It's open source, supports Python 3.9+, and aids in efficient web scraping.
Crawlee for Python is a web scraping and browser automation library designed to help users build reliable crawlers quickly. It offers features like automatic scaling, proxy management, and the ability to switch between HTTP and headless browsers effortlessly. The library is written in Python with type hints for better code completion and bug detection. Crawlee is developed by professionals who use it daily to scrape millions of pages. Users can try Crawlee by installing it via pipx and using the provided templates, or by integrating it into their own projects. The library leverages Playwright for browser automation and provides anti-blocking features and human-like fingerprints. Crawlee is open source and supports Python 3.9 or higher. It is recommended for those looking to streamline their web scraping processes and maintain scalable crawlers efficiently.
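To give a feel for what a crawling library automates, here is a minimal sketch of the core loop (request queue, per-page handler, link extraction, deduplication, data storage) using only the standard library. The fake in-memory pages and helper names are invented for illustration; this is the general pattern, not Crawlee's API.

```python
from html.parser import HTMLParser

# Fake "site": URL -> HTML, standing in for real HTTP responses.
PAGES = {
    "/": '<a href="/a">A</a><a href="/b">B</a>',
    "/a": '<a href="/">home</a><p>alpha</p>',
    "/b": '<p>beta</p>',
}

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

def crawl(start):
    queue, seen, data = [start], {start}, []
    while queue:                      # a crawling library manages this queue,
        url = queue.pop(0)            # plus retries, concurrency, and scaling
        parser = LinkParser()
        parser.feed(PAGES[url])
        data.append({"url": url, "links": parser.links})
        for link in parser.links:     # the "enqueue discovered links" step
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return data

print(crawl("/"))
```

Everything this sketch glosses over (real HTTP, politeness, proxy rotation, browser rendering, blocking detection) is what libraries in this space handle for you.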
Related
Sloth search for Ruby Weekly – a 100 minute hack turned 20h open sauce project
Sloth Finder, a Ruby and Rails tool, curates niche articles on API and automation. It emphasizes simplicity, slow loading times, and plans to upgrade its tech stack for efficiency. Open source on GitHub.
Coverage at a Crossroads
Coverage.py is evolving to reduce execution-time overhead by adopting SlipCover's low-overhead approach for code coverage. Python 3.12's sys.monitoring improves line coverage, but challenges remain for branch coverage. SlipCover's method shows promise, requiring adjustments for optimal results.
Block AI bots, scrapers and crawlers with a single click
Cloudflare launches a feature to block AI bots easily, safeguarding content creators from unethical scraping. Identified bots include Bytespider, Amazonbot, ClaudeBot, and GPTBot. Cloudflare enhances bot detection to protect websites.
SpiderFoot automates OSINT for threat intelligence
SpiderFoot is an open-source intelligence tool on GitHub, with a web interface and command-line access. It aids in reconnaissance and identifying online vulnerabilities with over 200 modules. Installation details are on the SpiderFoot GitHub repository.
Organize Links with Precision and Speed
WebCull is a privacy-focused bookmark management tool with efficient features like folder creation, drag-and-drop organization, sharing via custom URLs, encryption, browser extensions, and planned multilingual support and AI integration.
As a concrete example: command-f for "tier" on https://crawlee.dev/python/docs/guides/proxy-management and tell me how anyone could possibly know what `tiered_proxy_urls: list[list[str]] | None = None` should contain and why?
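For what it's worth, the linked docs describe `tiered_proxy_urls` as a list of tiers, each tier being a list of proxy URLs ordered from cheapest/weakest to most expensive, with the library starting at the lowest tier and escalating when it detects blocking. A simplified stdlib sketch of that idea (the proxy hostnames are invented, and the escalation logic below is an illustration, not Crawlee's code):

```python
# Each inner list is one tier; cheap datacenter proxies first,
# pricier residential proxies as a fallback. Hostnames are made up.
tiered_proxy_urls = [
    ["http://cheap-dc-proxy-1:8000", "http://cheap-dc-proxy-2:8000"],
    ["http://residential-proxy-1:8000"],
]

class TierTracker:
    """Round-robin within a tier; escalate a tier when blocked."""
    def __init__(self, tiers):
        self.tiers = tiers
        self.tier = 0       # start at the cheapest tier
        self.cursor = 0

    def next_proxy(self):
        urls = self.tiers[self.tier]
        url = urls[self.cursor % len(urls)]   # rotate within the tier
        self.cursor += 1
        return url

    def report_blocked(self):
        # move to a stronger (pricier) tier after a blocked response
        if self.tier < len(self.tiers) - 1:
            self.tier += 1

tracker = TierTracker(tiered_proxy_urls)
print(tracker.next_proxy())   # a cheap datacenter proxy first
tracker.report_blocked()
print(tracker.next_proxy())   # escalated to the residential tier
```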
The example should show how to literally find and target all the data (.csv and .xlsx tables, etc.) and actually download it.
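The "find the table files" half of that request is straightforward with the standard library alone: parse the page, keep hrefs with table-file extensions, resolve them against the page URL, then fetch each. A hedged sketch (the sample HTML and base URL are invented; the actual download step is left as a commented one-liner so this runs offline):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class FileLinkFinder(HTMLParser):
    """Collect hrefs that point at downloadable tables (.csv, .xlsx)."""
    EXTS = (".csv", ".xlsx")

    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(self.EXTS):
                self.found.append(urljoin(self.base, value))  # resolve relative links

def find_table_links(base_url, html):
    finder = FileLinkFinder(base_url)
    finder.feed(html)
    return finder.found

# To actually download each hit:
#   import urllib.request
#   urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])

sample = '<a href="stats.CSV">stats</a> <a href="/q1.xlsx">Q1</a> <a href="/page">page</a>'
print(find_table_links("https://example.com/data/", sample))
```

The case-insensitive extension check catches `stats.CSV` as well; JavaScript-rendered pages would need a headless browser to produce the HTML first, which is where a Playwright integration comes in.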
Anyone can use requests and just get the text and grep for urls. I don't get it.
Remember: pick an example where you need to parse one thing to get 1000s of other things to then hit some other endpoints to then get the 3-5 things at each of those. Any example that doesn't look like that is not going to impress anyone.
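The fan-out shape described above (one listing page yields thousands of detail pages, each of which yields an id used against some other endpoint for a handful of fields) can be sketched as three chained fetch steps. All URLs and payloads here are fake stand-ins for HTTP calls; the point is the three-hop structure:

```python
# Offline stand-ins for three kinds of HTTP responses.
LISTING = {"/catalog": ["/item/1", "/item/2", "/item/3"]}
DETAIL = {"/item/1": 101, "/item/2": 102, "/item/3": 103}
API = {101: {"price": 9.5}, 102: {"price": 4.0}, 103: {"price": 7.25}}

def fetch_listing(url): return LISTING[url]   # would be an HTTP GET + parse
def fetch_detail(url): return DETAIL[url]     # would parse an item page for its id
def fetch_api(item_id): return API[item_id]   # would hit a JSON endpoint

def scrape(start):
    results = []
    for detail_url in fetch_listing(start):    # hop 1: 1 page -> N detail URLs
        item_id = fetch_detail(detail_url)     # hop 2: parse each detail page
        record = fetch_api(item_id)            # hop 3: per-item endpoint
        results.append({"url": detail_url, "id": item_id, **record})
    return results

print(scrape("/catalog"))
```

At real scale, hops 2 and 3 are where queueing, concurrency limits, retries, and proxy rotation earn their keep, since a single listing can fan out into thousands of requests.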
I'm not even clear whether this is saying it's a framework or actually some automation tool, automation meaning it auto-detects where to look.
I am now using Crawlee. Thanks. I will work on integrating it better into my project, but I can already tell it works flawlessly.
My project, with crawlee: https://github.com/rumca-js/Django-link-archive
I found the API a lot better than any Python scraping API to date. However, I am tempted to try out Python with Crawlee.
The playwright integration with gotScraping makes the entire programming experience a breeze. My crawling and scraping involves all kinds of frontend rendered websites with a lot of modified XHR responses to be captured. And IT JUST WORKS!
Thanks a ton. I will definitely use the Apify platform to scale, given the integration.
The code example on the front page has this:
`const data = await crawler.get_data()`
That looks like JavaScript? Is there a missing underscore?
Nice work though.
I was trying to build a small LangChain-based RAG on internal documents, but getting the documents from SharePoint/Confluence (we have both) is very painful.
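For the Confluence half, the painful part is usually pagination. Confluence's v1 REST API pages results through `start`/`limit` query parameters on `/rest/api/content` (verify against your instance and auth setup). A hedged sketch of the pagination loop, with the HTTP call injected as a function so it runs offline here; the fake documents are invented:

```python
def iter_content(fetch, limit=2):
    """Yield every item from a paged endpoint, requesting `limit` at a time."""
    start = 0
    while True:
        page = fetch(start=start, limit=limit)   # e.g. GET /rest/api/content?start=..&limit=..
        yield from page["results"]
        if len(page["results"]) < limit:         # a short page means no more data
            return
        start += limit

# Offline stand-in for the HTTP call; a real fetch would use urllib/requests
# with your Confluence base URL and an API token.
DOCS = [{"id": str(i), "title": f"Doc {i}"} for i in range(5)]
def fake_fetch(start, limit):
    return {"results": DOCS[start:start + limit]}

titles = [d["title"] for d in iter_content(fake_fetch)]
print(titles)
```

Injecting `fetch` also makes it easy to swap in a SharePoint client later, since the paging loop itself stays the same.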