July 9th, 2024

Show HN: Crawlee for Python – a web scraping and browser automation library

Crawlee for Python is a powerful web scraping and browser automation library with features like automatic scaling, proxy management, and Playwright integration. It's open source, supports Python 3.9+, and enables efficient, reliable web scraping.

Crawlee for Python is a web scraping and browser automation library designed to help users build reliable crawlers quickly. It offers features like automatic scaling, proxy management, and the ability to switch between HTTP and headless browsers effortlessly. The library is written in Python with type hints for better code completion and bug detection. Crawlee is developed by professionals who use it daily for scraping millions of pages. Users can try out Crawlee by installing it via pipx and using provided templates or by integrating it into their projects. The library leverages Playwright for browser automation and provides anti-blocking features and human-like fingerprints. Crawlee is open source and supports Python 3.9 or higher. It is recommended for those looking to streamline their web scraping processes and maintain scalable crawlers efficiently.

20 comments
By @mdaniel - 3 months
You'll want to prioritize documenting the existing features, since it's no good having a super awesome full-stack web scraping platform if only you can use it. I ordinarily would default to a "read the source" response, but your cutesy coding style makes that a non-starter.

As a concrete example: command-f for "tier" on https://crawlee.dev/python/docs/guides/proxy-management and tell me how anyone could possibly know what `tiered_proxy_urls: list[list[str]] | None = None` should contain and why?
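
For what it's worth, the signature quoted above suggests a list of tiers, each itself a list of proxy URLs. A shape-only illustration with hypothetical URLs — the tier-ordering semantics in the comments are an assumption about the intended use, not documented behavior:

```python
# Hypothetical proxy URLs; each inner list is one tier. The assumption
# here is that tiers are ordered from cheapest (tried first) to most
# expensive (escalated to when lower tiers get blocked).
tiered_proxy_urls = [
    [  # tier 0: datacenter proxies
        "http://datacenter-1.example.com:8000",
        "http://datacenter-2.example.com:8000",
    ],
    [  # tier 1: residential proxies
        "http://residential-1.example.com:8000",
    ],
]

# The parameter's type is list[list[str]] | None, so this shape matches.
assert all(isinstance(url, str) for tier in tiered_proxy_urls for url in tier)
```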

By @Findecanor - 3 months
Does it have support for web scraping opt-out protocols, such as robots.txt, HTTP headers, and content meta tags? These are becoming more important now, especially in the EU after the DSM directive.
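
The linked post doesn't mention robots.txt handling; a caller can at least check it with Python's standard library. The user agent and rules below are illustrative, and the robots.txt is inlined to stand in for a fetched file:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body fetched elsewhere, inlined here for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Disallowed path is refused; everything else is allowed by default.
print(parser.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
```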
By @nobodywillobsrv - 3 months
I don't really understand it. Tried it on some fund site and it didn't really do much besides apparently grepping for links.

The example should show how to literally find and target all data, such as .csv and .xlsx tables, and actually download it.

Anyone can use requests and just get the text and grep for urls. I don't get it.

Remember: pick an example where you need to parse one thing to get 1000s of other things to then hit some other endpoints to then get the 3-5 things at each of those. Any example that doesn't look like that is not going to impress anyone.

I'm not even clear if this is saying it's a framework or actually some automation tool. Automation meaning it actually autodetects where to look.
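
The listing → detail → download pattern the comment describes can be sketched with just the standard library; inline HTML fixtures stand in for network fetches, and the URLs are made up:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Stage 1: a listing page links out to many detail pages.
listing_html = '<ul><li><a href="/fund/1">A</a></li><li><a href="/fund/2">B</a></li></ul>'
detail_pages = {
    "/fund/1": '<table><a href="/files/a.csv">a.csv</a></table>',
    "/fund/2": '<table><a href="/files/b.xlsx">b.xlsx</a></table>',
}

listing = LinkExtractor()
listing.feed(listing_html)

# Stage 2: visit each detail page and pick out the downloadable files.
downloads = []
for url in listing.links:
    detail = LinkExtractor()
    detail.feed(detail_pages[url])
    downloads += [u for u in detail.links if u.endswith((".csv", ".xlsx"))]

print(downloads)  # ['/files/a.csv', '/files/b.xlsx']
```

In a real crawler each `detail_pages[url]` lookup would be an HTTP fetch (or an `enqueue_links` call in a framework that manages the request queue for you).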

By @renegat0x0 - 3 months
I have been running my project with selenium for some time.

Now I am using crawlee. Thanks. I will work on integrating it better into my project, but I can already tell it works flawlessly.

My project, with crawlee: https://github.com/rumca-js/Django-link-archive

By @c0brac0bra - 3 months
Wanted to say thanks for apify/crawlee. I'm a long-time node.js user and your library has worked better than all the others I've tried.
By @intev - 3 months
How is this different from Scrapy?
By @ranedk - 3 months
I found Crawlee a few days ago while figuring out a stack for a project. I wanted a Python library, but found Crawlee with TypeScript so much easier that I ended up coding the entire project in less than a week in TypeScript + Crawlee + Playwright.

I found the API a lot better than any Python scraping API to date. However, I am tempted to try out Python with Crawlee.

The Playwright integration with gotScraping makes the entire programming experience a breeze. My crawling and scraping involves all kinds of frontend-rendered websites with a lot of modified XHR responses to be captured. And IT JUST WORKS!

Thanks a ton. I will definitely use the Apify platform to scale, given the integration.

By @marban - 3 months
Nice list, but what would be the arguments for switching over from other libraries? I’ve built my own crawler over time, but from what I see, there’s nothing truly unique.
By @VagabundoP - 3 months
Looks nice, and modern Python.

The code example on the front page has this:

`const data = await crawler.get_data()`

That looks like Javascript? Is there a missing underscore?

By @fforflo - 3 months
I'd suggest bringing more code snippets from the test cases to documentation as examples.

Nice work though.

By @manishsharan - 3 months
Can this work on intranet sites like SharePoint or Confluence, which require employee SSO?

I was trying to build a small LangChain-based RAG on internal documents, but getting the documents from SharePoint/Confluence (we have both) is very painful.

By @holoduke - 3 months
Does it have event listeners to wait for specific elements based on certain pattern matches? One reason I am still using PhantomJS is that it simulates the entire browser and you can compile your own WebKit into it.
By @barrenko - 3 months
Pretty cool, and any scraping tool is really welcome - I'll try it out for my personal project. At the moment, due to AI, scraping is like selling shovels during a gold rush.
By @ijustlovemath - 3 months
Do you have any plans to monetize this? How are you supporting development?
By @renegat0x0 - 3 months
Can it be used to obtain RSS content? Most of the examples focus on HTML.
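
For what it's worth, once a crawler hands you the raw response body, an RSS feed parses fine with the standard library. An offline sketch with an inline fixture standing in for a fetched feed:

```python
import xml.etree.ElementTree as ET

# Inline RSS 2.0 fixture standing in for a fetched response body.
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
items = [
    (item.findtext("title"), item.findtext("link"))
    for item in root.iter("item")
]
print(items)
# [('First post', 'https://example.com/1'), ('Second post', 'https://example.com/2')]
```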
By @bmitc - 3 months
Can you use this to auto-logon to systems?
By @localfirst - 3 months
In one sentence, what does this do that existing web scraping and browser automation libraries don't?
By @thelastgallon - 3 months
I wonder if there are any AI tools that do web scraping for you without having to write any code?