July 9th, 2024

Show HN: Crawlee for Python – a web scraping and browser automation library

Crawlee for Python is a powerful web scraping and browser automation library with features like automatic scaling, proxy management, and Playwright integration. It's open source, supports Python 3.9+, and enables efficient, reliable web scraping.

Crawlee for Python is a web scraping and browser automation library designed to help users build reliable crawlers quickly. It offers features like automatic scaling, proxy management, and the ability to switch between HTTP and headless browsers effortlessly. The library is written in Python with type hints for better code completion and bug detection. Crawlee is developed by professionals who use it daily for scraping millions of pages. Users can try out Crawlee by installing it via pipx and using provided templates or by integrating it into their projects. The library leverages Playwright for browser automation and provides anti-blocking features and human-like fingerprints. Crawlee is open source and supports Python 3.9 or higher. It is recommended for those looking to streamline their web scraping processes and maintain scalable crawlers efficiently.

20 comments
By @mdaniel - 3 months
You'll want to prioritize documenting the existing features, since it's no good having a super awesome full-stack web scraping platform if only you can use it. I ordinarily would default to a "read the source" response, but your cutesy coding style makes that a non-starter.

As a concrete example: command-f for "tier" on https://crawlee.dev/python/docs/guides/proxy-management and tell me how anyone could possibly know what `tiered_proxy_urls: list[list[str]] | None = None` should contain and why?
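
For what it's worth, the signature quoted above suggests a list of tiers, each itself a list of proxy URLs. A shape-only illustration with hypothetical URLs — the tier-ordering semantics in the comments are an assumption about the intended use, not documented behavior:

```python
# Hypothetical proxy URLs; each inner list is one tier. The assumption
# here is that tiers are ordered from cheapest (tried first) to most
# expensive (escalated to when lower tiers get blocked).
tiered_proxy_urls = [
    [  # tier 0: datacenter proxies
        "http://datacenter-1.example.com:8000",
        "http://datacenter-2.example.com:8000",
    ],
    [  # tier 1: residential proxies
        "http://residential-1.example.com:8000",
    ],
]

# The parameter's type is list[list[str]] | None, so this shape matches.
assert all(isinstance(url, str) for tier in tiered_proxy_urls for url in tier)
```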

By @Findecanor - 3 months
Does it have support for web scraping opt-out protocols, such as robots.txt, HTTP headers, and content meta tags? These are becoming more important now, especially in the EU after the DSM directive.
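
The linked post doesn't mention robots.txt handling; a caller can at least check it with Python's standard library. The user agent and rules below are illustrative, and the robots.txt is inlined to stand in for a fetched file:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body fetched elsewhere, inlined here for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Disallowed path is refused; everything else is allowed by default.
print(parser.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
```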
By @nobodywillobsrv - 3 months
I don't really understand it. Tried it on some fund site and it didn't really do much besides apparently grepping for links.

The example should show how to literally find and target all data, such as .csv and .xlsx tables, and actually download it.

Anyone can use requests and just get the text and grep for urls. I don't get it.

Remember: pick an example where you need to parse one thing to get 1000s of other things to then hit some other endpoints to then get the 3-5 things at each of those. Any example that doesn't look like that is not going to impress anyone.

I'm not even clear if this is saying it's a framework or actually some automation tool. Automation meaning it actually autodetects where to look.
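
The listing → detail → download pattern the comment describes can be sketched with just the standard library; inline HTML fixtures stand in for network fetches, and the URLs are made up:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Stage 1: a listing page links out to many detail pages.
listing_html = '<ul><li><a href="/fund/1">A</a></li><li><a href="/fund/2">B</a></li></ul>'
detail_pages = {
    "/fund/1": '<table><a href="/files/a.csv">a.csv</a></table>',
    "/fund/2": '<table><a href="/files/b.xlsx">b.xlsx</a></table>',
}

listing = LinkExtractor()
listing.feed(listing_html)

# Stage 2: visit each detail page and pick out the downloadable files.
downloads = []
for url in listing.links:
    detail = LinkExtractor()
    detail.feed(detail_pages[url])
    downloads += [u for u in detail.links if u.endswith((".csv", ".xlsx"))]

print(downloads)  # ['/files/a.csv', '/files/b.xlsx']
```

In a real crawler each `detail_pages[url]` lookup would be an HTTP fetch (or an `enqueue_links` call in a framework that manages the request queue for you).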

By @renegat0x0 - 3 months
I have been running my project with selenium for some time.

Now I am using crawlee. Thanks. I will work on integrating it better into my project, but I can already tell it works flawlessly.

My project, with crawlee: https://github.com/rumca-js/Django-link-archive

By @c0brac0bra - 3 months
Wanted to say thanks for apify/crawlee. I'm a long-time node.js user and your library has worked better than all the others I've tried.
By @intev - 3 months
How is this different from Scrapy?
By @ranedk - 3 months
I found Crawlee a few days ago while figuring out a stack for a project. I wanted a Python library, but found Crawlee with TypeScript so much easier that I ended up coding the entire project in less than a week in TypeScript + Crawlee + Playwright.

I found the API a lot better than any Python scraping API to date. However, I am tempted to try out Python with Crawlee.

The Playwright integration with gotScraping makes the entire programming experience a breeze. My crawling and scraping involves all kinds of frontend-rendered websites with a lot of modified XHR responses to be captured. And IT JUST WORKS!

Thanks a ton. I will definitely use the Apify platform to scale, given the integration.

By @marban - 3 months
Nice list, but what would be the arguments for switching over from other libraries? I’ve built my own crawler over time, but from what I see, there’s nothing truly unique.
By @VagabundoP - 3 months
Looks nice, and modern Python.

The code example on the front page has this:

`const data = await crawler.get_data()`

That looks like Javascript? Is there a missing underscore?

By @fforflo - 3 months
I'd suggest bringing more code snippets from the test cases to documentation as examples.

Nice work though.

By @manishsharan - 3 months
Can this work on intranet sites like SharePoint or Confluence, which require employee SSO?

I was trying to build a small LangChain-based RAG on internal documents, but getting the documents from SharePoint/Confluence (we have both) is very painful.

By @holoduke - 3 months
Does it have event listeners to wait for specific elements based on certain pattern matches? One reason I am still using PhantomJS is that it simulates the entire browser and you can compile your own WebKit into it.
By @barrenko - 3 months
Pretty cool, and any scraping tool is really welcome - I'll try it out for my personal project. At the moment, due to AI, scraping is like selling shovels during a gold rush.
By @ijustlovemath - 3 months
Do you have any plans to monetize this? How are you supporting development?
By @renegat0x0 - 3 months
Can it be used to obtain RSS content? Most of the examples focus on HTML.
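
For what it's worth, once a crawler hands you the raw response body, an RSS feed parses fine with the standard library. An offline sketch with an inline fixture standing in for a fetched feed:

```python
import xml.etree.ElementTree as ET

# Inline RSS 2.0 fixture standing in for a fetched response body.
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
items = [
    (item.findtext("title"), item.findtext("link"))
    for item in root.iter("item")
]
print(items)
# [('First post', 'https://example.com/1'), ('Second post', 'https://example.com/2')]
```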
By @bmitc - 3 months
Can you use this to auto-logon to systems?
By @localfirst - 3 months
In one sentence, what does this do that existing web scraping and browser automation libraries don't?
By @thelastgallon - 3 months
I wonder if there are any AI tools that do web scraping for you without having to write any code?