September 3rd, 2024

Show HN: I'm making an AI scraper called FetchFox

FetchFox is an AI-powered Chrome extension that lets users scrape data from websites by describing what they need in plain English. It is presented as able to bypass anti-scraping measures, and it exports results in CSV format.

FetchFox is an AI-powered web scraping Chrome extension designed to extract data from various websites by allowing users to specify their data needs in plain English. The tool can bypass anti-scraping measures on platforms like LinkedIn and Facebook, making it effective for gathering information from complex HTML structures. Users can install the extension from the Chrome Web Store, configure scraping jobs by inputting their data requests, and then scrape data from desired web pages. The results can be downloaded in CSV format for further use. Examples of data extraction include retrieving job titles, summarizing work experience, and gathering project details from platforms like LinkedIn, GitHub, and Twitter. FetchFox aims to streamline the data collection process for tasks such as lead generation, research, and market analysis.

- FetchFox is an AI web scraper available as a Chrome extension.

- Users can extract data by describing their needs in plain English.

- The tool can bypass anti-scraping measures on major platforms.

- Data can be downloaded in CSV format for easy access.

- FetchFox is useful for lead generation, research, and market analysis.

Show HN: G-Scraper, a GUI Web Scraper, Written in Python

The G-Scraper project is a Python GUI web scraper with features like request support, scraping multiple URLs/elements, logins, and data saving. Find details on GitHub for usage and contribution.

Instantly Turn Any Webpage into an API

InstantAPI.ai helps developers convert webpages into APIs for data extraction and automation, using ScrapingBee for HTML fetching and OpenAI for structuring, with customizable parameters and built-in error handling.

A web scraping CLI made for AI that is idempotent

The "Scrape It Now" GitHub repository offers an efficient web scraping tool with Azure integration, supporting parallel operations, ad-blocking, dynamic content handling, and easy configuration for developers.

Launch HN: MinusX (YC S24) – AI assistant for data tools like Jupyter/Metabase

MinusX is a free Chrome extension that enhances data analysis in Jupyter and Metabase by automating interactions with AI, allowing users to explore data and ask questions. Future monetization may include subscriptions.

Surfer: Centralize all your personal data from online platforms

Surfer centralizes personal data from various online platforms by scraping and exporting it to local storage. It is available for download, with community support through Discord and a roadmap for future enhancements.

AI: What people are saying

The comments on FetchFox highlight various concerns and suggestions regarding the AI-powered scraping tool.

- Many users question the name "FetchFox" since it currently only supports Chrome, not Firefox.

- There are concerns about potential violations of terms of service for sites like LinkedIn and Twitter, as well as the ethical implications of scraping.

- Users express a desire for clearer cost estimates and usage guidelines for the tool.

- Some commenters suggest innovative ideas for monetizing the scraping process, such as selling data to AI companies.

- Overall, there is a mix of interest in the tool's capabilities and caution regarding its implications for web scraping practices.

21 comments

By @jackienotchan - 8 months

You have LinkedIn and Twitter examples, where you're very likely violating their TOS as they prohibit any scraping.

I also assume you don't check the robots.txt of websites?

I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.

- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926

- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182
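
For readers unfamiliar with the robots.txt check raised in this comment, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not taken from FetchFox and says nothing about what the extension actually does. The robots.txt content, the `ExampleBot` user agent, and the `is_allowed` helper are all hypothetical, and the rules are inlined so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no network request is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/jobs", "https://example.com/private/page"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

In a real crawler the rules would be fetched from the target site (for example with `RobotFileParser.set_url` and `read`) before any page request is made. A site's terms of service are a separate matter that robots.txt does not cover.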

By @churros_train - 8 months

I am really curious: how do people actually evaluate scrapers? There are so many options and I am dizzy just trying to read them...

Also wondering how the OP thinks about comparing themselves and standing out in the marketplace of seemingly a bazillion options.

By @smcleod - 8 months

Out of interest - why is it called FetchFox - but it doesn't work on Firefox?

By @konata390 - 8 months

A bit off-topic, but why do people still use the GIF format? The "example-hn.gif" is 8.5MB, for 45 seconds of pretty stuttery video. I converted it to a similar looking VP9 video, and it was only 1.5MB, and with AV1 I got it down to 550KB with basically lossless quality.

By @CalRobert - 8 months

Of all the names to give something you built for Chrome instead of Firefox…

By @bearjaws - 8 months

This is really cool, I'd just go ahead and double check that max spend limit on your OAI key before going to bed :)

This kind of stuff gets expensive fast.

By @benrules2 - 8 months

This is a really cool tool. Have been playing with similar scraping capabilities, so appreciate you sharing the source code as well. People who are saying "loads of scraping tools already exist" have likely not suffered through the current state of the art too, as heuristic based approaches absolutely pale in comparison to what an LLM can extract.

By @trog - 8 months

Would love something like this that allows users to trivially turn sites like Facebook/Twitter into RSS feeds. I'm sure this kinda thing is a useful stepping stone to doing that.

By @echelon - 8 months

Maybe instead of selling scraping to end users, invert the problem and sell data to AI companies:

1. Pay users to install a browser extension that scrapes social media content they browse. Or ask them in exchange for a service, e.g. "remember everything I browse and make it searchable", etc.

2. Ship the data you scrape to your servers.

3. Sell training data to companies at a discount.

This gets past the new rate limiters and blocks that Reddit and others have installed.

By @mkroman - 8 months

I've wanted to make something like this myself, so thanks and good job!

How does this work? Does it rely on GPT to extract the data or does it actually generate a bunch of selectors? If it's the former, then the results aren't reliable since it can just hallucinate whole results or even just parts.
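
One common way to reduce the hallucination risk described in this comment is to check that each extracted value actually occurs in the page text. The sketch below illustrates that idea only; it is not FetchFox's implementation, and the `unsupported_fields` helper and the sample data are hypothetical. It only works for values copied verbatim from the page, not for summaries or reformatted values.

```python
def unsupported_fields(page_text: str, extracted: dict[str, str]) -> list[str]:
    """Return the names of fields whose values do not occur verbatim in page_text.

    Comparison ignores case and collapses runs of whitespace.
    """
    normalized_page = " ".join(page_text.split()).lower()
    missing = []
    for field, value in extracted.items():
        normalized_value = " ".join(value.split()).lower()
        # Empty values and values absent from the page are both flagged.
        if not normalized_value or normalized_value not in normalized_page:
            missing.append(field)
    return missing


if __name__ == "__main__":
    # Hypothetical page text and a hypothetical LLM extraction result.
    page = "Jane Doe\nSenior Engineer at Acme Corp"
    result = {"name": "Jane Doe", "title": "Senior Engineer", "company": "Globex"}
    print(unsupported_fields(page, result))  # → ['company']
```

Flagged fields can then be dropped or re-requested rather than trusted.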

By @HenryBemis - 8 months

Hmmm... assume one can harvest the financial news from some big website, then correlate it with historical market movements (for example, when articles stating X, Y, and Z are posted, the gold price drops within 24 hours, at an 80%-90% rate). That could be used to predict and trade 'regularly'.

By @smcin - 8 months

Interesting. How long did it take to figure out how to do this with ChatGPT?

> "By scraping raw text with AI, FetchFox lets you circumvent anti-scraping measures on sites like LinkedIn and Facebook. Even the complicated HTML structures are possible to parse with FetchFox."

By @ratata - 8 months

Nice! Any plans for Firefox support?

By @aayothered - 8 months

This is really cool, I don't know when I will need something like this, but I am sure the day will come! I hope the tech and the policies that govern scrapers are in place at that time! Best of luck

By @SomewhatLikely - 8 months

Can I recommend you provide some cost estimates next to the examples for using your own key? I tried a few custom extractions and then checked my usage dashboard and it was already over $2.

By @starfallg - 8 months

This is interesting. How much difference is there (in cost and quality) between this approach and taking an image capture of the page and then sending it off to a multimodal LLM?

By @platorob - 8 months

Useful! Keep up the good work!

By @hydrogenpolo - 8 months

Sweet, I'll take a look!

By @platorob - 8 months

Gm gm! This is a good tool. Tried it out to scrape for email from various professional sites. Thumbs up

By @natch - 8 months

Forgetting about law and copyright and robots.txt because hey most scrapers have to rely on fair use anyway, you even forget about basic consideration for the sites you hammer.

That sounds like a scold, but it's meant as an observation.

Now I will embed some implied scolding in what's to follow, but feel free to ignore that part; I wouldn't expect you to care.

But if you lack even a shred of human decency or morals, perhaps there's one more reason you might consider for spreading out your requests across sites and time, instead of absolutely pushing the abuse to the hilt, and that is that if you change what you are doing, and take a slower, more gentle approach, and I'm appealing to your selfishness here because clearly that is the only viable way into your head, then you will be less likely to cause countermeasures, and more likely to succeed.
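
The "spread out your requests across sites and time" advice in this comment can be sketched as a simple scheduling step. This is an illustration under assumptions, not a description of FetchFox: the `plan_requests` helper, the example URLs, and the 5-second delay are all hypothetical. The function only computes start offsets; an actual scraper would sleep until each offset before sending the request.

```python
from urllib.parse import urlsplit


def plan_requests(urls: list[str], per_host_delay: float) -> list[tuple[float, str]]:
    """Return (start_offset_seconds, url) pairs, sorted by start offset.

    Requests to the same host are spaced at least per_host_delay seconds
    apart, while requests to different hosts may start at the same time.
    """
    next_free: dict[str, float] = {}  # host -> earliest allowed start offset
    plan = []
    for url in urls:
        host = urlsplit(url).netloc
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + per_host_delay
    # sorted() is stable, so URLs with equal offsets keep their input order.
    return sorted(plan, key=lambda item: item[0])


if __name__ == "__main__":
    urls = [
        "https://a.example/1",
        "https://a.example/2",
        "https://b.example/1",
    ]
    for start, url in plan_requests(urls, per_host_delay=5.0):
        print(f"t+{start:>4.1f}s  {url}")
```

A fixed per-host delay is the simplest policy; honoring a site's own rate-limit signals would be a stricter alternative.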
