September 3rd, 2024

Show HN: I'm making an AI scraper called FetchFox

FetchFox is an AI-powered Chrome extension that lets users scrape data from websites by describing what they need in plain English. It can bypass anti-scraping measures and exports results in CSV format.


FetchFox is an AI-powered web scraping Chrome extension designed to extract data from various websites by allowing users to specify their data needs in plain English. According to its description, the tool scrapes raw text with AI, which lets it circumvent anti-scraping measures on platforms like LinkedIn and Facebook and parse complicated HTML structures. Users can install the extension from the Chrome Web Store, configure scraping jobs by inputting their data requests, and then scrape data from desired web pages. The results can be downloaded in CSV format for further use. Examples of data extraction include retrieving job titles, summarizing work experience, and gathering project details from platforms like LinkedIn, GitHub, and Twitter. FetchFox aims to streamline the data collection process for tasks such as lead generation, research, and market analysis.

- FetchFox is an AI web scraper available as a Chrome extension.

- Users can extract data by describing their needs in plain English.

- The tool can bypass anti-scraping measures on major platforms.

- Data can be downloaded in CSV format for easy access.

- FetchFox is useful for lead generation, research, and market analysis.

AI: What people are saying
The comments on FetchFox highlight various concerns and suggestions regarding the AI-powered scraping tool.
  • Many users question the name "FetchFox" since it currently only supports Chrome, not Firefox.
  • There are concerns about potential violations of terms of service for sites like LinkedIn and Twitter, as well as the ethical implications of scraping.
  • Users express a desire for clearer cost estimates and usage guidelines for the tool.
  • Some commenters suggest innovative ideas for monetizing the scraping process, such as selling data to AI companies.
  • Overall, there is a mix of interest in the tool's capabilities and caution regarding its implications for web scraping practices.
21 comments
By @jackienotchan - 5 months
You have LinkedIn and Twitter examples, where you're very likely violating their TOS as they prohibit any scraping.

I also assume you don't check the robots.txt of websites?

I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.

related:

- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926

- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182
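
On the robots.txt question above, a minimal sketch of what such a check could look like, using Python's standard `urllib.robotparser`. This is not FetchFox's code. The robots.txt content, the `ExampleBot` user agent, and the helper names `build_parser` and `is_allowed` are all hypothetical, and a real scraper would fetch the file from the target site rather than embed it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, embedded so the sketch runs offline.
# A real scraper would download https://<site>/robots.txt instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/public/page", "https://example.com/private/data"):
        print(url, "allowed" if is_allowed(rules, "ExampleBot", url) else "disallowed")
    print("crawl delay:", rules.crawl_delay("ExampleBot"))
```

Note that robots.txt is separate from a site's terms of service, so passing this check does not address the TOS concern raised in the same comment.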

By @churros_train - 5 months
I am really curious: how do people actually evaluate scrapers? There are so many options and I am dizzy just trying to read them...

Also wondering how the OP thinks about comparing themselves and standing out in a marketplace of seemingly a bazillion options.

By @smcleod - 5 months
Out of interest - why is it called FetchFox - but it doesn't work on Firefox?
By @konata390 - 5 months
A bit off-topic, but why do people still use the GIF format? The "example-hn.gif" is 8.5MB, for 45 seconds of pretty stuttery video. I converted it to a similar looking VP9 video, and it was only 1.5MB, and with AV1 I got it down to 550KB with basically lossless quality.
By @CalRobert - 5 months
Of all the names to give something you built for chrome instead of Firefox…
By @bearjaws - 5 months
This is really cool, I'd just go ahead and double check that max spend limit on your OAI key before going to bed :)

This kind of stuff gets expensive fast.

By @benrules2 - 5 months
This is a really cool tool. Have been playing with similar scraping capabilities, so appreciate you sharing the source code as well. People who are saying "loads of scraping tools already exist" have likely not suffered through the current state of the art, as heuristic-based approaches absolutely pale in comparison to what an LLM can extract.
By @trog - 5 months
Would love something like this that allows users to trivially turn sites like Facebook/Twitter into RSS feeds. I'm sure this kinda thing is a useful stepping stone to doing that.
By @echelon - 5 months
Maybe instead of selling scraping to end users, invert the problem and sell data to AI companies:

1. Pay users to install a browser extension that scrapes social media content they browse. Or ask them in exchange for a service, e.g. "remember everything I browse and make it searchable", etc.

2. Ship the data you scrape to your servers.

3. Sell training data to companies at a discount.

This gets past the new rate limiters and blocks that Reddit and others have installed.

By @mkroman - 4 months
I've wanted to make something like this myself, so thanks and good job!

How does this work? Does it rely on GPT to extract the data or does it actually generate a bunch of selectors? If it's the former, then the results aren't reliable since it can just hallucinate whole results or even just parts.
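
One way to reduce the hallucination risk raised above, whichever extraction approach the tool uses, is to check that each extracted value actually appears in the page text. The sketch below is an illustration under that assumption, not a description of how FetchFox works; `normalize` and `ungrounded_fields` are hypothetical helper names, and the sample record is made up.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so formatting differences do not matter."""
    return " ".join(text.split()).casefold()


def ungrounded_fields(record: dict[str, str], page_text: str) -> list[str]:
    """Return the field names whose values do not occur verbatim in page_text."""
    haystack = normalize(page_text)
    return [name for name, value in record.items() if normalize(value) not in haystack]


if __name__ == "__main__":
    page = "Jane Doe\n   Senior Engineer at Example Corp"
    extracted = {"name": "Jane Doe", "title": "senior engineer", "company": "Acme"}
    # Fields listed here were not found in the page and deserve a second look.
    print(ungrounded_fields(extracted, page))
```

A check like this only suits fields copied verbatim from the page. Summaries or rewritten values would be flagged even when they are accurate.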

By @HenryBemis - 5 months
Hmmm... assume one can harvest the financial news from some big website and then correlate it with historic market movements (e.g., when articles stating X, Y, and Z are posted, the gold price drops within 24h, at an 80%-90% rate). That could be used to predict and trade 'regularly'.
By @smcin - 5 months
Interesting. How long did it take to figure out how to do this with ChatGPT?

> "By scraping raw text with AI, FetchFox lets you circumvent anti-scraping measures on sites like LinkedIn and Facebook. Even the complicated HTML structures are possible to parse with FetchFox."

By @ratata - 5 months
Nice! Any plans for Firefox support?
By @aayothered - 5 months
This is really cool, I don't know when I will need something like this, but I am sure the day will come! I hope the tech and the policies that govern scrapers are in place at that time! Best of luck
By @SomewhatLikely - 5 months
Can I recommend you provide some cost estimates next to the examples for using your own key? I tried a few custom extractions and then checked my usage dashboard and it was already over $2.
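
A rough back-of-the-envelope estimate of the kind requested above could look like the sketch below. Every number is an assumption: the per-token prices are placeholders to be replaced with the provider's current rates, and the 4-characters-per-token ratio is only a common rule of thumb for English text. The function name `estimate_cost_usd` is hypothetical.

```python
def estimate_cost_usd(
    pages: int,
    chars_per_page: int,
    usd_per_million_input_tokens: float,
    output_tokens_per_page: int = 0,
    usd_per_million_output_tokens: float = 0.0,
    chars_per_token: float = 4.0,  # rule-of-thumb assumption, not an exact figure
) -> float:
    """Estimate the LLM cost of sending `pages` pages of raw text for extraction."""
    input_tokens = pages * chars_per_page / chars_per_token
    output_tokens = pages * output_tokens_per_page
    total = (
        input_tokens * usd_per_million_input_tokens
        + output_tokens * usd_per_million_output_tokens
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder prices; check the provider's pricing page for real values.
    cost = estimate_cost_usd(
        pages=10,
        chars_per_page=400_000,
        usd_per_million_input_tokens=1.0,
        output_tokens_per_page=1_000,
        usd_per_million_output_tokens=2.0,
    )
    print(f"estimated cost: ${cost:.2f}")
```

Input text usually dominates for scraping, since whole pages go in and only a few fields come out, which is why large pages add up quickly.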
By @starfallg - 5 months
This is interesting. How much difference is there (in cost, quality) between this approach and taking an image capture of the page and then sending it off to a multimodal LLM?
By @platorob - 5 months
Useful! Keep up the good work!
By @hydrogenpolo - 5 months
Sweet, I'll take a look!
By @platorob - 5 months
Gm gm! This is a good tool. Tried it out to scrape for email from various professional sites. Thumbs up
By @natch - 5 months
Forgetting about law and copyright and robots.txt because hey most scrapers have to rely on fair use anyway, you even forget about basic consideration for the sites you hammer.

That sounds like a scold, but it's meant as an observation.

Now I will embed some implied scolding in what's to follow, but feel free to ignore that part; I wouldn't expect you to care.

But if you lack even a shred of human decency or morals, perhaps there's one more reason you might consider for spreading out your requests across sites and time, instead of absolutely pushing the abuse to the hilt, and that is that if you change what you are doing, and take a slower, more gentle approach, and I'm appealing to your selfishness here because clearly that is the only viable way into your head, then you will be less likely to cause countermeasures, and more likely to succeed.
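
The "spread out your requests across sites and time" advice above can be sketched as a small planner: interleave URLs so consecutive requests go to different hosts, and enforce a minimum gap between hits on the same host. This is an illustrative sketch only; `interleave_by_host`, `schedule`, and the 10-second gap are assumptions, and the code only plans start times rather than sending any requests.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def interleave_by_host(urls: list[str]) -> list[str]:
    """Reorder URLs round-robin across hosts so no single site gets a burst."""
    queues: dict[str, deque[str]] = defaultdict(deque)
    for url in urls:
        queues[urlparse(url).netloc].append(url)
    ordered: list[str] = []
    while queues:
        for host in list(queues):  # copy the keys so hosts can be removed safely
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered


def schedule(urls: list[str], min_gap_seconds: float) -> list[tuple[float, str]]:
    """Plan (start_offset_seconds, url) pairs with at least min_gap_seconds per host."""
    next_free: dict[str, float] = {}
    clock = 0.0
    plan: list[tuple[float, str]] = []
    for url in interleave_by_host(urls):
        host = urlparse(url).netloc
        start = max(clock, next_free.get(host, 0.0))
        plan.append((start, url))
        next_free[host] = start + min_gap_seconds
        clock = start
    return plan


if __name__ == "__main__":
    targets = ["https://a.example/1", "https://a.example/2", "https://b.example/1"]
    for offset, url in schedule(targets, min_gap_seconds=10):
        print(f"t+{offset:>5.1f}s  {url}")
```

A real crawler would also want to honor any crawl delay a site publishes and back off on error responses, which this sketch leaves out.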