Show HN: I'm making an AI scraper called FetchFox
FetchFox is an AI-powered Chrome extension that allows users to scrape data from websites by describing their needs in plain English, bypassing anti-scraping measures, and exporting results in CSV format.
FetchFox is an AI-powered web scraping Chrome extension designed to extract data from various websites by allowing users to specify their data needs in plain English. The tool can bypass anti-scraping measures on platforms like LinkedIn and Facebook, making it effective for gathering information from complex HTML structures. Users can install the extension from the Chrome Web Store, configure scraping jobs by inputting their data requests, and then scrape data from desired web pages. The results can be downloaded in CSV format for further use. Examples of data extraction include retrieving job titles, summarizing work experience, and gathering project details from platforms like LinkedIn, GitHub, and Twitter. FetchFox aims to streamline the data collection process for tasks such as lead generation, research, and market analysis.
- FetchFox is an AI web scraper available as a Chrome extension.
- Users can extract data by describing their needs in plain English.
- The tool can bypass anti-scraping measures on major platforms.
- Data can be downloaded in CSV format for easy access.
- FetchFox is useful for lead generation, research, and market analysis.
Related
Show HN: G-Scraper, a GUI Web Scraper, Written in Python
The G-Scraper project is a Python GUI web scraper with features like request support, scraping multiple URLs/elements, logins, and data saving. Find details on GitHub for usage and contribution.
Instantly Turn Any Webpage into an API
InstantAPI.ai helps developers convert webpages into APIs for data extraction and automation, using ScrapingBee for HTML fetching and OpenAI for structuring, with customizable parameters and built-in error handling.
A web scraping CLI made for AI that is idempotent
The "Scrape It Now" GitHub repository offers an efficient web scraping tool with Azure integration, supporting parallel operations, ad-blocking, dynamic content handling, and easy configuration for developers.
Launch HN: MinusX (YC S24) – AI assistant for data tools like Jupyter/Metabase
MinusX is a free Chrome extension that enhances data analysis in Jupyter and Metabase by automating interactions with AI, allowing users to explore data and ask questions. Future monetization may include subscriptions.
Surfer: Centralize all your personal data from online platforms
Surfer centralizes personal data from various online platforms by scraping and exporting it to local storage. It is available for download, with community support through Discord and a roadmap for future enhancements.
- Many users question the name "FetchFox" since it currently only supports Chrome, not Firefox.
- There are concerns about potential violations of terms of service for sites like LinkedIn and Twitter, as well as the ethical implications of scraping.
- Users express a desire for clearer cost estimates and usage guidelines for the tool.
- Some commenters suggest innovative ideas for monetizing the scraping process, such as selling data to AI companies.
- Overall, there is a mix of interest in the tool's capabilities and caution regarding its implications for web scraping practices.
I also assume you don't check the robots.txt of websites?
I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.
related:
- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926
- "Multiple AI companies bypassing web standard to scrape publisher sites" - https://news.ycombinator.com/item?id=40750182
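For readers wondering what the robots.txt check asked about above involves, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not FetchFox's code, and nothing in the thread says the extension does this. The helper name `is_allowed`, the `ExampleBot` user agent, and the sample rules are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/public", "https://example.com/private/page"):
        print(url, is_allowed(EXAMPLE_ROBOTS, "ExampleBot", url))
```

A real scraper would fetch each site's own `/robots.txt` (for example with `RobotFileParser.set_url` and `read`) and cache the result per host rather than passing the rules in as a string.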
I'm also wondering how the OP plans to compare against, and stand out from, the seemingly countless options already on the market.
This kind of stuff gets expensive fast.
1. Pay users to install a browser extension that scrapes social media content they browse. Or ask them in exchange for a service, e.g. "remember everything I browse and make it searchable", etc.
2. Ship the data you scrape to your servers.
3. Sell training data to companies at a discount.
This gets past the new rate limiters and blocks that Reddit and others have installed.
How does this work? Does it rely on GPT to extract the data or does it actually generate a bunch of selectors? If it's the former, then the results aren't reliable since it can just hallucinate whole results or even just parts.
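One partial guard against the hallucination risk raised here is to check that every value the model returns actually appears in the page text. The sketch below assumes the extraction step yields a flat dictionary of strings; `unsupported_fields` and the sample data are hypothetical, and the check only works for fields copied verbatim, not for summaries or other generated text.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_fields(record: dict[str, str], page_text: str) -> list[str]:
    """Return the names of fields whose values cannot be found in page_text."""
    haystack = normalize(page_text)
    return [
        field
        for field, value in record.items()
        if normalize(value) not in haystack
    ]


if __name__ == "__main__":
    # Hypothetical page text and model output, for illustration only.
    page = "Jane Doe\nSenior   Engineer at Acme"
    extracted = {"name": "Jane Doe", "title": "Senior Engineer", "company": "Globex"}
    print(unsupported_fields(extracted, page))
```

Fields flagged this way could be dropped or sent for manual review. A value that passes is not guaranteed correct, since the model may still attach a real string to the wrong field.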
> "By scraping raw text with AI, FetchFox lets you circumvent anti-scraping measures on sites like LinkedIn and Facebook. Even the complicated HTML structures are possible to parse with FetchFox."
That sounds like a scold, but it's meant as an observation.
Now I will embed some implied scolding in what's to follow, but feel free to ignore that part; I wouldn't expect you to care.
But if you lack even a shred of human decency or morals, perhaps there's one more reason to spread your requests across sites and time instead of pushing the abuse to the hilt. I'm appealing to your selfishness here, because clearly that is the only viable way into your head: if you take a slower, more gentle approach, you will be less likely to provoke countermeasures and more likely to succeed.
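Spreading requests across sites and time, as this comment suggests, can be sketched as a per-host throttle that enforces a minimum gap between requests to the same host. This is only an illustration: the `HostThrottle` class and the interval used are hypothetical, and the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until url's host may be contacted again; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Sleep only for whatever part of the interval has not yet elapsed.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record the time at which this request is actually released.
        self._last_request[host] = now + delay
        return delay


if __name__ == "__main__":
    throttle = HostThrottle(min_interval=2.0)  # hypothetical interval
    for url in ("https://example.com/a", "https://example.org/a", "https://example.com/b"):
        print(url, "slept", round(throttle.wait(url), 2))
        # The actual page fetch would go here.
```

Requests to different hosts pass through without delay, so interleaving URLs from several sites keeps overall throughput up while each individual site sees a slow, steady rate.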