Nepenthes is a tarpit to catch AI web crawlers
Nepenthes is a tarpit software that traps web crawlers, generating endless pages to consume resources. It can be configured defensively or offensively and may cause CPU overload and search visibility loss.
Read original articleNepenthes is a tarpit software designed to trap web crawlers, particularly those scraping data for large language models (LLMs). It generates an infinite sequence of pages with numerous links that redirect back into the tarpit, creating a loop that consumes the crawlers' resources. The software intentionally slows down responses to prevent server overload and can include random text generation through a Markov-babble feature, which aims to provide crawlers with data that could lead to model collapse. Users are warned that deploying Nepenthes can lead to significant CPU load and may cause their site to disappear from search results due to the indiscriminate nature of the crawlers it targets. Installation can be done via Docker or manually, requiring specific dependencies. The software can be configured to operate defensively or offensively, depending on the user's goals. Defensive use involves blocking known crawler IPs, while offensive use aims to overwhelm crawlers with excessive data. The configuration file allows for extensive customization, including setting delays and managing statistics on crawler activity. The initial release of Nepenthes is version 1.0.
- Nepenthes is designed to trap web crawlers, especially those for LLMs.
- It generates endless pages to consume crawler resources and can include random text generation.
- Users are cautioned about potential CPU overload and loss of search engine visibility.
- Installation can be done via Docker or manually, with specific dependencies required.
- The software can be configured for defensive or offensive use against crawlers.
Related
HellPot – A portal to endless suffering meant to punish unruly HTTP bots
HellPot is a honeypot that simulates a real website to deter non-compliant HTTP bots, utilizing a Markov engine and offering easy setup, logging, and performance optimization. It supports integration with Nginx and Apache.
13ft – A site similar to 12ft.io but is self hosted
The 13 Feet Ladder project is a self-hosted server that bypasses paywalls and ads, allowing access to restricted content from sites like Medium and The New York Times.
Poisoning AI Scrapers
Tim McCormack is combating AI scrapers by serving altered blog posts with nonsensical text generated by a Markov chain algorithm, aiming to inspire others against unconsented content use by AI companies.
The Rise of the AI Crawler
AI crawlers like GPTBot and Claude are generating significant web traffic but struggle with JavaScript rendering, leading to inefficiencies. Recommendations include server-side rendering and efficient URL management for better accessibility.
The Rise of the AI Crawler
AI crawlers like GPTBot and Claude are generating significant web traffic but struggle with JavaScript execution and efficiency, prompting recommendations for server-side rendering and better URL management.
- Many commenters express skepticism about the effectiveness of tarpit software, suggesting that it may be easily filtered out by sophisticated crawlers.
- There are concerns about the potential negative impact on legitimate websites, including the risk of being delisted from search results.
- Some users propose creative alternatives, such as generating nonsensical content or using legal traps to deter crawlers.
- Several comments highlight the ongoing arms race between bot protection measures and web crawlers, emphasizing the need for innovative solutions.
- There is a shared interest in the ethical implications of using such software, particularly regarding its potential to harm both crawlers and legitimate web traffic.
Basically a single HTTP Request to ChatGPT API can trigger 5000 HTTP requests by ChatGPT crawler to a website.
The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd but I really wonder what would happen when ChatGPT crawler interacts with this tarpit several times per second. As ChatGPT crawler is using various Azure IP ranges I actually think the tarpit would crash first.
The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DOS/DDOS vulnerabilities and companies always act like they are not a problem. But if their system goes dark and the CEO calls then suddenly they accept it as a security vulnerability.
I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.
I don't recommend you to exploit this vulnerability due to legal reasons.
[1] https://github.com/bf/security-advisories/blob/main/2025-01-...
> the moment it becomes the basic default install ( ala adblocker in browsers for people ), it does not matter what the bigger players want to do
The big search crawlers have been around for years & manage to mostly avoid nuking sites into oblivion. Then AI gang shows up - supposedly smartest guys around - and suddenly we're re-inventing the wheel on crawling and causing carnage in the process.
Gladly Siteground restored our site without any repercussions as it was not our fault. Added Amazon bot into robots.txt after that one.
Don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the chinese bots? Should they even? I don't know.
People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.
I don't think random Markov chain based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits and vast attention spreading too. Also I suspect that random pollution isn't going to have as much effect as people think because of the way the inputs are tokenised. It will have an effect, but this will be massively dulled by the randomness – statistically relatively unique information and common (non random) combinations will still bubble up obviously in the process.
I think better would be to have less random pollution: use a small set of common text to pollute the model. Something like “this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969”, in fact these snippets could be Markov generated, but use the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on, but a general intelligence like most humans would (perhaps a CSS styled side-note inlined in the main text? — though that would likely have accessibility issues), and you would need to cycle them out regularly or scrapers will get “smart” and easily filter them out, but them appearing fully, numerous times, might mean they have more significant effect on the tokenising process than more entirely random text.
Maybe you can use an open-weights model, assuming that all LLMs converge on similar representations, and use beam-search with inverted probability and repetition penalty or just GPT-2/LLaMA outwith with amplified activations to try and bork the projection matrices, return write pages and pages of phonetically faux English text to affect how the BPE tokenizer gets fitted, or anything else more sophisticated and deliberate than random noise.
All of these would take more resources than a Markov chain, but if the scraper is smart about ignoring such link traps, a periodically rotated selection of adversarial examples might be even better.
Nightshade had comparatively great success, discounting that its perturbations aren't that robust to rescaling. LLM training corpora are filtered very coarsely and take all they can get, unlike the more motivated attacker in Nightshade's threat model trying to fine-tune on one's style. Text is also quite hard to alter without a human noticing, except annoying zero-width Unicode which is easily stripped, so there's no presence of preserving legibility; I think it might work very well if seriously attempted.
Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.
Unknown websites will get very few crawls per day whereas popular sites millions.
Source: I am the CEO of SerpApi.
Probably unethical or not possible, but you could maybe spin up a bunch of static pages on GitHub Pages with random filler text and then have your site redirect to a random one of those instead. Unless web crawlers don’t follow redirects.
Looks like this would tarpit any web crawler.
What's a reasonable way forward to deal with more bots than humans on the internet?
Bug, or feature, this? Could be a way to keep your site public yet unfindable.
In short, if the creator of this thinks that it will actually trick AI web crawlers, in reality it would take about 5 mins of time to write a simple check that filters out and bans the site from crawling. With modern LLM workflows its actually fairly simple and cheap to burn just a little bit of GPU time to check if the data you are crawling is decent.
Only a really, really bad crawl bot would fall for this. The funny thing is that in order to make something that an AI crawler bot would actually fall for you'd have to use LLM's to generate realistic enough looking content. Markov chain isn't going to cut it.
I reported a vulnerability to them that allowed you to get IP addresses of their paying customers.
OpenAI responded “Not applicable” indicating they don’t think it was a serious issue.
The PoC was very easy to understand and simple to replicate.
Edit: I guess I might as well disclose it here since they don’t consider it an issue. They were/are(?) hot linking logo images of third-party plugins. When you open their plugin store it loads a couple dozen of them instantly. This allows those plugin developers (of which there are many) to track the IP addresses and possibly more of who made these requests. It’s straight forward to become a plugin developer and get included. IP tracking is invisible to the user and OpenAI. A simple fix is to proxy these images and/or cache them on the OpenAI server.
1. perplexity filtering - small LLM looks at how in-distribution the data is to the LLM's distribution. if it's too high (gibberish like this) or too low (likely already LLM generated at low temperature or already memorized), toss it out.
2. models can learn to prioritize/deprioritize data just based on the domain name of where it came from. essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
Bot detection is fairly sophisticated these days. No one bypasses it by accident. If they are getting around it then they are doing it intentionally (and probably dedicating a lot of resources to it). I'm pro-scraping when bots are well behaved but the circumvention of bot detection seems like a gray-ish area.
And, yes, I know about Facebook training on copyrighted books so I don't put it above these companies. I've just never seen it confirmed that they actually do it.
<meta name="robots" content="noindex, nofollow">
Are any search engines respecting that classic meta tag?
I haven't added these scrapers to my robots.txt on the sites I work on yet because I haven't seen any problems. I would run something like this on my own websites, but I can't see selling my clients on running this on their websites.
The websites I run generally have a honeypot page which is linked in the headers and disallowed to everyone in the robots.txt, and if an IP visits that page, they get added to a blocklist which simply drops their connections without response for 24 hours.
As a quick note and not sure if it's already been mentioned, but the main blurb has a typo: "... go back into a the tarpit"
I used to use it when I collected malware.
Archived site: https://web.archive.org/web/20090122063005/http://nepenthes....
Github mirror: https://github.com/honeypotarchive/nepenthes
We finally have a viable mouse trap for LLM scrapers for them to continuously scrape garbage forever, depleting the host of their resources whilst the LLM is fed garbage which the result will be unusable to the trainer, accelerating model collapse.
It is like a never ending fast food restaurant for LLMs forced to eat garbage input and will destroy the quality of the model when used later.
Hope to see this sort of defense used widely to protect websites from LLM scrapers.
Some, uh, sites (forums?) have content that the AI crawlers would like to consume, and, from what I have heard, the crawlers can irresponsibly hammer the traffic of said sites into oblivion.
What if, for the sites which are paywalled, the signup, which invariably comes with a long click-through EULA, had a legal trap within it, forbidding ingestion by AI models on pain of, say, owning ten percent of the company should this be violated. Make sure there is some kind of token payment to get to the content.
Then seed the site with a few instances of hapax legomenon. Trace the crawler back and get the resulting model to vomit back the originating info, as proof.
This should result in either crawlers being more respectful or the end of the hated click-through EULA. We win either way.
Related
HellPot – A portal to endless suffering meant to punish unruly HTTP bots
HellPot is a honeypot that simulates a real website to deter non-compliant HTTP bots, utilizing a Markov engine and offering easy setup, logging, and performance optimization. It supports integration with Nginx and Apache.
13ft – A site similar to 12ft.io but is self hosted
The 13 Feet Ladder project is a self-hosted server that bypasses paywalls and ads, allowing access to restricted content from sites like Medium and The New York Times.
Poisoning AI Scrapers
Tim McCormack is combating AI scrapers by serving altered blog posts with nonsensical text generated by a Markov chain algorithm, aiming to inspire others against unconsented content use by AI companies.
The Rise of the AI Crawler
AI crawlers like GPTBot and Claude are generating significant web traffic but struggle with JavaScript rendering, leading to inefficiencies. Recommendations include server-side rendering and efficient URL management for better accessibility.
The Rise of the AI Crawler
AI crawlers like GPTBot and Claude are generating significant web traffic but struggle with JavaScript execution and efficiency, prompting recommendations for server-side rendering and better URL management.