September 24th, 2024

Poisoning AI Scrapers

Tim McCormack is combating AI scrapers by serving altered blog posts filled with nonsensical text generated by a Markov chain algorithm, hoping to inspire others to push back against AI companies' unconsented use of online content.

Tim McCormack discusses his initiative to combat AI scrapers by serving altered versions of his blog posts, which he refers to as "poisoned" content. He believes that a robots.txt file alone is insufficient, since scrapers have already accessed his work without consent. His approach is to detect AI scrapers and deliver a modified version of each post containing nonsensical text, thereby potentially hindering the training of language models. McCormack uses a Markov-chain technique known as Dissociated Press to generate this "garbage" text, which superficially resembles coherent writing but carries no meaning.

He has developed a custom tool in Rust to automate the process, regenerating the altered posts only when the original content changes. The implementation includes a mod_rewrite rule that serves the altered content specifically to identified AI scrapers while blocking direct access to the poisoned files. Although he acknowledges that his efforts alone will not significantly impact AI training, he hopes to inspire others to adopt similar strategies. McCormack's project serves as both a technical exercise and a form of protest against the unconsented use of online content by AI companies.
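The Dissociated Press idea McCormack describes is a word-level Markov chain: record which words follow each pair of words in the original post, then walk that table to emit text that is locally plausible but globally meaningless. His Rust tool isn't reproduced in the post, so the sketch below only illustrates the technique; the function name, chain order, and deterministic pseudo-random walk are all assumptions made here:

```rust
use std::collections::HashMap;

/// Emit `len` words of Dissociated Press-style output from `source`:
/// superficially fluent, but semantically garbage. Illustrative sketch only.
fn dissociate(source: &str, len: usize, seed: u64) -> String {
    let words: Vec<&str> = source.split_whitespace().collect();
    if words.len() < 3 {
        return source.to_string();
    }

    // Order-2 chain: map each pair of consecutive words to its possible successors.
    let mut chain: HashMap<(&str, &str), Vec<&str>> = HashMap::new();
    for w in words.windows(3) {
        chain.entry((w[0], w[1])).or_default().push(w[2]);
    }

    // Deterministic xorshift walk, so the poisoned text stays stable until the
    // source post changes (the regeneration trigger McCormack describes).
    let mut state = seed | 1;
    let mut pick = move |n: usize| {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        (state as usize) % n
    };

    let (mut a, mut b) = (words[0], words[1]);
    let mut out = vec![a.to_string(), b.to_string()];
    for _ in 2..len {
        let Some(successors) = chain.get(&(a, b)) else { break };
        let next = successors[pick(successors.len())];
        out.push(next.to_string());
        (a, b) = (b, next);
    }
    out.join(" ")
}
```

The serving side he describes is an Apache mod_rewrite rule. The post's exact rule isn't quoted here, so the snippet below is only a guess at its shape; the user-agent list and file layout are placeholders:

```apache
RewriteEngine On

# Hand known AI crawlers the pre-generated poisoned copy of each post.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider) [NC]
RewriteRule ^posts/(.+)\.html$ /poisoned/$1.html [L]

# Keep everyone else from fetching the poisoned files directly.
RewriteCond %{HTTP_USER_AGENT} !(GPTBot|CCBot|ClaudeBot|Bytespider) [NC]
RewriteRule ^poisoned/ - [F]
```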

- Tim McCormack is serving altered blog posts to deter AI scrapers.

- The modified content is generated using a Markov chain algorithm to create nonsensical text.

- A custom tool in Rust was developed to automate the generation of this "poisoned" content.

- The approach aims to inspire others to take similar actions against AI scraping.

- McCormack's initiative reflects a broader concern about the unconsented use of online content by AI companies.

2 comments
By @jsheard - 4 months
> "Google-Extended" is their scraper

FYI, Google-Extended isn't a dedicated scraper; you'll never see requests coming from that user agent, so that rewrite rule won't do anything. When GoogleBot parses robots.txt it looks for Google-Extended, and those rules determine whether the data scraped by GoogleBot can be used for training. Just throw this robots.txt on your site in addition to those rewrite rules to cover all bases.

https://raw.githubusercontent.com/ai-robots-txt/ai.robots.tx...
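For illustration, entries in a robots.txt like the one linked above take this form; the Google-Extended token is consulted by GoogleBot when deciding whether scraped pages may be used for training, and the other user agents shown are just common examples, not the full list:

```
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```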