Poisoning AI Scrapers
Tim McCormack is combating AI scrapers by serving altered blog posts filled with nonsensical text generated by a Markov chain, hoping to inspire others to push back against AI companies' unconsented use of content.
Tim McCormack discusses his initiative to combat AI scrapers by serving altered versions of his blog posts, which he calls "poisoned" content. He argues that a robots.txt file alone is insufficient, since scrapers have already accessed his work without consent. His approach detects AI scrapers and delivers modified posts containing nonsensical text, potentially hindering the training of language models. McCormack uses a Markov chain technique known as Dissociated Press to generate this "garbage" text, which superficially resembles coherent writing but carries no meaning. He built a custom tool in Rust to automate the process, regenerating the altered posts only when the original content changes. A mod_rewrite rule serves the altered content specifically to identified AI scrapers while blocking direct access to the poisoned files. Although he acknowledges that his efforts alone will not significantly impact AI training, he hopes to inspire others to adopt similar strategies. The project is both a technical exercise and a protest against the unconsented use of online content by AI companies.
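The generation step can be sketched as a tiny order-1 Markov chain in Rust, the language McCormack used for his tool. Everything here (function names, the order-1 state, the xorshift PRNG used to avoid external dependencies) is an illustrative assumption, not his actual implementation:

```rust
use std::collections::HashMap;

// Hypothetical Dissociated Press sketch: map each word to the words that
// follow it in the source text, then walk the table to emit garbage text
// that superficially resembles the original.
fn build_chain(text: &str) -> HashMap<&str, Vec<&str>> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut chain: HashMap<&str, Vec<&str>> = HashMap::new();
    for pair in words.windows(2) {
        chain.entry(pair[0]).or_default().push(pair[1]);
    }
    chain
}

fn generate(chain: &HashMap<&str, Vec<&str>>, start: &str, len: usize, seed: u64) -> String {
    let mut out = vec![start.to_string()];
    let mut cur = start;
    let mut state = seed;
    for _ in 1..len {
        // Stop early if the current word never had a successor.
        let Some(followers) = chain.get(cur) else { break };
        // Cheap xorshift64 PRNG so the sketch needs no crates.
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        cur = followers[(state as usize) % followers.len()];
        out.push(cur.to_string());
    }
    out.join(" ")
}
```

Because every emitted word genuinely followed its predecessor somewhere in the source, the output is locally plausible but globally meaningless, which is the point of the poisoning.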
- Tim McCormack is serving altered blog posts to deter AI scrapers.
- The modified content is generated using a Markov chain algorithm to create nonsensical text.
- A custom tool in Rust was developed to automate the generation of this "poisoned" content.
- The approach aims to inspire others to take similar actions against AI scraping.
- McCormack's initiative reflects a broader concern about the unconsented use of online content by AI companies.
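The serving side described above might look like the following mod_rewrite sketch. The user-agent names and paths here are illustrative assumptions, not McCormack's actual configuration:

```apache
# Hypothetical sketch of the serving logic in Apache mod_rewrite.
RewriteEngine On

# Block direct requests for the poisoned copies from everyone else.
RewriteCond %{HTTP_USER_AGENT} !(GPTBot|CCBot|ClaudeBot|Bytespider) [NC]
RewriteRule ^poisoned/ - [F]

# Known AI scraper user agents get the poisoned copy of each post.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider) [NC]
RewriteRule ^posts/(.+)$ /poisoned/$1 [L]
```

Keying on the User-Agent header only catches scrapers that identify themselves honestly; anything spoofing a browser agent would slip through.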
Related
I Received an AI Email
A blogger, Tim Hårek, received an AI-generated email from Raymond promoting Wisp CMS. Tim found the lack of personalization concerning, leading him to question the ethics of AI-generated mass emails.
"Copyright traps" could tell writers if an AI has scraped their work
Researchers at Imperial College London developed "copyright traps" to help content creators detect unauthorized use of their work in AI training datasets by embedding hidden text. This method faces challenges but offers potential solutions.
Anthropic is scraping websites so fast it's causing problems
Anthropic faces criticism for aggressive web scraping while training its Claude model, causing disruptions to websites like iFixit.com and Freelancer.com and raising ethical concerns about data usage and content creator rights.
Websites Are Blocking the Wrong AI Scrapers
Outdated robots.txt instructions are causing confusion, blocking old AI scrapers while allowing ClaudeBot to scrape freely. Many sites haven't updated their blocklists, complicating management for website owners.
Some Suggestions to Improve Robots.txt
Needham and O'Hanlon's paper suggests improving the robots.txt protocol for generative AI, highlighting its potential and risks. The BBC opposes unauthorized content scraping, advocating for structured agreements with tech companies.
FYI: Google-Extended isn't a dedicated scraper; you'll never see requests from that user agent, so that rewrite rule won't do anything. When Googlebot parses robots.txt, it looks for Google-Extended rules and uses them to decide whether the data scraped by Googlebot can be used for training. Just throw this robots.txt on your site in addition to those rewrite rules to cover all bases.
https://raw.githubusercontent.com/ai-robots-txt/ai.robots.tx...
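As a minimal sketch, a robots.txt along those lines would opt out of Google's training use alongside blocking dedicated crawlers. The agent list below is a small illustrative subset, not the full ai.robots.txt list:

```
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Note that Google-Extended is a robots.txt control token only: the Disallow applies to training use of Googlebot's crawl, not to search indexing.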