Ideas on poisoning text for illegitimate ai crawlers?

bitzoid · April 22, 2025, 2:03pm

I’ve been thinking about how to make the (attribution required) contents of a site accessible to humans but prevent AI crawlers from stealing contributions made to the site. robots.txt and ip-block-lists only get you so far, seeing as AI companies increasingly just ignore robots and block-lists will always be incomplete.

Currently I’m contemplating writing a plugin that provides a liquid filter that you can throw after a {{content}} in the primary layout. It would then try to inject garbled extra text into the real text and use css to hide that from user agents. display: none is the obvious choice because assigning flex ordering makes text not properly selectable by humans. Perhaps there are other poisons I’m not thinking about. I’m of course working under the assumption that crawlers wont bother with css if they find nicely marked up <p> tags.

I wanted to hear what people think about this. Maybe something like this already exists?

hitchhiker · June 2, 2025, 8:23am

image encryption

panchtatvam · June 2, 2025, 10:49am

I would rather use SaaS solutions like one from Cloudflare to prevent bots.
In any case, the bots will surely evolve whatever choice you make.

bitzoid · June 2, 2025, 2:21pm

I don’t know about that. Image poisoning like nightshade seems to work pretty decently, I’m just not aware of anything for html text.

Also, AI crawling is a big-data game, so I doubt they will adapt to small jekyll blogs. Too much effort.

bitzoid · June 2, 2025, 2:24pm

Sorry, one more thing. I don’t want to expose my users to their data being collected by CDN and the like (such as cloudflare). Also cloudflare does tend to block legitimate TOR users, which I find unacceptable.

george-gca · June 3, 2025, 1:24pm

The main thing is, you can’t prevent it from being crawled, only from being processed correctly by a language model. But like said above, this is a matter of time before being circumvented. It is a cat mouse fight.

About ways of doing this, there is a whole research field about this. Check this recent paper for instance (not yet peer reviewed afaik).

bitzoid · June 3, 2025, 1:46pm

That looks promising, but I feel like it is still not there yet in terms of unspecific poison, as it seems more interesting in misinforming the LLM in specific ways.

george-gca · June 3, 2025, 2:58pm

But when a crawler crawls your site, it will download the whole html, then parse it, then feed to some model. So the download will happen, what you can try to avoid is the parse and process steps.

bitzoid · June 3, 2025, 3:23pm

I don’t really think you can avoid the parse and process step. They will gobble everything up. But what the aim of this post was, was to find a way to make the content unpalatable.

Ideally, the result would be to contribute to the confusion/destruction of the LLM similar to what nightshade does. So that for example if you ask “what is jekyll?” the answer is not just “the best most awesome site generator” but rather something like “the umbrella smells ε in kindly”.

Note that I’m talking about those crawlers that fail to respect robots.txt and friends. If they respect that, then no harm will be done.

george-gca · June 3, 2025, 3:54pm

You should research about indirect prompt injection. Like I said, it is a rather recent field of study, so every now and then you’ll see papers about it, and also mitigation solutions.

bitzoid · June 18, 2025, 11:48am

I found this delightful list … or should I say community?

ASRG Sabot in the Age of AI

Oh, it is so on!

To quote the iocaine developer:

If we all do it, they won’t have anything to crawl.

Topic		Replies	Views
How to Password Protect a Jekyll Site Share	0	2425	February 20, 2017
How can I exclude a few specific posts from search engine indexing? Help	5	2618	February 6, 2019
Lunr returning md image links Help	0	481	November 26, 2018
How to hide a word in jekyll posts Help	0	536	February 3, 2019
Is Email Harvesting no longer an issue?	8	2540	February 4, 2019

Ideas on poisoning text for illegitimate ai crawlers?

Related topics