Poisoning LLMs And Other Nefarious Bots With Iocaine and Nginx
I've been aware that there are various efforts to poison or sabotage the "AI" bots that aggressively scrape and harvest web content to feed large language model datasets. As I was checking my feed today, I came across iocaine and decided to finally give one of these tools a shot on my own website. I already use the Nginx Ultimate Bad Bot Blocker (NUBBB) to block malicious IP addresses and user agents by returning an empty response.
NUBBB functions by mapping IP addresses, user agents, and referrers to numerical values, which are then handled using if statements. To add the ability to redirect matching bots to iocaine, I simply built on that setup.
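Conceptually, it works like the rough sketch below: the user agent goes through an nginx map that assigns a value to $bad_bot, and blockbots.conf then acts on that value with if statements. This is only an illustration of the idea, not NUBBB's actual globalblacklist.conf, and the example patterns are made up.

# Rough sketch of how NUBBB's mapping works; the real map in
# globalblacklist.conf is much larger and is built from the bots.d includes.
map $http_user_agent $bad_bot {
    default             0;        # unknown agents pass through untouched
    "~*SomeGoodCrawler" 0;        # whitelisted
    "~*SomeBadBot"      3;        # 3 = blocked with an empty 444 response
    "~*GPTBot"          poison;   # my addition: handed to iocaine instead
}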
Before proceeding, be prepared to back up, restore, or repair your nginx configuration in case something breaks, and heed this warning about iocaine:
This is deliberately malicious software, intended to cause harm. Do not deploy if you aren’t fully comfortable with what you are doing. LLM scrapers are relentless and brutal, they will place additional burden on your server, even if you only serve static content. With iocaine, there’s going to be increased computing power used. It’s highly recommended to implement rate limits at the reverse proxy level, such as with the caddy-ratelimit plugin, if using Caddy.
Entrapment is done by the reverse proxy. Anything that ends up being served by iocaine will be trapped there: there are no outgoing links. Be careful what you route towards it.
Try this at your own risk.
Setting Up Iocaine
Since I already use Docker for a couple things on my server, I just went for running iocaine via a container as well. If you would rather just run a binary directly, you can follow the applicable deployment and configuration instructions.
First, I made a folder to store the compose file and data directory and downloaded the compose.yml to it.
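In shell terms, that step looked something like this; the paths are mine and the download URL is a placeholder, so fetch compose.yml from wherever the iocaine documentation points.

# Illustrative only: adjust paths, and grab compose.yml from the iocaine docs/repo.
mkdir -p ~/iocaine
cd ~/iocaine
curl -LO https://example.org/path/to/compose.yml   # placeholder URL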
I made a couple minor adjustments - specifying a network name and changing container-volume to data - and created the data directory that the container will access. Inside it, I put the sample wordlist.txt and training-text.txt as placeholders.
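For reference, the shape of my adjusted compose.yml is roughly the following. Treat it as a sketch rather than a drop-in file: the image reference, container-side mount point, and port publishing are assumptions, so keep whatever the upstream compose.yml ships and only mirror the network and volume changes described above.

# Sketch only - image path, mount point, and port mapping are assumptions.
services:
  iocaine:
    image: git.madhouse-project.org/algernon/iocaine:latest   # assumed image reference
    restart: always
    ports:
      - "127.0.0.1:42069:42069"   # published locally so nginx can proxy to it
    volumes:
      - ./data:/data              # renamed from the upstream container-volume
    networks:
      - iocaine

networks:
  iocaine:
    name: iocaine                 # the explicit network name I added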
Finally, running docker compose up -d started the container, which would read from ./data/wordlist.txt and ./data/training-text.txt, and serve on port 42069. Running curl http://localhost:42069 would show me a page with some nonsense text and links for a bot to attempt to crawl through.
Here is my config if you'd like to use it.
Configuring NUBBB
I first created a list of user agents to target at /etc/nginx/bots.d/poison.conf:
"~*(?:\b)GPTBot(?:\b)" poison;
"~*(?:\b)Amazonbot(?:\b)" poison;
"~*(?:\b)Claude-Web(?:\b)" poison;
"~*(?:\b)AnotherBadUserAgent(?:\b)" poison;
NUBBB typically uses numbers for mapping, but I just used poison to make it stand out a bit more for me.
After that, I edited /etc/nginx/conf.d/globalblacklist.conf and searched for this line:
include /etc/nginx/bots.d/blacklist-user-agents.conf;
Then, I inserted the following line above it so that the poison entries are matched first:
include /etc/nginx/bots.d/poison.conf;
globalblacklist.conf is the config that sources the different lists of user agents to block.
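After the edit, that section of globalblacklist.conf reads (excerpt):

# Excerpt: poison.conf comes before the stock blacklist include.
include /etc/nginx/bots.d/poison.conf;
include /etc/nginx/bots.d/blacklist-user-agents.conf;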
In /etc/nginx/bots.d/blockbots.conf, there is the following if statement:
if ($bad_bot = '3') {
    return 444;
}
I added the following above it so that the poison check is evaluated first:
if ($bad_bot = 'poison') {
    rewrite ^ /@AIpoison$request_uri;
}

location /@AIpoison {
    proxy_set_header Host $host;
    proxy_pass http://localhost:42069;
}
If a user agent is mapped to the value poison, the request will be rewritten to /@AIpoison, which is proxied to port 42069 where iocaine is being served.
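The warning above recommends rate limiting at the reverse proxy level; the nginx counterpart to the Caddy plugin it mentions is the built-in limit_req module. Here is a minimal sketch, with an arbitrary zone name, size, and rate rather than values from my actual setup:

# In the http context (e.g. nginx.conf): define a shared zone keyed on client IP.
# The zone name, size, and rate here are arbitrary examples.
limit_req_zone $binary_remote_addr zone=aipoison:10m rate=10r/s;

# Then apply it in the location that proxies to iocaine:
location /@AIpoison {
    limit_req zone=aipoison burst=20 nodelay;
    proxy_set_header Host $host;
    proxy_pass http://localhost:42069;
}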
Verify Vhost Config
It's worth double-checking any nginx vhost configurations to ensure that they properly source the NUBBB configs. I use certbot to issue TLS certificates and update my nginx vhosts to include them, and the NUBBB includes did not end up in the right place in those certbot-modified vhosts, so the blocking did not work until I fixed it.
For a vhost to utilize the blocking configurations, it needs the following lines inside its server block:
include /etc/nginx/bots.d/ddos.conf;
include /etc/nginx/bots.d/blockbots.conf;
When issuing a certificate for a vhost, certbot will update the config file to give it two server blocks; the first one has all the actual configuration options, while the second one simply handles redirecting requests on port 80 to port 443 with HTTPS.
NUBBB would place the include lines in the second server block instead of the first one, which meant HTTPS connections were not subjected to the blocking configs. To correct that, I moved those two include lines into the first server block, the one containing all the main options.
server {
    NUBBB includes; # They should be here...
    location / {
    }
    certificates;
    the actual shit;
}

server {
    listen on 80 and redirect to 443;
    NUBBB includes; # ...Not here!
}
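In concrete terms, a certbot-managed vhost with the includes in the right place looks roughly like this; the domain, certificate paths, and redirect details are illustrative, not copied from my config:

server {
    server_name example.com;
    listen 443 ssl;

    # NUBBB includes go here, in the block that actually serves HTTPS traffic.
    include /etc/nginx/bots.d/ddos.conf;
    include /etc/nginx/bots.d/blockbots.conf;

    location / {
        # ...site configuration...
    }

    # Certificate lines as added by certbot (paths illustrative).
    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
}

server {
    listen 80;
    server_name example.com;
    return 301 https://$host$request_uri;   # plain HTTP just redirects; no includes needed
}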
Verifying That It Works
After running nginx -t to test my configuration for errors and nginx -s reload to reload nginx, I verified that my setup was working with the following commands:
curl -sLA "GPTBot" molytov.xyz
This should return an iocaine page rather than the main website.
curl -sLA "ZumBot" https://molytov.xyz
Assuming this user agent is in the general blocklist but not the poison list, this should yield an empty output.
curl -sLA "curl/7.88.1" molytov.xyz
This should just return the regular website.
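To re-run these checks quickly, a small loop works too; the domain is mine, so substitute your own, and the head truncation is just to keep the output readable:

#!/bin/sh
# Fire the three test user agents at the site and show the start of each response.
for ua in "GPTBot" "ZumBot" "curl/7.88.1"; do
    echo "== $ua =="
    curl -sLA "$ua" https://molytov.xyz | head -n 5
    echo
done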