Robots dot text (robots.txt) is a really interesting, conflicted, frequently disrespected – but useful – little file. Its intended purpose is to give me control of how bots visit my site. Depending on the bot though, my robots dot text directives might be obeyed, ignored, partially obeyed, and/or interpreted in different ways.
Interweb guides invariably point out that robots.txt is only useful with good bots – bad bots ignore it. While true, this statement misses much of the point. Bot behavior spans a spectrum – it is not just good or bad. I’ll group bots into three categories – good, bad, and nonbinary.
Good bots are beneficial to my site. The bots from the major search engines, in particular Googlebot, are examples. They index my site and make my content available to their users. Not even Googlebot, however, completely obeys my robots dot text – it ignores my crawl-delay directive.
Bad bots are flat out evil – deliberately trying to do me harm. Examples include spambots and bots that make malicious login attempts. My robots dot text directives are useless against bad bots – they simply disregard them.
Nonbinary bots neither help me nor hurt me – not deliberately anyway. Mostly, they gather my information for the purposes of their masters. They may or may not obey robots dot text, and if they obey it they may do so in different ways. Examples of nonbinary bots include …
- AhrefsBot: Collects my data which Ahrefs then sells to online marketers.
- DotBot and ezooms: Intended to mine eCommerce sites, DotBot and ezooms look for product names, images, prices, and descriptions, and republish the content on Dotmic.
- Grapeshot: Uses probabilistic algorithms to examine and analyse my content. The results are sold to advertisers.
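To see what "obeying" looks like from the bot's side, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules, the `/wp-admin/` path, the 10-second delay, and the bot name `ExampleBot` are illustrative assumptions, not this site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Crawl-delay: 10
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is needed.
parser.parse(ROBOTS_TXT.splitlines())

# A polite bot asks before each fetch ...
blocked = parser.can_fetch("ExampleBot", "/wp-admin/options.php")
allowed = parser.can_fetch("ExampleBot", "/blog/post")

# ... and may (or may not) choose to honor the requested delay.
delay = parser.crawl_delay("ExampleBot")

print(blocked, allowed, delay)
```

The parser only answers questions. Whether a bot asks them, and whether it honors the answers, is entirely up to the bot, which is why the same file can be obeyed, ignored, or partially obeyed.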
Ideally, I want to …
- Throw the doors mostly – not completely – open to good bots. So … in my robots dot text …
# All bots - please keep out of places you have no business snooping. Also, once you visit, stay away for a while. I can't miss you if you won't go away ...
- Block nonbinary bots too. I don’t hate them – their lifestyle is none of my business, and they are mining the data that I deliberately made available on my public pages. But they are using a bit of my bandwidth without providing any benefit to me in return. I have not been able to find a credible, comprehensive list of these bots, so I am blocking them one by one as I notice them in my log files. So far …
- Robots dot text does no good against bad bots, so I use Cloudflare firewall rules instead. These block those naughty bots from any place where they could cause damage. This raises an obvious question – if Cloudflare (CF) firewall rules are so effective, why bother with robots.txt at all? I want to block bots with robots.txt whenever practical, with CF firewall rules as a second tier of defense. This keeps my firewall log from becoming so cluttered with routine bot blocks that I fail to notice nefarious activity that should have my attention.
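The first two tiers of this plan (a mostly open door for everyone, a closed door for named nonbinary bots) can be sketched as a small Python script that builds a robots.txt and checks it with the standard library's `urllib.robotparser`. This is a sketch under assumptions, not the file this site actually serves: the user-agent tokens are guesses based on the bot names above, and the `/wp-admin/` path, the 10-second delay, and the helper `build_robots_txt` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for the bots named in this post.
BLOCKED_BOTS = ["AhrefsBot", "DotBot", "Ezooms", "grapeshot"]


def build_robots_txt(blocked_bots, private_paths, crawl_delay):
    """Return robots.txt text: full block per named bot, light rules for the rest."""
    lines = []
    for bot in blocked_bots:
        # One group per bot, shutting it out of the whole site.
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Catch-all group: keep everyone out of private paths and ask for a delay.
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in private_paths]
    lines.append(f"Crawl-delay: {crawl_delay}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(BLOCKED_BOTS, ["/wp-admin/"], 10)

# Check the result the way a compliant bot would read it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ["Googlebot", "AhrefsBot"]:
    print(agent, parser.can_fetch(agent, "/blog/post"))
```

Keeping the blocked names in one list makes it easy to add a bot when it turns up in the logs. None of this stops a bot that ignores robots.txt, which is the job of the firewall tier.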