Good bots

Posted on February 20, 2020 by Kenny

Cloudflare offers a field called cf.client.bot that I can use to avoid having my firewall rules inadvertently block search engines and other good bots. But what does Cloudflare consider a “good bot”, and does their definition match mine? Kinda hard to say. Cloudflare does not make an up-to-date list of good bots available to the public. But, I can set a firewall rule to Allow cf.client.bot, then monitor the firewall event log over time to see which bots are being allowed. I’m unlikely to catch all the good bots, but I will get a pretty good idea.

I did just that. Here is my imperfect, point-in-time, list of what Cloudflare considers “good bots”, divided into categories:

1. Search Engines

Baidu: Chinese search engine
Bing: Microsoft search engine
coccocbot: Vietnamese search engine
Duck-Duck-Go: Search engine emphasizing privacy
Google: Massive search engine
MSN: Microsoft search engine
MojeekBot: UK search engine
Naver: South Korean search engine
Seznam: Czech search engine
Sogou: Chinese search engine
Yahoo: Search engine
Yandex: Russian search engine

2. Monitoring

LinkCheck
Pingdom
SiteUptime
Uptime

3. Social Media

Facebook: Social media
Feedbin: Feed fetcher
Linkedin: Social media
Mail.ru: Russian social media
Pintrest: Social media

4. Miscellaneous

Archive.org: The famous Wayback Machine
Adsbot-Google: Checks webpage ad quality
AppleBot: Finds information for Siri and Spotlight Suggestions

5. Research

BLEXBot
CCBot
Cliqzbot
DomainStatsBot
Grapeshot
Obot
Photon
Proximic
SemrushBot

Do I agree these are good bots?

Search engines: Yes, definitely. I want my site indexed in every major search engine.
Monitoring: Yes. I use Uptime, Broken Link Check, and others. I do not want those bots blocked.
Social Media: Yes.
Miscellaneous: Yes.
Research: No! These bots gather information from my site, process it, and use if for some commercial or academic purpose. They do me no harm, gather only data that I have willingly made publicly available, and provide legitimate monetary or other benevolent value – but for someone who is not me. Sorry, that’s not good enough. If you’re gonna crawl my site, mine my data, and use my bandwidth, you need to do something that benefits me.

Fortunately most of these research bots are “good” in the sense that they obey robots.txt directives. I can block them there and still use the firewall rule above to white-list the other good bots. Proximic and Grapeshot seem to be be exceptions. Both claim to obey robots.txt but neither seems to obey mine. I have to block Proximic and Grapeshot using firewall rules.

Good bots

Categories

Featured Posts