How to detect bots in a server-side script like PHP

Humans and bots will do similar things, but bots will do things that humans don’t. Let’s try to identify those things. Before we look at behavior, let’s accept RayQuang’s comment as being useful. If a visitor has a bot’s user-agent string, it’s probably a bot. I can’t imagine anybody going around with “Google Crawler” (or something similar) as a UA unless they’re working on breaking something. I know you don’t want to update a list manually, but auto-pulling a list of known bot user agents should be good, and even if that list stays stale for the next 10 years, it will be helpful.
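As a rough illustration of the user-agent check, here is a minimal sketch in Python (the same logic ports directly to PHP). The pattern list and the function name `looks_like_bot_ua` are assumptions made up for this example, not a complete or authoritative list of crawler signatures.

```python
import re

# Hypothetical, deliberately short list of substrings that commonly
# appear in crawler user-agent strings. A real list would be pulled
# automatically from a maintained source.
BOT_UA_PATTERN = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def looks_like_bot_ua(user_agent):
    """Return True if the UA string contains a known crawler marker."""
    if not user_agent:
        # A missing UA tells us nothing by itself in this sketch.
        return False
    return bool(BOT_UA_PATTERN.search(user_agent))
```

A substring match like this only catches bots that announce themselves, which is why the behavioral checks still matter.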

Some have already mentioned JavaScript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an “invisible” link. Link to a page in a very sneaky way that a user can’t see. If that link gets followed, we’ve got a bot.
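The invisible-link trap could be sketched like this in Python. The trap URL `/trap-page`, the inline `display:none` style, and the helper names are all assumptions for illustration; any way of hiding the link from human visitors would serve the same purpose.

```python
# Hypothetical URL that is only reachable through the hidden link.
TRAP_PATH = "/trap-page"

# IPs that have requested the trap page at least once.
flagged_ips = set()


def hidden_link_html():
    """Return markup for a link that a human visitor should never see or click."""
    return (
        f'<a href="{TRAP_PATH}" style="display:none" '
        'tabindex="-1" aria-hidden="true"></a>'
    )


def record_request(ip, path):
    """Flag the IP if it requested the trap page; return whether it is flagged."""
    if path == TRAP_PATH:
        flagged_ips.add(ip)
    return ip in flagged_ips
```

For example, calling `record_request("203.0.113.7", "/trap-page")` flags that address, and later requests from it keep returning `True`.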

Bots will often, though not always, respect robots.txt. Users don’t care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. A browser loads every stylesheet the page links to, so if our normal CSS is loaded but our dummy CSS isn’t, it’s almost certainly a bot. You’ll have to build a table of loads by IP (probably in memory) and do a “not contained in” match, but that should be a really solid tell.
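The table of loads by IP and the “not contained in” match might look like the following Python sketch. The stylesheet paths and function names are hypothetical, and the sketch assumes `/css/dummy.css` is listed under `Disallow` in robots.txt.

```python
from collections import defaultdict

# Hypothetical paths; DUMMY_CSS is assumed to be disallowed in robots.txt.
REAL_CSS = "/css/site.css"
DUMMY_CSS = "/css/dummy.css"
ROBOTS_TXT = "/robots.txt"

# In-memory table: IP address -> set of tracked paths it has requested.
loads_by_ip = defaultdict(set)


def record_load(ip, path):
    """Remember the request if it is for one of the tracked paths."""
    if path in (REAL_CSS, DUMMY_CSS, ROBOTS_TXT):
        loads_by_ip[ip].add(path)


def ips_skipping_dummy_css():
    """Return IPs that loaded the real CSS but never the dummy CSS."""
    return {
        ip
        for ip, paths in loads_by_ip.items()
        if REAL_CSS in paths and DUMMY_CSS not in paths
    }


def ips_fetching_robots_txt():
    """Return IPs that requested robots.txt."""
    return {ip for ip, paths in loads_by_ip.items() if ROBOTS_TXT in paths}
```

Since a browser requests the two stylesheets a moment apart, the match should probably run only after a short grace period, and cached stylesheets can produce false positives.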

So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, and add anything that loads the “real” CSS but ignores the CSS excluded by robots.txt. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using this to do a quick stats analysis and see how strongly those methods appear to be working for identifying things we know are bots.
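One possible shape for the bot table with a timestamp limit, sketched in Python with the standard `sqlite3` module. The schema, the function names, and the one-day default expiry are assumptions chosen for the example.

```python
import sqlite3
import time


def open_bot_table(path=":memory:"):
    """Open (or create) the table of flagged IPs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bots ("
        "ip TEXT PRIMARY KEY, reason TEXT NOT NULL, seen_at REAL NOT NULL)"
    )
    return conn


def flag_bot(conn, ip, reason, now=None):
    """Insert or refresh a flagged IP along with the reason it was flagged."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO bots (ip, reason, seen_at) VALUES (?, ?, ?)",
        (ip, reason, now),
    )
    conn.commit()


def is_flagged(conn, ip, max_age_seconds=86400, now=None):
    """Return True if the IP was flagged within the last max_age_seconds."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT seen_at FROM bots WHERE ip = ?", (ip,)
    ).fetchone()
    return row is not None and now - row[0] <= max_age_seconds
```

Storing the reason alongside each IP makes the suggested stats analysis easier: counting rows per reason and comparing them against known bot user agents shows which signal is doing the work.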

This answer comes from Stack Overflow, and I have pasted it here for my own easy reference. You can also find it here.
