Humans and bots will do similar things, but bots will do things that humans don’t. Let’s try to identify those things. Before we look at behavior, let’s accept RayQuang’s comment as being useful: if a visitor has a bot’s user-agent string, it’s probably a bot. I can’t imagine anybody going around with “Google Crawler” (or something similar) as a UA unless they’re working on breaking something. I know you don’t want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
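As a rough sketch of the user-agent check, something like the following would do. The token list here is purely illustrative and hand-written; the assumption is that a real setup would fill it from whichever auto-pulled list you settle on.

```python
# Hypothetical token list; in practice, load it from an auto-updated source.
BOT_UA_TOKENS = ("googlebot", "bingbot", "crawler", "spider", "slurp")


def looks_like_bot_ua(user_agent):
    """Return True if the User-Agent contains a known bot token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_UA_TOKENS)
```

A substring match is deliberately crude: it catches self-declared bots cheaply, and anything that lies about its UA is left to the behavioral checks.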
Bots will often, though not always, respect robots.txt. Users don’t care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. A browser fetches every stylesheet a page links, while a crawler that honors robots.txt skips the excluded one, so if our normal CSS is loaded but our dummy CSS isn’t, it’s almost certainly a bot. You’ll have to build a table (probably in-memory) of loads by IP and do a “not contained in” match, but that should be a really solid tell.
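The in-memory table could look roughly like this. It is a minimal sketch, and the paths `/static/site.css` and `/static/trap.css` and the `LoadTracker` name are made up for illustration; it assumes your request handler or log parser calls `record()` for every hit.

```python
from collections import defaultdict

REAL_CSS = "/static/site.css"   # hypothetical path, allowed for crawlers
DUMMY_CSS = "/static/trap.css"  # hypothetical path, disallowed in robots.txt
ROBOTS = "/robots.txt"


class LoadTracker:
    """Remembers which paths each IP has requested."""

    def __init__(self):
        self.paths_by_ip = defaultdict(set)

    def record(self, ip, path):
        self.paths_by_ip[ip].add(path)

    def suspected_bots(self):
        """IPs that fetched the real CSS but never the robots.txt-excluded one."""
        return {
            ip
            for ip, paths in self.paths_by_ip.items()
            if REAL_CSS in paths and DUMMY_CSS not in paths
        }

    def robots_fetchers(self):
        """IPs that downloaded robots.txt at all."""
        return {ip for ip, paths in self.paths_by_ip.items() if ROBOTS in paths}
```

You would probably want to run `suspected_bots()` only after a short grace period, since a real browser may have requested the first stylesheet but not yet the second.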
So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, and add anything that loads the “real” CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter on the user-agent string as the last step, and consider using it for a quick stats analysis to see how well those methods identify things we know are bots.
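One possible shape for that table, sketched with SQLite from the Python standard library. The table name, column names, and the one-day expiry are assumptions made for the example, not part of the original answer.

```python
import sqlite3
import time


def open_bot_table(path=":memory:"):
    """Open (or create) the table of flagged IPs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bots ("
        " ip TEXT PRIMARY KEY, reason TEXT NOT NULL, seen_at REAL NOT NULL)"
    )
    return conn


def flag_bot(conn, ip, reason, now=None):
    """Insert or refresh an IP along with the reason it was flagged."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO bots (ip, reason, seen_at) VALUES (?, ?, ?)",
        (ip, reason, now),
    )
    conn.commit()


def is_flagged(conn, ip, max_age_seconds=86400, now=None):
    """True if the IP was flagged within the last max_age_seconds."""
    now = time.time() if now is None else now
    row = conn.execute("SELECT seen_at FROM bots WHERE ip = ?", (ip,)).fetchone()
    return row is not None and now - row[0] <= max_age_seconds
```

Storing the reason alongside each IP makes the stats comparison easy later: you can count how many entries flagged by each behavioral method also carry a known bot user-agent.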
– This answer was found on Stack Overflow, and I have pasted it here for my easy reference; you can also find it here.