
I've been following a few spiders in our logs, and a traceroute on their IPs shows they are in fact EC2 instances. The user agents are listed as Googlebot and msnbot, but the addresses don't belong to Google or Microsoft. Is there anything I can do? Is spoofing user agents a common practice? I'm guessing that if I ban their IPs (which I've done) they'll just start a new instance and carry on, and I don't want to ban all EC2 instances.
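For anyone wanting to reproduce the check: a minimal Python sketch of the reverse-then-forward DNS test Google documents for verifying Googlebot. An EC2 host claiming to be Googlebot fails it, since its PTR record sits in Amazon's domain. (The domain list covers Googlebot; msnbot would need Microsoft's domains instead.)

    import socket

    def verify_search_bot(ip, domains=(".googlebot.com", ".google.com")):
        """Reverse-then-forward DNS test: the IP's PTR record must sit
        in the crawler's domains, and that hostname must resolve back
        to the same IP."""
        try:
            host = socket.gethostbyaddr(ip)[0]        # reverse (PTR) lookup
        except socket.herror:
            return False                              # no PTR record at all
        if not host.endswith(domains):
            return False                              # e.g. an EC2 hostname
        try:
            return ip in socket.gethostbyname_ex(host)[2]  # forward-confirm
        except socket.gaierror:
            return False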

Ryan Detzel

1 Answer


When you really start to delve into logs, you'll find that a huge number of robots spoof their headers, most of them posing as IE (some unsuccessfully; typos get an agent string spotted fast!).

There's an interesting EFF experiment, Panopticlick, that looks at uniquely identifying users from the data their browsers present. Gathering more information at the application level to inform blocking could get you somewhere, as non-browser clients will fail to send some of those fields.
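As a rough sketch of that application-level idea (the header names here are just the common ones browsers send; any real rule set would need tuning against your own logs):

    REQUIRED_BROWSER_HEADERS = ("Accept", "Accept-Language", "Accept-Encoding")

    def looks_like_fake_browser(headers):
        """Heuristic: flag a request whose User-Agent claims to be a
        browser but which is missing headers every mainstream browser
        sends."""
        ua = headers.get("User-Agent", "")
        claims_browser = "Mozilla" in ua or "MSIE" in ua
        missing = any(h not in headers for h in REQUIRED_BROWSER_HEADERS)
        return claims_browser and missing

    # a bot spoofing IE but sending a bare request gets flagged
    print(looks_like_fake_browser({
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
        "Accept": "*/*",
    }))  # -> True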

But just as blocking their IPs is unlikely to work for long, blocking on user-agent (or any other single criterion) won't hold up if they're determined to spider you. In the end, it's not worth your time or energy to try to block every rogue bot on the net; just set up your robots.txt file, keep an eye out for the nasty ones trying to hit you with SQL injection or the like, and rest easy.
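A minimal robots.txt along those lines (the bot name and path are placeholders; only well-behaved crawlers will honor it, which is rather the point):

    # placeholder names; rogue bots will ignore this file anyway
    User-agent: BadBot
    Disallow: /

    User-agent: *
    Disallow: /admin/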

Shane Madden