
I've started tracking user-agent strings on a website at the start of each session. Looking at the data for this month so far, I'm seeing one search engine bot that keeps coming up a lot:

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

From 9/1/2011 to 9/13/2011 I've logged 2090 hits from this user-agent. From the other search engines I'm tracking, I'm seeing much lower numbers of hits:

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) - 353

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) - 175

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) - 110

www.baidu.com seems to be a Chinese version of Google. Is there a way to throttle their bot? I don't mind them indexing us... in fact it's probably a good thing, as we have a large Asian population using the site, but Baiduspider seems to be crawling far more heavily than the others.

Justin808

2 Answers


You want to throttle the bot, but you don't appear to know WHY you want to do this.
Are you experiencing a performance impact? Is the traffic pushing you over a bandwidth or transfer threshold?

Throttling a bot "just because" is a waste of effort: if it isn't hurting you, I suggest you leave it alone.

If it is causing problems, you can use your sitemap.xml (the <changefreq> hints) to suggest how often the bot should re-crawl pages, or a robots.txt Crawl-delay directive to limit the crawl rate. Note that both of these are only hints and can be ignored, which would leave you only the option of blocking the user agent with (e.g.) an Apache mod_rewrite rule -- and that would also result in your not being indexed at all...
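
For reference, the robots.txt approach looks roughly like the sketch below. Crawl-delay is a non-standard directive and, as noted, the crawler is free to ignore it; the 10-second value is just an illustration.

    # robots.txt at the site root (sketch)
    # Ask Baidu's crawler to wait at least 10 seconds between requests.
    # Non-standard directive; the bot may honor or ignore it.
    User-agent: Baiduspider
    Crawl-delay: 10

    # Everyone else: no restrictions.
    User-agent: *
    Disallow: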

voretaq7
  • The WHY is a valid point. It was an anomaly in my data that seemed odd and out of place. If this is normal for this search engine then I guess it's not really a problem. – Justin808 Sep 13 '11 at 20:47
  • Not all anomalies are problems, though it's good that you're aware of the anomaly. Search engine crawls can be bursty at times, and certainly if it's affecting performance it's something that needs to be addressed, but if there's no negative impact I'd let them do their indexing run however they want. The faster it finishes the faster your data will be updated in their index, and theoretically it won't have to re-crawl those pages for a while. – voretaq7 Sep 13 '11 at 20:50

I wrote this response to a similar question yesterday: Blocking by user-agent string in httpd.conf not effective

It basically says this:

If you don't want specific user-agents (robots) indexing you, follow [those] steps. If you don't want ANY robots to index you, follow [those2] steps.

It uses either the httpd.conf file or, if that's easier, the .htaccess file, and sets up some rewrite rules (there's a rough sketch of what those rules look like below). Hope it's useful to you. As for limiting how often they can index you, you'd need to prove you own the website (like you do with Google), then go into their "webmaster tools" and select a very slow indexing rate. But here's my input:

<2-cents>
Unless the bots slow your server down, let them be. They don't hurt unless they are "bad bots" that access sensitive data.
</2-cents>
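
For what it's worth, the user-agent blocking described above usually boils down to a couple of mod_rewrite lines. Here's a rough sketch for .htaccess (assuming mod_rewrite is enabled and .htaccess overrides are allowed; and remember, this blocks Baidu from indexing you at all rather than throttling it):

    # .htaccess sketch: refuse any request whose User-Agent contains "Baiduspider"
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
    # Return 403 Forbidden and stop processing further rules
    RewriteRule .* - [F,L]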

Good luck.

U4iK_HaZe