5

I want to block all requests from the yandex.ru search bot. It is very traffic intensive (2 GB/day). I first blocked one class C IP range, but it seems this bot appears from different IP ranges.

For example:

spider31.yandex.ru -> 77.88.26.27
spider79.yandex.ru -> 95.108.155.251
etc.

I could put some Disallow rules in robots.txt, but I am not sure the bot respects them. I am thinking of blocking a list of IP ranges instead.
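For reference, the robots.txt rules would look like this (whether Yandex actually honors them is exactly what I am unsure about):

```
User-agent: Yandex
Disallow: /
```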

Can somebody suggest a general solution?

Ross

6 Answers

6

Don't believe what you read on forums about this! Trust what your server logs tell you. If Yandex obeyed robots.txt, you would see the evidence in your logs. I have seen for myself that Yandex robots do not even READ the robots.txt file!

Quit wasting time with long IP lists that only serve to slow down your site drastically.

Enter the following lines in .htaccess (in the root folder of each of your sites):

SetEnvIfNoCase User-Agent "Yandex" bad_bot
Order Deny,Allow
Deny from env=bad_bot

I did, and all Yandex gets now are 403 Access denied errors.

Good bye Yandex!

Mark Henderson
  • Thanks. I will try this. I currently block 2 IP ranges only, but who knows in the future. – Ross May 25 '10 at 14:23
  • I use the same rules in my Apache configurations to stop Yandex as well. I tried robots.txt but they seemed to ignore it. – Jeremy Bouse Aug 09 '10 at 00:19
3

I'm too young here (reputation-wise) to post all the URLs I need as hyperlinks, so please pardon my parenthesized URLs.

The forum link from Dan Andreatta, and this other one, have some but not all of what you need. You'll want to use their method of finding the IP numbers, and script something to keep your lists fresh. Then you want something like this, to show you some known values, including the sub-domain naming schemes they have been using. Keep a crontabbed eye on their IP ranges, and maybe automate something to estimate a reasonable CIDR (I didn't find any mention of their actual allocation; that could just be Google failing me).

Find their IP range(s) as accurately as possible, so you don't waste time doing a reverse DNS lookup while users are waiting for (http://yourdomain/notpornipromise); instead you're only doing a comparison match or something. Google just showed me grepcidr, which looks highly relevant. From the linked page: "grepcidr can be used to filter a list of IP addresses against one or more Classless Inter-Domain Routing (CIDR) specifications, or arbitrary networks specified by an address range." I guess it is nice that it's a purpose-built tool with known I/O, but you know that you can reproduce the function in a billion different ways.
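grepcidr does the job, but the core check (is this address inside this CIDR?) is small enough to reproduce yourself. A sketch in plain Bourne shell (the function names are mine; 77.88.0.0/18 is one of the Yandex routes Daniel lists in a comment below):

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip2int() {
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr IP CIDR -> succeeds (exit 0) if IP falls inside CIDR.
in_cidr() {
    ip=$(ip2int "$1")
    net=$(ip2int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

in_cidr 77.88.26.27 77.88.0.0/18 && echo inside || echo outside
```

Feed it the client IP from your access log and you have the same comparison match, with no reverse DNS on the hot path.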

The most "general solution" I can think of for this, and the one I actually wish to share (speaking things into existence and all that), is for you to start building a database of such offenders at your location(s), and to spend some off-hours thinking and researching ways to defend against and counter the behavior. This takes you deeper into intrusion detection, pattern analysis, and honeynets than the scope of this specific question truly warrants. However, within the scope of that research are countless answers to the question you have asked.

I found this question due to Yandex's interesting behavior on one of my own sites. I wouldn't call what I see in my own log abusive, but spider50.yandex.ru consumed 2% of my visit count and 1% of my bandwidth. I can see where the bot would be truly abusive to large files and forums and such, neither of which are available for abuse on the server I'm looking at today. What was interesting enough to warrant investigation was the bot reading /robots.txt, then waiting 4 to 9 hours and asking for a /directory/ not in it, then waiting 4 to 9 hours, asking for /another_directory/, then maybe a few more, then /robots.txt again, repeating ad infinitum. As far as frequency goes, I suppose they're well behaved enough, and the spider50.yandex.ru machine appeared to respect /robots.txt.

I'm not planning to block them from this server today, but I would if I shared Ross' experience.

For reference on the tiny numbers we're dealing with in my server's case, today:

Top 10 of 1315 Total Sites By KBytes
 #  Hits   %     Files  %     KBytes %     Visits %    Hostname
 1 247 1.20% 247 1.26% 1990 1.64% 4 0.19% ip98-169-142-12.dc.dc.cox.net
 2 141 0.69% 140 0.72% 1873 1.54% 1 0.05% 178.160.129.173
 3 142 0.69% 140 0.72% 1352 1.11% 1 0.05% 162.136.192.1
 4 85 0.41% 59 0.30% 1145 0.94% 46 2.19% spider50.yandex.ru
 5 231 1.12% 192 0.98% 1105 0.91% 4 0.19% cpe-69-135-214-191.woh.res.rr.com
 6 16 0.08% 16 0.08% 1066 0.88% 11 0.52% rate-limited-proxy-72-14-199-198.google.com
 7 63 0.31% 50 0.26% 1017 0.84% 25 1.19% b3090791.crawl.yahoo.net
 8 144 0.70% 143 0.73% 941  0.77% 1 0.05% user10.hcc-care.com
 9 70 0.34% 70 0.36% 938  0.77% 1 0.05% cpe-075-177-135-148.nc.res.rr.com
10 205 1.00% 203 1.04% 920  0.76% 3 0.14% 92.red-83-54-7.dynamicip.rima-tde.net

That's on a shared host that doesn't even bother capping bandwidth anymore, and if the crawl took some DDoS-like form, they would probably notice and block it before I would. So I'm not angry about that. In fact, I much prefer having the data they write in my logs to play with.

Ross, if you really are angry about the 2GB/day you're losing to Yandex, you might spampoison them. That's what it's there for! Reroute them from what you don't want them downloading, either by HTTP 301 directly to a spampoison sub-domain, or roll your own so you can control the logic and have more fun with it. That sort of solution gives you the tool to reuse later, when it's even more necessary.
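A minimal .htaccess sketch of that 301 reroute (the destination URL is a placeholder for whatever spampoison link, or home-rolled handler, you end up using):

```
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Yandex [NC]
RewriteRule ^ http://yourlink.spampoison.com [R=301,L]
```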

Then start looking deeper in your logs for funny ones like this:

217.41.13.233 - - [31/Mar/2010:23:33:52 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:33:54 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:33:58 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:00 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:01 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:03 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:04 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:05 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:06 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:09 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:14 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:16 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:17 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:18 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:21 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:23 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:24 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:26 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:27 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:28 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"

Hint: No /user/ directory, nor a hyperlink to such, exists on the server.

wfaulk
Daniel
  • Here is some of their IP information: http://cqcounter.com/whois/index.php?query=213.180.199.34 and http://cqcounter.com/whois/index.php?query=77.88.19.60. Routes: 213.180.192.0/19 (Yandex network), 213.180.199.0/24 (Yandex enterprise network), 77.88.0.0/18 (Yandex enterprise network). – Daniel May 01 '10 at 13:14
  • Thanks, a lot of useful information. I will try some of this. Meanwhile, for some of my web sites I have put up a whitelist robots.txt for Google, Yahoo and MSN only. – Ross May 02 '10 at 10:22
  • Ah yes. Good call on the white list. You're welcome. – Daniel May 04 '10 at 07:03
1

According to this forum, the Yandex bot is well behaved and respects robots.txt.

In particular, they say:

The behaviour of Yandex is quite a lot like that of Google with regard to robots.txt .. The bot doesn't look at the robots.txt every single time it enters the domain.

Bots like Yandex, Baidu, and Sohu have all been fairly well behaved and, as a result, are allowed. None of them have ever gone places I didn't want them to go, and parse rates don't break the bank with regard to bandwidth.

Personally I have no issues with it, and Googlebot is by far the most aggressive crawler on the sites I run.

Dan Andreatta
  • It is definitely not well behaved and acts more like a DDoS attack. I do not want to rely on its mercy to respect the rules. On some of my web sites I have no problems with it, but on two of them it makes a lot of requests for a single page, multiple times per second. – Ross Apr 29 '10 at 09:18
  • It could be someone just pretending to be the Yandex bot. Do the IP addresses the requests come from actually originate from a Yandex server? – Dan Andreatta Apr 29 '10 at 09:42
  • Yes, it comes from spider*.yandex.ru. I have found that a lot of people have a similar problem with Yandex. – Ross Apr 29 '10 at 10:44
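The spoofing question in the comments above can be settled with a forward-confirmed reverse DNS check, which is the verification method Yandex (like Google) documents for its crawlers. A sketch (the function name is mine; only the suffix check is shown as code, since the DNS lookups depend on your resolver):

```shell
#!/bin/sh
# Check whether a PTR name looks like a genuine Yandex crawler host.
# Full verification: (1) reverse-resolve the client IP, (2) check the
# name ends in yandex.ru/.net/.com, (3) forward-resolve that name and
# confirm it returns the original IP.
yandex_ptr_ok() {
    case $1 in
        *.yandex.ru|*.yandex.net|*.yandex.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Steps (1) and (3) in practice, e.g. with dig:
#   ptr=$(dig +short -x 77.88.26.27)   # e.g. spider31.yandex.ru.
#   dig +short "${ptr%.}"              # must print 77.88.26.27 again
yandex_ptr_ok spider31.yandex.ru && echo genuine || echo suspect
```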
1

My current solution is this (for NGINX web server):

if ($http_user_agent ~* (Yandex) ) {
        return 444;
}

The match is case-insensitive. The directive inspects the User-Agent string; if "Yandex" appears anywhere in it, nginx returns its special 444 code, which closes the connection without sending any response headers.

Mark Henderson
Ross
1

Get nasty by adding these lines to your .htaccess file to target all visitors from 77.88.26.27 (or whatever the IP is) who try to access a page ending in .shtml:

# permanently redirect specific IP request for entire site
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{REMOTE_HOST} 77\.88\.26\.27
RewriteRule \.shtml$ http://www.youtube.com/watch?v=oHg5SJYRHA0 [R=301,L]

That Yandex bot now gets rickrolled every time it tries to index your site. Problem solved.

0

Please, people, look up the OSI model. I recommend blocking these networks at the routing level, i.e. layer 3 (or layer 4 if you filter on transport). If you block them at the server level, layers 5 through 7, the request has already traversed your whole stack. The kernel can also handle those requests far more cheaply than an Apache server can: RewriteRule upon RewriteRule, SetEnv directives and so on just bog your server down, even if you do get to serve that cool 403. A request is a request, and Yandex, and Baidu too, make a lot of them, while Google is also scanning in the background. Do you really want to be flooded by requests? It costs you web server slots, and Baidu is known for doing this intentionally.

61.51.16.0 - 61.51.31.255         61.51.16.0/20     # (Baidu China - Beijing)
14.136.0.0 - 14.136.255.255       14.136.0.0/16     # (Baidu China - H.K.)
123.125.71.0 - 123.125.71.255     123.125.71.0/24   # (Baidu China)
14.208.0.0 - 14.223.255.255       14.208.0.0/12     # (Baidu China)
95.108.241.0 - 95.108.241.255     95.108.241.0/24   # (YandexBot Russian Federation)
95.108.151.0 - 95.108.151.255     95.108.151.0/24   # (YandexBot Russian Federation)
119.63.192.0 - 119.63.199.255     119.63.192.0/21   # (Baidu Japan Inc.)
119.63.196.0 - 119.63.196.255     119.63.196.0/24   # (Baidu Japan Inc.)
180.76.0.0 - 180.76.255.255       180.76.0.0/16     # (Baidu China, Baidu Plaza, Beijing)
220.181.0.0 - 220.181.255.255     220.181.108.0/24  # (CHINANET Beijing Province Network)

New Ranges: (Updated Tue, May, 8th, 2012)

123.125.71.0 - 123.125.71.255     123.125.71.0/24   # (Baidu China)
202.46.32.0 - 202.46.63.255       202.46.32.0/19    # (Baidu China)

New Ranges: (Updated Sun, May, 13th, 2012)

39.112.0.0 - 39.127.255.255       39.112.0.0/12     # KOREAN
211.148.192.0 - 211.148.223.255   211.148.192.0/19  # China (ShenZhen)
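To turn a list like the one above into kernel-level blocks, a short script can generate the rules rather than you maintaining them by hand. A sketch (the file name is my own; it assumes the list format above, with the CIDR in the fourth whitespace-separated column):

```shell
#!/bin/sh
# Emit one netfilter DROP rule per CIDR in a block list.
# Expected line format: "start - end  CIDR  # comment"
gen_rules() {
    awk '!/^#/ && NF >= 4 { print "iptables -A INPUT -s " $4 " -j DROP" }' "$1"
}

cat > /tmp/blocked.txt <<'EOF'
180.76.0.0 - 180.76.255.255       180.76.0.0/16     # (Baidu China)
95.108.241.0 - 95.108.241.255     95.108.241.0/24   # (YandexBot)
EOF

gen_rules /tmp/blocked.txt
```

Review the output, then pipe it to sh as root; or swap the print for `"ip route add blackhole " $4` if you want pure routing-table null routes instead of netfilter.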
bigduke5
  • That was a remarkably verbose way to say "Block their IPs". – wfaulk May 05 '12 at 12:53
  • Yeah, you could say so ;). These bots were getting on my last nerve. I bet Baidu China runs more crawling servers than Google ever would. It looks more like flooding and spying to me. – bigduke5 May 06 '12 at 06:55