2

Googlebot is crawling my site right now and it's killing my server. It's only crawling one or two pages a second, but those pages are really CPU intensive. I have already added those CPU-intensive files to the robots.txt file, but Googlebot hasn't picked up those changes yet. I want to block Googlebot at the apache.conf level so my site can come back up right now. How can I do this? This one Apache instance hosts a few PHP sites and a Django-powered site, so I can't use .htaccess files. The server is running Ubuntu 10.04.

nbv4
  • 593
  • 3
  • 9
  • 18
  • did you look at this? http://serverfault.com/questions/128937/how-do-i-rate-limit-googles-crawl-of-my-class-c-ip-block – aif Nov 12 '10 at 10:08
  • 5
    If you have some pages that are so CPU intensive that googlebot alone can kill your server, what do you do with your visitors? Maybe you should look at the site code instead of blocking googlebot. – Frederik Oct 20 '12 at 18:45
  • Fix the problem, not the symptom. – John Gardeniers Oct 20 '12 at 23:12

5 Answers

6

I see you are currently trying to use glob patterns in your robots.txt.

From The web robots page:

Note also that globbing and regular expression are not supported in either
the User-agent or Disallow lines. The '*' in the User-agent field is a
special value meaning "any robot". Specifically, you cannot have lines like
"User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

You would either need to do what Arenstar or Tom O'Connor recommend (that is, use an Apache ACL to block them, or drop the traffic at the IP level) or, possibly, route the IP addresses via 127.0.0.1 (that'd stop them from establishing TCP sessions in the first place).
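For what it's worth, here is a rough sketch of both approaches; the IP address is a placeholder (not a real Googlebot address), and since Ubuntu 10.04 ships Apache 2.2 the old Order/Deny syntax is assumed:

    # Apache 2.2 ACL in the relevant vhost: deny one client IP (192.0.2.1 is a placeholder)
    <Location />
        Order Allow,Deny
        Allow from all
        Deny from 192.0.2.1
    </Location>

    # Or null-route the address at the OS level (run as root);
    # the first form routes replies via loopback, the second black-holes it outright
    route add -host 192.0.2.1 gw 127.0.0.1 lo
    ip route add blackhole 192.0.2.1/32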

Long-term, consider whether you can place all your CPU-intensive pages under a common prefix; then you'll be able to use robots.txt to instruct crawlers to stay away from them.
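For example, with everything CPU-heavy moved under a single (hypothetical) /cpu-heavy/ prefix, a plain prefix rule is all robots.txt needs:

    User-agent: *
    Disallow: /cpu-heavy/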

Vatine
  • 5,390
  • 23
  • 24
4

Firstly, use a robots.txt file in your document root directory. Spiders and bots normally look for this file before beginning a scan.

Use a .htaccess file (this could also be put in your Apache configs, though it needs a syntax change):

   RewriteEngine on
   RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
   RewriteRule ^(.*)$ http://google.com/ [R,L]

http://www.besthostratings.com/articles/block-bad-bots.html
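Since the question rules out .htaccess, roughly the same thing inside the vhost config would look like this (a sketch; remember to reload Apache afterwards):

    # inside the <VirtualHost> block for the affected site
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
    RewriteRule .* http://google.com/ [R=302,L]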

Hope this helps.. :D

Arenstar
  • 3,592
  • 2
  • 24
  • 34
  • My robots.txt is currently "User-agent: * Disallow: *.kmz" and it is still grabbing the CPU intensive kmz files. I need to kill it at the apache level. – nbv4 Nov 12 '10 at 06:46
  • I just updated with a link that will show you how to do it either from .htaccess or in your Apache configs (vhost config) – Arenstar Nov 12 '10 at 06:49
  • This rewrite just tells the agent to go annoy google.com. It blocks as a side-effect, but it's not the correct response code. – vy32 Oct 20 '12 at 18:44
4

If you know the googlebot's IP address, you could set a DROP rule in iptables, but that's a real hack.

iptables -I INPUT -s [source ip] -j DROP

where [source ip] is the googlebot's IP.
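If you need to find that address first, something along these lines against the access log will list the client IPs claiming to be Googlebot (the log path assumes a stock Ubuntu/Apache layout):

    grep -i googlebot /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head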

This'd definitely stop them, instantly, but it's a bit.. low level.

To unblock

iptables -D INPUT -s [source ip] -j DROP
Tom O'Connor
  • 27,440
  • 10
  • 72
  • 148
  • Wow! Simple and really instant solution. Nice – Saurabh Barjatiya Nov 12 '10 at 12:44
  • However, not an intelligent solution.. What if googlebots use a range of IP addresses? Or if they change.. This solution is absolutely not flexible.. as said, it's just a hack.. – Arenstar Nov 13 '10 at 18:31
  • @Arenstar If you've got a server that's dying, because a single IP address is causing a disproportionate amount of traffic, this is perfect, but by no means ideal. – Tom O'Connor Nov 14 '10 at 10:07
3

Assuming you don't actually want your site delisted from Google (which the accepted answer will eventually cause), set a crawl delay value for your site in Google Webmaster Tools. It is reported that Google does not support Crawl-delay in robots.txt, though you may wish to set that value for other search engines and crawlers to use.
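For the crawlers that do honor it, the directive is just one extra line in robots.txt (the 10-second value here is an arbitrary example):

    User-agent: *
    Crawl-delay: 10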

Michael Hampton
  • 237,123
  • 42
  • 477
  • 940
1

We wanted to block a specific directory from robots. We had a robots.txt entry, but it was being ignored by many robots, so we added the snippet below to our Apache configuration file; note that we commented out the Wget line because we wanted to allow that. It works by blocking based on HTTP_USER_AGENT.

The list comes (obviously) from http://www.javascriptkit.com/howto/htaccess13.shtml; when we modify configuration files with information we get from the web, we always include a back-pointer so we know where it came from.

<Directory "/var/www/domaintoblock/directorytoblock/">

            # Block bots; from http://www.javascriptkit.com/howto/htaccess13.shtml                    
            # Note that we allow wget                                                                 
            RewriteEngine On
            RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
            RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
            RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
            RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
            RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
            RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
            RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
            RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
            RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
            RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
            RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
            RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
            RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
            RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
            RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
            RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
            RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
            RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
            RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
            RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
            RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
            RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
            RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
            RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
            RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
            RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
            RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
            RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
            RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
            RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
            RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
            RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
            RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
            RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
            RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
            RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
            RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
            RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
            #RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]                                                
            RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
            RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
            RewriteCond %{HTTP_USER_AGENT} ^Zeus
            RewriteRule ^.* - [F,L]
</Directory>
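This relies on mod_rewrite being loaded; on a Debian/Ubuntu layout, enabling it and reloading would look roughly like:

    a2enmod rewrite
    /etc/init.d/apache2 reload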
vy32
  • 2,018
  • 1
  • 15
  • 20