I'd like to block some spiders and bad bots by User-Agent string for all of my virtual hosts via httpd.conf, but I have yet to find success. Below are the relevant contents of my httpd.conf file. Any ideas why this isn't working? env_module is loaded.

SetEnvIfNoCase User-Agent "^BaiDuSpider" UnwantedRobot
SetEnvIfNoCase User-Agent "^Yandex" UnwantedRobot
SetEnvIfNoCase User-Agent "^Exabot" UnwantedRobot
SetEnvIfNoCase User-Agent "^Cityreview" UnwantedRobot
SetEnvIfNoCase User-Agent "^Dotbot" UnwantedRobot
SetEnvIfNoCase User-Agent "^Sogou" UnwantedRobot
SetEnvIfNoCase User-Agent "^Sosospider" UnwantedRobot
SetEnvIfNoCase User-Agent "^Twiceler" UnwantedRobot
SetEnvIfNoCase User-Agent "^Java" UnwantedRobot
SetEnvIfNoCase User-Agent "^YandexBot" UnwantedRobot
SetEnvIfNoCase User-Agent "^bot*" UnwantedRobot
SetEnvIfNoCase User-Agent "^spider" UnwantedRobot
SetEnvIfNoCase User-Agent "^crawl" UnwantedRobot
SetEnvIfNoCase User-Agent "^NG\ 1.x (Exalead)" UnwantedRobot
SetEnvIfNoCase User-Agent "^MJ12bot" UnwantedRobot

<Directory "/var/www/">
    Order Allow,Deny
    Allow from all
    Deny from env=UnwantedRobot
</Directory>
<Directory "/srv/www/">
    Order Allow,Deny
    Allow from all
    Deny from env=UnwantedRobot
</Directory>

EDIT - @Shane Madden: I do have .htaccess files in each virtual host's document root with the following:

order allow,deny
deny from xxx.xxx.xxx.xxx
deny from xx.xxx.xx.xx
deny from xx.xxx.xx.xxx
...
allow from all

Could that be creating a conflict? Here's a sample VirtualHost config:

<VirtualHost xx.xxx.xx.xxx:80>
 ServerAdmin admin@domain.com
 ServerName domain.com
 ServerAlias www.domain.com
 DocumentRoot /srv/www/domain.com/public_html/
 ErrorLog "|/usr/bin/cronolog /srv/www/domain.com/logs/error_log_%Y-%m"
 CustomLog "|/usr/bin/cronolog /srv/www/domain.com/logs/access_log_%Y-%m" combined
</VirtualHost>
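
One quick way to check whether a given User-Agent is actually being denied (assuming curl is available; domain.com here is just the placeholder from the config above) is to spoof the header from the command line:

    # Expect a 403 Forbidden if the UnwantedRobot deny rule matches,
    # and a 200 OK if the request is let through.
    curl -I -A "Sogou web spider/4.0" http://www.domain.com/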

2 Answers


Try this, and if it fails, try it in a .htaccess file...

   #Bad bot removal
   RewriteEngine on
   RewriteCond %{HTTP_USER_AGENT} ^useragent1 [OR]
   RewriteCond %{HTTP_USER_AGENT} ^useragent2 [OR]
   RewriteCond %{HTTP_USER_AGENT} ^useragent3
   RewriteRule ^(.*)$ http://website-you-want-to-send-bad-bots-to.com

Follow this pattern, and don't put an [OR] on the very last one.
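
For instance, a sketch that plugs a few of the bots from the question into that pattern might look like this (the bot names and the target URL are just illustrative; [NC] makes the match case-insensitive and [L] stops further rewriting once the rule fires):

    # Send a few known bad bots elsewhere (sketch only)
    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} Sogou [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Sosospider [NC]
    RewriteRule ^(.*)$ http://website-you-want-to-send-bad-bots-to.com [L]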

EDIT: New solution:

If you want to block all (friendly) bots, create a file called "robots.txt" in the same directory as your index.html, and put this in it:

User-agent: *
Disallow: /

You'd still need to maintain a list like my original answer (above) to disallow the bots that ignore robots.txt.
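
Conversely, if you only want to turn away specific well-behaved crawlers rather than all of them, robots.txt can also name them individually, for example (the bot names here are illustrative; each bot only honours the record that names it):

    User-agent: Baiduspider
    Disallow: /

    User-agent: Sogou web spider
    Disallow: /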

  • Done. Will need to check on the results later. Thanks. – Ferdinand.Bardamu Sep 12 '11 at 21:27
  • Yep, hope you're successful. If you want to test it, use IE: search for "Change UA in IE" to find a toolbar add-on that lets you set your User-Agent to anything you like, then try visiting your site. See my edit. – U4iK_HaZe Sep 12 '11 at 21:30
  • Thanks. I'd like to have a single file to manage so am ignoring robots.txt for now. (And I don't want to block all friendly bots, just those outside target markets.) – Ferdinand.Bardamu Sep 12 '11 at 22:13
  • Fine by me, but robots.txt can also disallow specific bots by name; the .htaccess or httpd.conf approach is more robust, though, since a robot can simply choose to ignore robots.txt. – U4iK_HaZe Sep 12 '11 at 22:27

For the benefit of those who may read this later, here's the deal:

I deleted the order allow,deny directives from my .htaccess files and was then able to trigger the expected behavior for certain user-agents when I spoofed them with User Agent Switcher in Firefox, so it does appear there was a conflict.

Other user-agents on my list, however, were still not blocked, and that turned out to be because I was unclear on the significance of the caret (^) as used in my httpd.conf. The regular-expression tutorials I read stated this, but it didn't really sink in at first: the caret anchors the match to the very beginning of the entire User-Agent string (not to individual substrings within it, as I originally thought) when the connection request is parsed. Since the key identifying string for some of the spiders and bots I wish to block occurs later in the User-Agent string, I needed to drop the caret to get things working.
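
For example, a corrected set of rules, sketched from the list above with the anchors removed so the match can occur anywhere in the User-Agent header, would look like this:

    SetEnvIfNoCase User-Agent "Baiduspider" UnwantedRobot
    SetEnvIfNoCase User-Agent "Sosospider" UnwantedRobot
    SetEnvIfNoCase User-Agent "Sogou" UnwantedRobot
    SetEnvIfNoCase User-Agent "MJ12bot" UnwantedRobot
    # ...and so on for the rest of the list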