
We have an XAMPP Apache development web server set up with virtual hosts and want to stop search engines from crawling all our sites. This is easily done with a robots.txt file. However, we'd rather not include a disallow robots.txt in every vhost and then have to remove it when we go live with the site on another server.

Is there a way, with an Apache config file, to rewrite all requests for robots.txt on all vhosts to a single robots.txt file?

If so, could you give me an example? I think it would be something like this:

RewriteEngine On
RewriteRule  .*robots\.txt$         C:\xampp\vhosts\override-robots.txt [L] 

Thanks!

Mike B
  • robots.txt is not mandatory and some crawlers will ignore it. It should not be seen as a security feature. If you want to hide your site until it is ready for the public, add authentication. – Mircea Vutcovici Dec 16 '10 at 19:21
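For illustration, a minimal sketch of such password protection in an Apache config (the .htpasswd path and realm name here are hypothetical, not from the thread):

# Create the credentials file once, e.g.: htpasswd -c C:/xampp/.htpasswd devuser
<Location "/">
    AuthType Basic
    AuthName "Development site"
    AuthUserFile "C:/xampp/.htpasswd"
    Require valid-user
</Location>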

4 Answers


Apache's mod_alias is designed for this. It is part of the core Apache system, can be set in one place, and has almost no processing overhead, unlike mod_rewrite.

Alias /robots.txt C:/xampp/vhosts/override-robots.txt

With that line in the apache2.conf file, outside all the vhosts, http://example.com/robots.txt on any website the server hosts will output the given file.
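A fuller global snippet might look like the following (a sketch: the target path is the one above, and the access grant, echoing a comment below, uses the Apache 2.2 Order/Allow syntax of the era; Apache 2.4 would use Require all granted instead):

# In the global config (httpd.conf/apache2.conf), outside any <VirtualHost>:
Alias /robots.txt "C:/xampp/vhosts/override-robots.txt"

# Keep the aliased file reachable even if other access controls would block it:
<Location "/robots.txt">
    Order allow,deny
    Allow from all
</Location>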

Alister Bulman
  • This. Put an `Alias` in each `<VirtualHost>` block. +1. – Steven Monday Dec 16 '10 at 21:01
  • Thanks! That worked perfectly. I knew there was an easy solution... – Mike B Dec 16 '10 at 21:04
  • If you want it on every single virtual-host, you don't need to put it into all of them. It can go on a global level, like the default /manual alias does out of the box. – Alister Bulman Dec 16 '10 at 21:52
  • Thanks for the solution although seeing C:/ in there makes me sick to my stomach knowing there is another windows server out there :) I put my edit in my modules.conf file or in mods-enabled/alias.conf like so: Alias /robots.txt /var/www/robots.txt – unc0nnected Oct 11 '12 at 16:53
  • @AlisterBulman, is it possible to use mod_alias to append to each domain's (if pre-existing) robots.txt. I just want some general rules in place but I need to allow different domains to have different rules. – Gaia Nov 02 '12 at 22:01
  • 1
    To make sure this file is available even when other access controls will block it, put the the alias, and ` Allow from all ` immediately after it, inside the main `` – Walf Jun 17 '13 at 04:34

Put your common global robots.txt file somewhere in your server's filesystem that is accessible to the apache process. For the sake of illustration, I'll assume it's at /srv/robots.txt.

Then, to set up mod_rewrite to serve that file to clients who request it, put the following rules into each vhost's <VirtualHost> config block:

RewriteEngine on
RewriteRule ^/robots\.txt$ /srv/robots.txt [NC,L]

If you're putting the rewrite rules into per-directory .htaccess files rather than <VirtualHost> blocks, you will need to modify the rules slightly:

RewriteEngine on
RewriteBase /
RewriteRule ^robots\.txt$ /srv/robots.txt [NC,L]
Steven Monday
  • Could you explain this "Put your common global robots.txt file somewhere in your server's filesystem that is accessible to the apache process. For the sake of illustration, I'll assume it's at /srv/robots.txt." in more detail? I need to know what you mean by creating a directory available to the apache process? – Mike B Dec 16 '10 at 20:55
  • Each site is contained in a folder like testsite.int.devcsd.com under C:\xampp\vhosts – Mike B Dec 16 '10 at 20:56
  • @Michael: Don't bother with this overly complicated `mod_rewrite` hack. Use `Alias` instead, as suggested by Alister. – Steven Monday Dec 16 '10 at 21:02

Not sure if you're running XAMPP on Linux or not, but if you are, you could create a symlink from every virtual host to the same robots.txt file. You just need to make sure that the Apache configuration for each virtual host is allowed to follow symlinks (Options FollowSymLinks under the <Directory> directive).
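A minimal sketch of that setup, assuming hypothetical /srv and /var/www paths:

# Create the symlink in each vhost's docroot, e.g.:
#   ln -s /srv/robots.txt /var/www/testsite/robots.txt
<Directory "/var/www/testsite">
    # Needed so Apache will serve the symlinked robots.txt
    Options FollowSymLinks
</Directory>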

gravyface
  • I'd rather not have to edit every single vhost declaration. There are over 30... Plus, I want it to be an automatic override so that I don't have to do anything when I create a new vhost. – Mike B Dec 16 '10 at 20:46
  • Michael, just use sed to do a mass edit, pretty easy stuff, you definitely don't need to do it manually. Laid out how to do it here at the bottom: http://blog.netflowdevelopments.com/2012/10/11/preventing-a-server-melt-down-and-saving-resources-block-all-shit-bot/ – unc0nnected Oct 11 '12 at 17:40

A different approach to the solution.

I host multiple virtual hosts (more than 300) in my cluster environment. To protect my servers from being hammered by crawlers, I define a Crawl-delay of 10 seconds.

However, I cannot force a fixed robots.txt configuration on all my clients. I let my clients use their own robots.txt if they wish to.

The rewrite module first checks whether the file exists. If it does not, it rewrites the request to my default configuration. A code example is below.

To keep the rewrite internal, an alias should be used. Instead of defining a new alias, which could cause conflicts on the user side, I located my robots.txt inside the /APACHE/error/ folder, which already has an alias in the default configuration.

<Directory /HOSTING/*/*/public_html>
        Options SymLinksIfOwnerMatch
        <Files robots.txt>
                RewriteEngine On
                # If the request maps to an existing file (the client's own
                # robots.txt) or directory, serve it as-is and stop.
                RewriteCond %{REQUEST_FILENAME} -f [OR]
                RewriteCond %{REQUEST_FILENAME} -d
                RewriteRule (.*) - [L]
                # Otherwise, rewrite internally to the aliased default copy.
                RewriteRule (.*) /error/robots.txt [L]
        </Files>
</Directory>
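For reference, the shared default at /error/robots.txt might contain something like this (a sketch; Crawl-delay is a nonstandard directive that only some crawlers honor, set here to the 10 seconds mentioned above):

# Default robots.txt for vhosts that do not supply their own
User-agent: *
Crawl-delay: 10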
aesnak