
Hullo,

Typically, if I type "oneofmysites.com/robots.txt" into my address bar, any browser will display the content of robots.txt. This is pretty standard behaviour.

I have just one web server that does not. Instead, requesting robots.txt redirects to the site's default page (i.e. "thesiteinquestion.com/"). This difference (on only one of seven sites) worries me.

Questions: Is this something to be concerned about? If so, what is the likely error that I am missing?

Notes:

  • This is the only one of my sites hosted with a different service provider.
  • CentOS release 6.10 (Final)
  • Webmin
  • robots.txt file permissions are 644
Parapluie

3 Answers


It depends on the server configuration: .txt files may not be allowed. There may be a rule somewhere in the config, or in some .htaccess file, specifying that if a URL doesn't match a certain pattern (say .html, .php, .htm, etc.), the request is redirected to the index page of the web root.

  • 2
    Well blue blistering barnacles! You are right. And I did it to myself with this rewrite: `RewriteRule \.(gif|jpg|js|txt)$ https://www.thesiteinquestion.com/index.php [L]`. I did this to prevent direct access, but I forgot that I added txt files as well. Comment it out, and it works a trice. Question: is there anyway to conditionally exclude files (this robots.txt file, in particular) from a rewrite? – Parapluie Feb 07 '19 at 00:34
  • @Parapluie Possibly with a rule that allows robots.txt placed before the one you have there. I think the web server goes through the rules sequentially and acts on the first match, so if a request matches robots\.txt it will act on that line. Examples here: https://serverfault.com/questions/213422/how-to-create-robots-txt-file-for-all-domains-on-apache-server – Serge Rivest Feb 11 '19 at 22:09
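A minimal sketch of what that ordering might look like in .htaccess, using the asker's own rewrite rule. The `-` substitution and `[L]` flag tell mod_rewrite to leave the matched request unchanged and stop processing further rules; the domain is the asker's placeholder:

```apache
RewriteEngine On
# Exempt robots.txt: "-" means no substitution, [L] stops rule processing here.
RewriteRule ^robots\.txt$ - [L]
# All other direct requests for these file types are sent to the index page.
RewriteRule \.(gif|jpg|js|txt)$ https://www.thesiteinquestion.com/index.php [L]
```

Because rules are evaluated in order, a request for robots.txt matches the first rule and never reaches the redirect below it.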

To add a bit of information: the web provider is not forced to respect the robots.txt standard at all, so it can do whatever it wants with the file, and, as Serge said, it can be redirected anywhere.

yagmoth555
  • The "web provider" is not forced to respect the standard? Am I misunderstanding?: Do you mean the crawler? – Parapluie Feb 07 '19 at 00:35
  • @Parapluie I mean the host is not forced to follow the robots.txt standard, and thus crawlers must adapt to such cases – yagmoth555 Feb 07 '19 at 00:37
  • That is interesting and germane. Thankfully, I have full access to the config in this case (even though my having access was the problem in the first place, at least I can fix it!) Thanks! – Parapluie Feb 07 '19 at 00:40

A crawler should read robots.txt and follow its restrictions, but the web server cannot enforce this.

.htaccess (or the server config file) can be used to keep out crawlers that don't comply, if you know who they are.
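A minimal sketch of such a block in .htaccess, matching on the User-Agent header. The bot name `BadBot` is a placeholder, not a real crawler:

```apache
RewriteEngine On
# Match the placeholder bot name case-insensitively in the User-Agent header.
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
# [F] returns 403 Forbidden; [L] stops further rule processing.
RewriteRule . - [F,L]
```

This only works for crawlers that identify themselves honestly; anything spoofing its User-Agent would need to be blocked by IP instead.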

WGroleau
  • Yes, indeed. I am currently using a jail script to ban IPs who ignore the robots.txt directives. [i.e.](https://softwareengineering.stackexchange.com/questions/180108/what-will-happen-if-i-dont-follow-robots-txt-while-crawling) – Parapluie Feb 08 '19 at 16:45