Questions tagged [robots.txt]

A convention that lets website owners ask web crawlers not to crawl specified parts of their site.

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the website hierarchy (e.g. www.example.com/robots.txt). This text file should contain the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
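The "specific format" mentioned above is a set of plain-text records, each starting with a User-agent line followed by Allow/Disallow rules. A minimal illustrative example (the directory names are hypothetical):

```text
# Rules for all crawlers
User-agent: *
# Keep crawlers out of these directories
Disallow: /private/
Disallow: /tmp/

# Rules for one specific crawler override the * group for that crawler
User-agent: Googlebot
Disallow: /nosearch/

# Sitemap location (an extension honored by the major search engines)
Sitemap: https://www.example.com/sitemap.xml
```

An empty `Disallow:` line means "nothing is disallowed"; a bare `Disallow: /` asks the matching crawler to stay out of the entire site.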

Source: Wikipedia

85 questions
24
votes
5 answers

How to set robots.txt globally in nginx for all virtual hosts

I am trying to set robots.txt for all virtual hosts under nginx http server. I was able to do it in Apache by putting the following in main httpd.conf: SetHandler None Alias /robots.txt…
anup
  • 657
  • 4
  • 8
  • 19
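One common approach to this question (a sketch, not necessarily the accepted answer; the file path is an assumption) is a shared `location` block that every `server` block includes, so all virtual hosts answer `/robots.txt` from one file:

```nginx
# e.g. /etc/nginx/snippets/robots.conf — add
# "include snippets/robots.conf;" inside each server {} block
location = /robots.txt {
    # alias serves this exact file regardless of the vhost's root
    alias /var/www/shared/robots.txt;
    log_not_found off;
    access_log off;
}
```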
23
votes
5 answers

How Can I Encourage Google to Read New robots.txt File?

I just updated my robots.txt file on a new site; Google Webmaster Tools reports it read my robots.txt 10 minutes before my last update. Is there any way I can encourage Google to re-read my robots.txt as soon as possible? UPDATE: Under Site…
qxotk
  • 1,434
  • 2
  • 15
  • 26
14
votes
5 answers

Which bots and spiders should I block in robots.txt?

In order to: increase the security of my website, reduce bandwidth requirements, and prevent email address harvesting.
DaveC
  • 243
  • 1
  • 7
10
votes
4 answers

How to create robots.txt file for all domains on Apache server

We have an XAMPP Apache development web server set up with virtual hosts and want to stop search engines from crawling all our sites. This is easily done with a robots.txt file. However, we'd rather not include a disallow robots.txt in every vhost and then…
Mike B
  • 203
  • 1
  • 2
  • 6
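For Apache, the usual technique (sketched here under the assumption of a shared directory; virtual hosts generally inherit main-server `Alias` directives unless they define their own) is a single `Alias` in the global configuration:

```apache
# In the main httpd.conf, outside any <VirtualHost> block:
# every vhost that doesn't override /robots.txt serves this file
Alias /robots.txt /srv/www/shared/robots.txt

<Directory "/srv/www/shared">
    Require all granted
</Directory>
```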
8
votes
3 answers

How do I use robots.txt to disallow crawling for only my subdomains?

If I want my main website to be on search engines, but none of the subdomains to be, should I just put the "disallow all" robots.txt in the directories of the subdomains? If I do, will my main domain still be crawlable?
tkbx
  • 201
  • 1
  • 2
  • 6
6
votes
4 answers

How do you create a single robots.txt file for all sites on an IIS instance

I want to create a single robots.txt file and have it served for all sites on my IIS (7 in this case) instance. I do not want to have to configure anything on any individual site. How can I do this?
Tim Erickson
  • 163
  • 1
  • 1
  • 5
5
votes
1 answer

Nginx robots.txt configuration

I can't seem to properly configure nginx to return robots.txt content. Ideally, I don't need the file and just want to serve text content configured directly in nginx. Here's my config: server { listen 80 default_server; listen [::]:80…
Denys S.
  • 225
  • 1
  • 4
  • 12
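When no file on disk is wanted at all, nginx can emit the robots.txt body directly from the configuration; a minimal sketch of this pattern:

```nginx
server {
    listen 80 default_server;
    server_name _;

    location = /robots.txt {
        # default_type sets the Content-Type of the response;
        # return sends the string as the body with status 200
        default_type text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
    }
}
```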
5
votes
6 answers

What happens if a website does not have a robots.txt file?

If the robots.txt file is missing in the root directory of a website, how are things treated: (a) the site is not indexed at all, or (b) the site is indexed without any restrictions? It should logically be the second one, according to me. I ask in reference…
Lazer
  • 415
  • 3
  • 7
  • 9
5
votes
6 answers

Blocking yandex.ru bot

I want to block all requests from the yandex.ru search bot. It is very traffic intensive (2GB/day). I first blocked one class C IP range, but it seems this bot appears from different IP ranges. For example: spider31.yandex.ru ->…
Ross
  • 268
  • 1
  • 3
  • 9
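Yandex documents that its crawler honors robots.txt, so before resorting to IP blocking it is worth trying a targeted record (illustrative; `Crawl-delay` throttles rather than blocks, for the case where the traffic, not the crawling itself, is the problem):

```text
# Ask Yandex's crawler to stay out entirely
User-agent: Yandex
Disallow: /

# Alternative: allow crawling but slow it down
# User-agent: Yandex
# Crawl-delay: 10
```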
3
votes
3 answers

robots.txt is redirecting to default page

Hullo, Typically, if I type into my address bar, "oneofmysites.com/robots.txt", any browser will display the content of robots.txt. As you can see, this is pretty standard behaviour. I have just one web server which does not. Instead, robots.txt…
Parapluie
  • 145
  • 9
3
votes
1 answer

Baidu Spider causing 3Gb of traffic a day - but I do business in China

I'm in a difficult situation, the Baidu spider is hitting my site causing about 3Gb a day worth of bandwidth. At the same time I do business in China so don't want to just block it. Has anyone else been in a similar situation (any spider)? Did you…
d.lanza38
  • 327
  • 1
  • 5
  • 13
3
votes
2 answers

Robots.txt - no follow, no index

Please can someone explain to me the difference between setting allow and disallow in a robots.txt file and creating nofollow, noindex meta tags? Is it possible to set nofollow and noindex within the robots.txt file? I have looked on…
Ian
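On the last question: noindex and nofollow are page-level directives expressed in HTML, not robots.txt syntax; robots.txt controls crawling, while the meta tag controls indexing and link-following. An illustrative page-level directive:

```html
<!-- In the <head> of an individual page: ask search engines not to
     index this page and not to follow the links it contains -->
<meta name="robots" content="noindex, nofollow">
```

Note the interaction: if robots.txt disallows a page, crawlers may never fetch it and so never see its meta tag, which is one reason a disallowed URL can still appear in results when other pages link to it.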
3
votes
1 answer

Why is googlebot requesting robots.txt from my SSH server?

I run ossec on my server and periodically I receive a warning like this: Received From: myserver->/var/log/auth.log Rule: 5701 fired (level 8) -> "Possible attack on the ssh server (or version gathering)." Portion of the log(s): Nov 19 14:26:33…
Brian
  • 766
  • 1
  • 6
  • 14
3
votes
3 answers

How to prevent discovery of a secure URL?

If I have a URL that is used for getting messages and I create it like so: http://www.mydomain.com/somelonghash123456etcetc and this URL allows other services to POST messages to it, is it possible for a search engine robot to find it? I don't want…
lamp_scaler
  • 577
  • 1
  • 5
  • 18
3
votes
2 answers

Is it a good idea to ban amazonaws.com?

The site is crawled by an anonymous bot hosted on Amazon EC2. This robot doesn't respect robots.txt and creates high load on the web server, so I added a check: if the reverse IP for a request ends with "amazonaws.com", the server returns a 403 page immediately. This…
valodzka
  • 177
  • 3
  • 10