Questions tagged [robots.txt]

A convention that lets site owners tell web crawlers which parts of a website they should not crawl.

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file should contain the instructions in a specific format (see the example below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.

Source: Wikipedia
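
As a concrete illustration of the format described above (the paths, bot name, and sitemap URL are placeholders, not taken from any question on this page):

    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

    User-agent: BadBot
    Disallow: /

    Sitemap: https://www.example.com/sitemap.xml

Each record starts with one or more User-agent lines naming the crawlers it applies to, followed by the Disallow (and, for crawlers that support it, Allow) rules for those crawlers.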

85 questions
0
votes
1 answer

Is there a way to block image spiders/bots on dedicated servers without using robots.txt or .htaccess?

We know that we can block certain spiders from crawling website pages using robots.txt or .htaccess, or maybe via the Apache configuration file httpd.conf. But that would require editing perhaps a large number of sites on some dedicated servers, and bots…
hsobhy
  • 171
  • 1
  • 2
  • 10
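
One hedged sketch of the kind of server-wide rule the question seems to be after, assuming Apache 2.4 and that the unwanted image bots identify themselves in the User-Agent header (the bot names and file extensions below are only examples):

    # httpd.conf (server-wide, Apache 2.4 syntax assumed) - applies to every site on the box
    SetEnvIfNoCase User-Agent "Googlebot-Image|bingbot" block_image_bot
    <FilesMatch "\.(gif|jpe?g|png|webp)$">
        <RequireAll>
            Require all granted
            Require not env block_image_bot
        </RequireAll>
    </FilesMatch>

Because it lives in the main server configuration, it does not need to be repeated in each site's .htaccess.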
0
votes
1 answer

robots.txt for a subdomain in IIS7

I have two different sites in IIS7 that both point to the same folder. They have different subdomains, www.sitename.com and foo.sitename.com; they are essentially the same website, but it runs different logic depending on the subdomain. I want www.sitename.com…
Crudler
  • 207
  • 1
  • 3
  • 10
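
One hedged approach for this setup, assuming the IIS URL Rewrite module is available: rewrite requests for /robots.txt to a per-subdomain file in the shared folder (robots-foo.txt is a hypothetical file name):

    <!-- web.config fragment; goes inside <system.webServer>, URL Rewrite module assumed -->
    <rewrite>
      <rules>
        <rule name="robots.txt for foo subdomain" stopProcessing="true">
          <match url="^robots\.txt$" />
          <conditions>
            <add input="{HTTP_HOST}" pattern="^foo\.sitename\.com$" />
          </conditions>
          <action type="Rewrite" url="robots-foo.txt" />
        </rule>
      </rules>
    </rewrite>

Requests to www.sitename.com/robots.txt fall through to the normal robots.txt file, while foo.sitename.com gets its own version.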
0
votes
1 answer

If I redirect all users (except me) from within .htaccess, do I need a robots.txt file?

So… I have the live version of my site, e.g. v1.0, at domain.com. I then have my development/testing version at testing.domain.com. I want testing.domain.com to only be accessible to me for testing, and as such I redirect all other IPs in my .htaccess…
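
A hedged note on the entry above: since every other IP is redirected, compliant crawlers never see the staging content anyway, so a disallow-all robots.txt on the testing host only adds anything if /robots.txt itself is exempted from the redirect. If it is, the file would simply be:

    # testing.domain.com/robots.txt - asks all compliant crawlers to stay out of the staging site
    User-agent: *
    Disallow: /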
0
votes
1 answer

Webcrawler bots load-test my website and it fails the test

We run a commercial website with a relatively small number of customers at any one time (~30 users). Frequently a webcrawler such as Googlebot, Bingbot, or 80legs will bring our site to a grinding halt. Altering robots.txt does not have an immediate…
NimChimpsky
  • 460
  • 2
  • 5
  • 17
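
One commonly suggested mitigation for the situation above, sketched here with an arbitrary delay value, is a Crawl-delay directive. It is non-standard: Bingbot and some other crawlers honour it, while Googlebot ignores it, so at best it is a partial fix, and crawlers only re-fetch robots.txt periodically, which is why changes are not immediate.

    User-agent: *
    Crawl-delay: 10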
0
votes
2 answers

How can I use Varnish to generate a robots.txt file, even for a subdomain of the same site?

I want to generate a robots.txt file using Varnish 2.1. That means that domain.com/robots.txt is served using Varnish, and subdomain.domain.com/robots.txt is also served using Varnish. The robots.txt must be hardcoded into the default.vcl file. Is…
Sam
  • 1
  • 2
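
A minimal sketch of the usual pattern for this, assuming Varnish 2.1 syntax; the disallow-all body and the 702 status code are arbitrary examples:

    # default.vcl - serve a synthetic robots.txt for every host handled by this Varnish
    sub vcl_recv {
        if (req.url == "/robots.txt") {
            error 702 "robots.txt";
        }
    }

    sub vcl_error {
        if (obj.status == 702) {
            set obj.status = 200;
            set obj.http.Content-Type = "text/plain";
            synthetic {"User-agent: *
    Disallow: /
    "};
            return (deliver);
        }
    }

Because the check in vcl_recv only looks at req.url, the same synthetic response is returned for the main domain and any subdomain pointed at this Varnish instance.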
0
votes
1 answer

Disallow XML files in robots.txt

Google's webmaster FAQs suggest that this will exclude all XML files from search: User-agent: Googlebot Disallow: /*.xml$ Is this valid for other bots as well? User-agent: * Disallow: /*.xml$
Ben K.
  • 2,149
  • 4
  • 17
  • 15
0
votes
1 answer

Cross-submission robots.txt for multiple domains on single host

We are running a site with multiple languages hosted in a single environment on IIS7. For example: oursite.com - English, oursite.de - German, oursite.es - Spanish. This is a single-host environment. All of these sites are in the same application…
0
votes
2 answers

301 redirect or disallow on robots.txt?

I recently asked about 301 redirection on ServerFault and I didn't get a proper solution to my problem, but now I have a new idea: use robots.txt to disallow certain URLs on my site from being "crawled". My problem was simple: after a migration from…
javipas
  • 1,292
  • 3
  • 23
  • 38
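
For context on the trade-off in the entry above: a 301 tells both users and crawlers where the content has moved, while a robots.txt Disallow only asks compliant crawlers not to fetch the old URLs, which can then linger in the index. A hedged .htaccess sketch with hypothetical paths:

    # .htaccess - permanent redirect from an old path to its new location (paths are examples)
    RewriteEngine On
    RewriteRule ^old-section/(.*)$ /new-section/$1 [R=301,L]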
0
votes
2 answers

Is this a valid robots.txt file?

I have this robots.txt file: User-agent: * Sitemap: Path_to_sitemap.xml My question is, should I have something else in there as well? Like an Allow all or something? Thanks
Anonymous12345
  • 1,012
  • 1
  • 12
  • 17
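
For reference against the excerpt above, an explicit allow-everything form looks like this; an empty Disallow value blocks nothing, and the Sitemap line should use an absolute URL (example.com is a placeholder):

    User-agent: *
    Disallow:

    Sitemap: https://www.example.com/sitemap.xml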
0
votes
2 answers

Quick Robots.txt question

Will the following robots.txt syntax correctly block all pages on the site that end in "_.php"? I don't want to accidentally block other pages. User-Agent: * Disallow: /*_.php Also, am I allowed to have both "Allow: /" and "Disallow:" commands…
bccarlso
  • 127
  • 2
  • 5
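
A hedged refinement of the pattern in the excerpt above: under the wildcard extensions that the major crawlers support, adding a $ anchors the match to the end of the URL, so only paths that actually end in _.php are blocked rather than anything that merely contains that substring:

    User-Agent: *
    Disallow: /*_.php$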
0
votes
1 answer

robots.txt seems to block /my-beautiful-sef-url-123

I have a robots.txt that looks like this: User-agent: * Disallow: /system/ Disallow: /admin/ Disallow: /index.php The obvious goal has been to prevent all the ugly URLs from being indexed, as they all begin with "/index.php". But for some reason all…
0
votes
0 answers

nginx configuration for robots.txt

I've read other answers and the Nginx docs, and I can't figure out why this works: location = /robots.txt { alias //static/robots.allow.txt; } and this doesn't: location = /robots.txt { rewrite .* /robots.allow.txt last; } for the…
Nestor
  • 101
  • 1
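
A hedged sketch of why the two forms above can behave differently: alias serves the named file directly from that location, whereas rewrite ... last restarts location matching with the new URI, so /robots.allow.txt must itself land in a location that can serve it. The /var/www/static path below is an assumption:

    location = /robots.txt {
        rewrite ^ /robots.allow.txt last;
    }

    # the rewritten URI is matched against locations again, so it needs a home of its own
    location = /robots.allow.txt {
        root /var/www/static;
        internal;          # only reachable via the internal redirect above
    }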
0
votes
0 answers

Make the webserver prevent parsing of certain HTML elements

The MediaWiki content management system creates many links whose target pages I don't want to be discovered by search engine crawlers. It's not only that I don't want them indexed, and not only that I don't want them crawled, but I don't even…
0
votes
0 answers

Cannot block YandexBot with mod_rewrite

We have an Apache httpd 2.4 server as our point of entry for about 20 web sites, and each site has its own virtualhost configuration. A lot of settings are probably redundant, but it suits our needs. Each virtualhost redirects http traffic to an https…
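
For reference, a minimal sketch of the usual mod_rewrite form for this; where it sits relative to the https redirect rules matters, since rules are evaluated in order within each context:

    # Return 403 for any request whose User-Agent contains "YandexBot" (case-insensitive)
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} YandexBot [NC]
    RewriteRule ^ - [F]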
0
votes
0 answers

With Nginx / Node.js reverse proxy how does Nginx serve robots.txt despite txt files not being referenced in Nginx config's location blocks?

In Chrome, when I enter https://www.example.com/robots.txt, my robots.txt file is served and works fine. I'm happy that it works, but I'm not sure why it does. In the config below I thought that my last location block, location /, was a catch-all that…
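
A hedged guess at what the excerpt above describes (the upstream address and the use of express.static are assumptions, not taken from the question): the catch-all location proxies /robots.txt to the Node app like any other path, and the app serves the file itself, so nginx never needs a txt-specific location block.

    location / {
        # /robots.txt falls through to here and is answered by the Node app,
        # e.g. by express.static or an explicit route - not by nginx itself
        proxy_pass http://127.0.0.1:3000;   # upstream address is an assumption
    }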