
I'm planning on adding a number of aggregated lists of pages to my sitemaps, and I don't want to make it too easy for outsiders to screen-scrape them. Can I protect my sitemap.xml so that only search engines can download it?

Would installing a firewall help? I'm using IIS 6.

Niels Bosma
What would be the difference between a request coming from a search engine and a request coming from a scraper introducing itself as a search engine? Are there pages that can only be found via the sitemap rather than by recursively following links on your pages? – andol Dec 14 '09 at 06:32

3 Answers


Off the top of my head, you could write rewrite rules that redirect requests for sitemap.xml to a 404 page if they don't match the expected user agents or IP addresses.

I don't have such a rewrite rule, but I'm 99% sure it's possible.
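As a rough illustration only, here is a minimal sketch of what such a rule could look like, assuming an Apache-style rewrite filter such as ISAPI_Rewrite or IIRF is installed on IIS 6 (IIS 6 has no URL rewriting built in). The bot names and the 404 target are placeholders:

    # Serve sitemap.xml only to requests whose User-Agent matches a known crawler;
    # rewrite everything else to a 404 page. Bot names and target path are examples.
    RewriteCond %{HTTP_USER_AGENT} !(Googlebot|msnbot|Slurp) [NC]
    RewriteRule ^/sitemap\.xml$ /404.aspx [L]

An IP allow-list could be layered on with a RewriteCond on %{REMOTE_ADDR}, but as the other answers point out, both checks are easy to get around.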

Jeff Atwood

As Dennis pointed out, spoofing this would be easy. Also, making sure that you didn't accidentally exclude a search engine would be hard.

Let's say you want to allow Google, Yahoo, and Bing to spider your site, so you only allow access to the sitemap for their user agents. Now there are two problems:

What if a service changes the user-agent? What if you need to include a different service? You now have to rewrite your rules before the service will be able to see the sitemap.

Why wouldn't I, as a site scraper, simply fraudulently report that I'm a Google spider? Setting an arbitrary user agent is possible (and easy) in any number of languages, and even in browsers such as Firefox and Safari.
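To show how little effort that takes, here is a short Python sketch (the URL is a placeholder) that requests a sitemap while presenting Googlebot's user-agent string:

    # Fetch a sitemap while claiming to be Googlebot -- any scraper can do the same.
    # The URL is a placeholder; the User-Agent string mimics Google's crawler.
    import urllib.request

    req = urllib.request.Request(
        "http://example.com/sitemap.xml",
        headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                               "+http://www.google.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as response:
        print(response.read()[:200])  # first bytes of the "protected" sitemap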

So, the short answer is, "No, but you can make it harder. But this puts a burden on you."

Ben Doom

How can you know what is and isn't a search engine? User agents are easily spoofed -- but leaving that aside, if you encounter an unknown user agent, can you tell whether it's a browser or a search engine? There are hundreds of companies running search engines, so simply allowing IPs from Google, Bing, et al. is hardly sufficient here.

Trying to keep your sitemap from anyone but search engines is a form of security through obscurity, and anyone who actually cares won't be stopped by any reasonable attempt to block them.

Jon Lasser