I'm planning to add a number of aggregated lists of pages to my sitemaps, and I don't want to make it too easy for outsiders to screen-scrape them. Can I protect my sitemap.xml so that only search engines can download it?
Install a firewall? I'm using IIS6.
Off the top of my head, you could use rewrite rules that redirect requests for sitemap.xml to a 404 page when they don't match the right user agents or IP addresses.
I don't have such a rewrite rule handy, but I'm 99% sure it's possible.
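As a rough, untested sketch only: IIS 6 has no built-in rewrite engine, so this uses the Apache mod_rewrite-style syntax supported by third-party modules such as ISAPI_Rewrite; the crawler names and the /404.aspx path are just placeholders.

    # Sketch only: rewrite sitemap.xml requests to the 404 page unless
    # the user-agent claims to be one of a few known crawlers.
    # Crawler names and /404.aspx are illustrative placeholders.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp) [NC]
    RewriteRule ^/?sitemap\.xml$ /404.aspx [L]

An IP-based variant would put a similar condition on %{REMOTE_ADDR} instead of the user-agent header.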
As Dennis pointed out, spoofing this would be easy. Also, making sure that you didn't accidentally exclude a search engine would be hard.
Let's say you want to allow Google, Yahoo, and Bing to spider your site, so you only allow access to the sitemap for their associated user agents. There are now two problems:
1. What if a service changes its user agent, or you need to include a different service? You now have to rewrite your rules before that service can see the sitemap.
2. Why wouldn't I, as a site scraper, simply report fraudulently that I'm a Google spider? Setting an arbitrary user agent is possible (and easy) in many languages, and even in browsers like Firefox and Safari (as sketched below).
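To make that second point concrete, here is a minimal sketch of spoofing a crawler's user agent using Python's standard urllib; the URL is a placeholder, and the user-agent string is Googlebot's published identifier, though any value at all could be sent.

    # Minimal sketch: fetch a sitemap while claiming to be Googlebot.
    # The URL is a placeholder; the user-agent string could be anything.
    import urllib.request

    req = urllib.request.Request(
        "https://example.com/sitemap.xml",
        headers={
            "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                          "+http://www.google.com/bot.html)"
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8")[:500])

As far as a user-agent check is concerned, this request looks exactly like a real crawler visit.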
So the short answer is, "No, but you can make it harder -- and doing so puts the burden on you."
How can you know what is and isn't a search engine? User agents can be faked -- but leaving that aside, if you encounter an unknown user agent, do you know whether it's a browser or a search engine? There are hundreds of companies running search engines, so simply allowing IPs from Google, Bing, et al. is hardly sufficient here.
Trying to keep sitemaps from search engines is a form of security through obscurity, and anyone who cares won't be blocked by any reasonable attempts to stop them.