
I'm planning on adding a number of aggregated lists of pages to my sitemaps, and I don't want to make it too easy for outsiders to screen-scrape them. Can I protect my sitemap.xml so that only search engines can download it?

Would installing a firewall help? I'm using IIS 6.

Niels Bosma
What would be the difference between a request coming from a search engine and a request coming from a scraper introducing itself as a search engine? Are there pages that can only be found via the sitemap rather than by recursively following links on your pages? – andol Dec 14 '09 at 06:32

3 Answers


Off the top of my head, you could write rewrite rules that redirect requests for sitemap.xml to a 404 page if they don't match the expected user agents or IP addresses.

I don't have such a rewrite rule, but I'm 99% sure it's possible.
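As a rough illustration only, here is a minimal sketch of what such a rule could look like, assuming an Apache-style rewrite filter such as ISAPI_Rewrite or IIRF is installed on IIS 6 (IIS 6 has no URL rewriting built in). The bot names and the 404 target are placeholders:

    # Serve sitemap.xml only to requests whose User-Agent matches a known crawler;
    # rewrite everything else to a 404 page. Bot names and target path are examples.
    RewriteCond %{HTTP_USER_AGENT} !(Googlebot|msnbot|Slurp) [NC]
    RewriteRule ^/sitemap\.xml$ /404.aspx [L]

An IP allow-list could be layered on with a RewriteCond on %{REMOTE_ADDR}, but as the other answers point out, both checks are easy to get around.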

Jeff Atwood

As Dennis pointed out, spoofing this would be easy. Also, making sure that you didn't accidentally exclude a search engine would be hard.

Let's say you want to allow Google, Yahoo, and Bing to spider your site, so you only allow access to the sitemap for their user agents. Now there are two problems:

What if a service changes the user-agent? What if you need to include a different service? You now have to rewrite your rules before the service will be able to see the sitemap.

Why wouldn't I, as a site scraper, simply fraudulently report that I'm a Google spider? Setting an arbitrary user agent is possible (and easy) in any number of languages, and even in browsers such as Firefox and Safari.
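To show how little effort that takes, here is a short Python sketch (the URL is a placeholder) that requests a sitemap while presenting Googlebot's user-agent string:

    # Fetch a sitemap while claiming to be Googlebot -- any scraper can do the same.
    # The URL is a placeholder; the User-Agent string mimics Google's crawler.
    import urllib.request

    req = urllib.request.Request(
        "http://example.com/sitemap.xml",
        headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                               "+http://www.google.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as response:
        print(response.read()[:200])  # first bytes of the "protected" sitemap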

So, the short answer is, "No, but you can make it harder. But this puts a burden on you."

Ben Doom

How can you know what is and isn't a search engine? User agents are easily spoofed -- but leaving that aside, if you encounter an unknown user agent, can you tell whether it's a browser or a search engine? There are hundreds of companies running search engines, so simply allowing IPs from Google, Bing, et al. is hardly sufficient here.

Trying to keep your sitemap from anyone but search engines is a form of security through obscurity, and anyone who actually cares won't be stopped by any reasonable attempt to block them.

Jon Lasser