6

I have a "content" website that some leechers and 419 scammers love to crawl agressively which also generates costs and performance issue. :( I have no choice: I need to prevent them to access the sitemap files and index. :(

I am doing the same as Facebook: I generate a sitemap index on the fly (/sitemap.php). I whitelisted the "good" crawlers with a reverse DNS lookup (PHP) and a user-agent check (same as Stack Overflow). To prevent the whitelisted engines from making the sitemap index content public, I added these headers (Stack Overflow forgot them):

header('Content-type: application/xml; charset="UTF-8"', true);
header('Pragma: no-cache');        // discourage intermediary caching
header('X-Robots-Tag: NOARCHIVE'); // ask crawlers not to keep a cached copy
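
For context, a simplified sketch of the gate around that output (is_verified_crawler() is just a placeholder for the reverse-DNS/user-agent whitelist check, not the actual code):

<?php
// sitemap.php - simplified sketch of the whitelist gate.
// is_verified_crawler() is a placeholder for the reverse-DNS + user-agent check.
if (!is_verified_crawler($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'])) {
    header('HTTP/1.1 404 Not Found');   // unknown visitors get nothing
    exit;
}

// ... the three header() calls shown above go here ...

echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
// one <sitemap><loc>https://example.com/sitemap-1.xml.gz</loc></sitemap> entry per generated file
echo '</sitemapindex>';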

Question 1: Am I missing something to protect the sitemap index file?

Question 2: The problem comes from the static sitemap files (.xml.gz) that are generated. How can I protect them? Even if they have "hard to guess" names, they can easily be found with a simple Google query (example: "site:stackoverflow.com filetype:xml"), and I have very limited access to .htaccess.

EDIT: This is not a server configuration issue. The preferred language is PHP.

EDIT 2: Sorry, this is a purely programming question, but it has been transferred from SO and I cannot close/delete it. :(

Toto
  • Can you register PHP as a handler for .gz and replicate sitemap.php functionality? –  Feb 08 '10 at 18:29

4 Answers

4

You could always use a URL for the sitemap that is not disclosed to anyone apart from the engines you explicitly submit it to.

Have a look at http://en.wikipedia.org/wiki/Sitemaps

cherouvim
  • The sitemap files will still be "findable" in the search engines' cache. :( – Toto Feb 08 '10 at 17:51
  • 2
    @Toto: I don't think they are. The example you posted holds only because someone linked to this file: http://meta.stackexchange.com/questions/22308/stackoverflow-sitemap-wtf – cherouvim Feb 08 '10 at 17:55
3

You should use a whitelist and only allow good search engines, like Google and Bing, to access these sitemap files.

This is a huge problem that I'm afraid most people don't even consider when submitting sitemap files to Google and Bing. I track every request to my XML sitemap files, and I've denied access to over 6,500 IPs since I started doing this (3 months ago). Only Google, Bing, and a few others ever get to view these files now.

Since you are using a whitelist and not a blacklist, they can buy all the proxies they want and they will never get through. Also, you should perform a reverse DNS lookup before you whitelist an IP, to make sure it really does belong to Google or Bing. As for how to do this in PHP I have no idea, as we are a Microsoft shop and only do ASP.NET development.

I would start by getting the ranges of IPs that Google and Bing run their bots out of. Then, when a request comes in from one of those IPs, perform a reverse DNS lookup and make sure "googlebot" or "msnbot" is in the host name. If it is, perform a forward DNS lookup against that name to make sure the IP address returned matches the original IP address. If it does, you can safely allow the IP to view your sitemap file; if it doesn't, deny access and 404 the jokers. I got that technique from talking to a Google techie, BTW, so it's pretty solid.
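
A rough PHP sketch of that double check, assuming the documented googlebot.com / google.com and search.msn.com host suffixes for Googlebot and Bingbot (IPv4 only; untested here):

<?php
// Verify that an IP really belongs to Googlebot or Bingbot: reverse lookup,
// check the host suffix, then resolve the host forward and compare to the IP.
function is_google_or_bing($ip)
{
    $host = gethostbyaddr($ip);            // reverse DNS lookup
    if ($host === false || $host === $ip) {
        return false;                      // no PTR record
    }

    // msnbot/bingbot hosts end in .search.msn.com, Googlebot hosts in .googlebot.com/.google.com
    $allowed_suffixes = array('.googlebot.com', '.google.com', '.search.msn.com');
    $ok = false;
    foreach ($allowed_suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $ok = true;
            break;
        }
    }
    if (!$ok) {
        return false;                      // host name does not belong to Google/Bing
    }

    // Forward-confirm: the host must resolve back to the original IP.
    return gethostbyname($host) === $ip;
}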

Note, I own and operate a site that does around 4,000,000 page views a month, so for me this was a huge priority; I didn't want my data scraped that easily. Also, I require a reCAPTCHA after 50 page requests from the same IP in a 12-hour period, and that works really well to weed out bots.
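
A minimal sketch of that kind of per-IP counter, assuming the APCu extension is available (the 50-request / 12-hour figures are the ones mentioned above):

<?php
// Count requests per IP and flag the client for a CAPTCHA once it passes
// the limit inside the time window (sketch; assumes the APCu extension).
function needs_captcha($ip, $limit = 50, $window = 43200)
{
    $key = 'hits_' . $ip;
    apcu_add($key, 0, $window);   // create the counter with a 12h TTL if missing
    $hits = apcu_inc($key);       // atomically increment
    return $hits !== false && $hits > $limit;
}

if (needs_captcha($_SERVER['REMOTE_ADDR'])) {
    // show the reCAPTCHA challenge instead of the page
}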

I took the time to write this post as I hope it will help someone else out and shed some light on what I think is a problem that goes largely unnoticed.

  • I use this same approach. I have noticed that the reverse lookup can be an issue with Bing, as some of their bots seem not to be properly configured. :) – Toto Jul 16 '13 at 16:00
1

How about not creating sitemap.php on the fly? Instead regenerate it once a day (or whatever makes sense) and serve it up as a static file. That way, even if 10,000 crawlers a day request it—so what?
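
A sketch of that approach with a daily cron job (the paths and the build_sitemap_xml() helper are illustrative, not actual code):

<?php
// regenerate_sitemap.php - run from cron, e.g. once a day:
//   0 3 * * * php /var/www/scripts/regenerate_sitemap.php
// build_sitemap_xml() is a placeholder for whatever currently builds the
// sitemap on the fly in sitemap.php.
$xml = build_sitemap_xml();

// Write atomically so crawlers never see a half-written file.
$target = '/var/www/public/sitemap.xml.gz';
$tmp    = $target . '.tmp';
file_put_contents($tmp, gzencode($xml, 9));
rename($tmp, $target);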

wallyk
  • Scammers would still have an easy way to run scripts to contact and try to scam our users, and to "duplicate" the content generated by our users. – Toto Feb 08 '10 at 17:57
0

You could use robots.txt to disallow the file, but you could also block the IPs. A simple way to do this is to look at the HTTP referrers in your web logs and write a cron job that takes those IPs (by referrer) and adds them to hosts.deny for your website.
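
A rough PHP sketch of such a cron job (the log path, log format and hosts.deny location are assumptions; it must run with enough privileges to write hosts.deny, and a real job would also exempt the whitelisted crawlers):

<?php
// block_sitemap_leechers.php - run from cron; scans the access log for
// requests to the sitemap files and appends any new IPs to hosts.deny.
$log       = '/var/log/apache2/access.log';   // Apache combined format assumed
$deny_file = '/etc/hosts.deny';
$already   = file_exists($deny_file) ? file_get_contents($deny_file) : '';

foreach (file($log) as $line) {
    if (strpos($line, 'sitemap') === false) {
        continue;                              // not a sitemap request
    }
    // The client IP is the first field of a combined-format log line.
    $ip = strtok($line, ' ');
    // NOTE: a real job would skip whitelisted crawler IPs here before blocking.
    if ($ip && strpos($already, $ip) === false) {
        file_put_contents($deny_file, "ALL: $ip\n", FILE_APPEND);
        $already .= "ALL: $ip\n";
    }
}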

Mech
  • The blacklist strategy is unfortunately not possible: the scammers and leechers buy lists of open proxies. :( – Toto Feb 08 '10 at 17:49
  • Ok, given that, do the referrers still have non-standard browser entries? If so, you can probably just change your PHP code to look through an array and see if the referrer matches a known type. Then, change your sitemap.xml to sitemap.php and use .htaccess to rewrite it to .xml. Then, if your sitemap is accessed by a bot, you can redirect it anywhere. Even if your .xml file is "static", it can be renamed and read by the PHP script from a protected directory. – Mech Feb 08 '10 at 18:00
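
A sketch of that last idea, assuming an .htaccess rewrite rule can map the public sitemap-*.xml.gz URLs onto a PHP gatekeeper and the real files live outside the web root (is_verified_crawler() is the same placeholder whitelist check as in the question):

<?php
// serve-sitemap.php - sketch of the gatekeeper idea above.
// Assumes an .htaccess rule like:
//   RewriteRule ^sitemap-([a-z0-9-]+)\.xml\.gz$ serve-sitemap.php?f=$1 [L]
// and that the real .gz files sit in a directory that is not web-accessible.
$dir  = '/var/www/private/sitemaps/';
$name = isset($_GET['f']) ? basename($_GET['f']) : '';   // strip any path tricks
$file = $dir . $name . '.xml.gz';

if ($name === '' || !is_file($file)
        || !is_verified_crawler($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'])) {
    header('HTTP/1.1 404 Not Found');   // leechers and URL guessers get nothing
    exit;
}

header('Content-Type: application/x-gzip');
header('X-Robots-Tag: NOARCHIVE');
readfile($file);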