I have a "content" website that some leechers and 419 scammers love to crawl aggressively, which generates costs and performance issues. :( I have no choice: I need to prevent them from accessing the sitemap files and the sitemap index. :(
I am doing the same as Facebook: I generate the sitemap index on the fly (/sitemap.php). I whitelist the "good" crawlers with a reverse DNS lookup (in PHP) and a user-agent check (same as Stack Overflow). To prevent the whitelisted engines from making the sitemap index content public, I added these headers (which Stack Overflow forgot):
header('Content-type: application/xml; charset="UTF-8"', true); // second argument replaces any previously set Content-Type
header('Pragma: no-cache');                                     // discourage caching by proxies and clients
header('X-Robots-Tag: NOARCHIVE');                              // ask the whitelisted engines not to keep a cached copy
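For context, the reverse DNS part of the whitelist looks roughly like the sketch below (not my exact code; the allowed host suffixes and the helper name is_whitelisted_crawler() are illustrative):

// Rough sketch of the whitelist used by /sitemap.php (illustrative only):
// forward-confirmed reverse DNS, i.e. IP -> hostname -> back to the same IP.
function is_whitelisted_crawler($ip)
{
    // Host suffixes the "good" crawlers are expected to resolve to (assumed list).
    $allowed_suffixes = array('.googlebot.com', '.google.com', '.search.msn.com');

    $host = gethostbyaddr($ip); // reverse lookup: IP -> hostname
    if ($host === false || $host === $ip) {
        return false;           // no usable PTR record
    }

    $suffix_ok = false;
    foreach ($allowed_suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffix_ok = true;
            break;
        }
    }
    if (!$suffix_ok) {
        return false;
    }

    // Forward-confirm: the claimed hostname must resolve back to the same IP.
    $ips = gethostbynamel($host);
    return is_array($ips) && in_array($ip, $ips, true);
}

// In /sitemap.php, before emitting the index:
if (!is_whitelisted_crawler($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 404 Not Found');
    exit;
}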
Question 1: Am I missing something to protect the sitemap index file?
Question 2: The problem comes from the static sitemap files (.xml.gz) that are generated. How can I protect them? Even though they have "hard to guess" names, they can easily be found with a simple Google query (example: "site:stackoverflow.com filetype:xml"), and I have very limited access to .htaccess.
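The only direction I can think of so far is to stop linking the raw .xml.gz URLs and serve each sitemap part through a small PHP gatekeeper that reuses the same whitelist check. A sketch of what I mean (the path, the id parameter and is_whitelisted_crawler() from the sketch above are illustrative assumptions, not code I already have):

// Sketch only: /sitemap-part.php?id=N serves a pre-built sitemap-N.xml.gz
// after the same whitelist check as /sitemap.php, so the static files
// themselves never need a public URL.
$id = isset($_GET['id']) ? (int) $_GET['id'] : 0;
$path = '/path/outside/docroot/sitemap-' . $id . '.xml.gz'; // assumed: parts stored where HTTP cannot reach them directly

if (!is_whitelisted_crawler($_SERVER['REMOTE_ADDR']) || !is_file($path)) {
    header('HTTP/1.1 404 Not Found');
    exit;
}

header('Content-Type: application/xml; charset=UTF-8');
header('Content-Encoding: gzip');  // the file on disk is already gzip-compressed
header('X-Robots-Tag: NOARCHIVE'); // same header as the index
header('Content-Length: ' . filesize($path));
readfile($path);

The sitemap index would then reference these .php URLs instead of the .xml.gz files, but I am not sure this is the right approach or whether I am overlooking something simpler.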
EDIT: This is not a server configuration issue. The preferred language is PHP.
EDIT 2: Sorry, this is a pure programming question, but it was migrated from SO and I cannot close/delete it. :(