Does there exist a forward proxy server that will look up and obey robots.txt
files on remote internet domains, enforcing them on behalf of requesters going via the proxy?
e.g. Imagine a website at www.example.com that has a robots.txt
file that disallows certain URLs and applies a Crawl-delay to others.
Multiple automatic clients (e.g. crawlers, scrapers) could then, going via the proxy, access www.example.com without violating the robots.txt directives AND without having to fetch the file themselves (=> simpler clients and fewer requests for robots.txt).
(Specifically, I am looking at the "GYM2008" version of the spec - http://nikitathespider.com/python/rerp/#gym2008 - because it's in wide use)
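To make concrete what I mean by "enforce", here is roughly the per-request logic I'd expect such a proxy to implement. This is just a sketch, not a reference to any existing product; it assumes the robotexclusionrulesparser module linked above, and the method names (`fetch`, `is_allowed`, `get_crawl_delay`) reflect my reading of its API. The `check_and_delay` helper is hypothetical:

```python
import time
from urllib.parse import urlparse

from robotexclusionrulesparser import RobotExclusionRulesParser

_parsers = {}      # per-host cache of parsed robots.txt files
_last_fetch = {}   # per-host timestamp of the last forwarded request

def check_and_delay(user_agent, url):
    """Return True if url may be fetched; sleep out any Crawl-delay first."""
    host = urlparse(url).netloc

    # Fetch and parse robots.txt once per host, then reuse it.
    parser = _parsers.get(host)
    if parser is None:
        parser = RobotExclusionRulesParser()
        parser.fetch("http://%s/robots.txt" % host)
        _parsers[host] = parser

    # Disallowed URL: the proxy should answer e.g. 403 instead of forwarding.
    if not parser.is_allowed(user_agent, url):
        return False

    # Honour Crawl-delay (GYM2008) by spacing out requests to this host.
    delay = parser.get_crawl_delay(user_agent)
    if delay:
        wait = _last_fetch.get(host, 0) + delay - time.time()
        if wait > 0:
            time.sleep(wait)
    _last_fetch[host] = time.time()
    return True
```

A real proxy would also need to expire the cached robots.txt periodically and serialize the per-host waits across concurrent clients; the sketch ignores both, it's only meant to pin down the behaviour I'm asking about.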