
Does there exist a forward proxy server that will lookup and obey robots.txt files on remote internet domains and enforce them on behalf of requesters going via the proxy?

e.g. Imagine a website at www.example.com that has a robots.txt file that restricts certain URLs and applies Crawl-Delays to others.

Multiple automated clients (e.g. crawlers, scrapers) could then, going via the proxy, access the website at www.example.com without violating the robots.txt directives AND without having to fetch the file themselves (=> simpler clients and fewer requests for robots.txt).

(Specifically, I am looking at the "GYM2008" version of the spec - http://nikitathespider.com/python/rerp/#gym2008 - because it's in wide use)
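For concreteness, the per-request logic I imagine the proxy doing is roughly the following (a minimal Python sketch only, assuming Python 3.6+ so that `urllib.robotparser` understands Crawl-delay; the function name and caching scheme are mine, not an existing product):

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

_parsers = {}      # one RobotFileParser per host, so robots.txt is fetched once
_next_slot = {}    # earliest time each host may be requested again (Crawl-delay)

def check(url, user_agent):
    """Return (allowed, seconds_to_wait) for this URL and User-Agent."""
    host = urlsplit(url).netloc
    rp = _parsers.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser("http://%s/robots.txt" % host)
        rp.read()                      # the only robots.txt fetch for this host
        _parsers[host] = rp
    if not rp.can_fetch(user_agent, url):
        return False, 0.0              # proxy would answer 403 or similar
    delay = rp.crawl_delay(user_agent) or 0.0
    wait = max(0.0, _next_slot.get(host, 0.0) - time.time())
    _next_slot[host] = time.time() + wait + delay
    return True, wait                  # proxy delays the request by `wait`
```

The point being that this check, and the caching of robots.txt, would happen once in the proxy rather than in every client.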

wodow

1 Answer


I'm not sure why enforcing compliance with robots.txt would be the job of a proxy: the crawler (robot) is supposed to pull robots.txt and follow the instructions contained in that file. As long as the proxy returns the correct robots.txt data, the crawler Does The Right Thing with that data, and the crawler supports using a proxy, you get all the benefits of a proxy with no extra work required.
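To illustrate, a crawler can respect robots.txt and go through a proxy at the same time with nothing but the standard library (a minimal Python sketch; the proxy address and User-Agent below are placeholders):

```python
from urllib import robotparser, request

PROXY = {"http": "http://proxy.internal:3128"}   # placeholder proxy address
UA = "MyBot/1.0"                                 # placeholder User-Agent

# Route everything, including the robots.txt fetch, through the proxy.
opener = request.build_opener(request.ProxyHandler(PROXY))
request.install_opener(opener)

rp = robotparser.RobotFileParser("http://www.example.com/robots.txt")
rp.read()                                        # one fetch of robots.txt

url = "http://www.example.com/some/page"
if rp.can_fetch(UA, url):
    page = opener.open(request.Request(url, headers={"User-Agent": UA})).read()
```

One robots.txt fetch, then purely local checks, all via the proxy.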


That said, I don't know of any proxy that does what you seem to be asking for (parse robots.txt from a site and only return things that would be allowed by that file -- presumably to control a crawler bot that doesn't respect robots.txt?). Writing a proxy that handles this would require doing a User-Agent-to-robots.txt mapping/check for every request the proxy receives. That is certainly possible (you could do it in Squid, but you'd need to bang together a script that turns robots.txt into Squid config rules and updates that data periodically), but it would undoubtedly be an efficiency hit on the proxy.
Fixing the crawler is the better solution (it also avoids "stale" robots.txt data being served to the crawler by the proxy). Note that a good crawler bot will check update times in the HTTP headers and only fetch pages if they've changed...
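If you do go the Squid route anyway, the "bang together a script" part would look roughly like this: fetch robots.txt for each site of interest, turn the Disallow lines into Squid ACLs, and write them to a file you include from squid.conf, re-running the whole thing from cron. This is only a sketch (the site list, file path and ACL names are made up, and the parsing deliberately ignores wildcards and per-User-Agent sections):

```python
import re
from urllib.parse import urlsplit
from urllib.request import urlopen

SITES = ["http://www.example.com"]          # sites your bots crawl (made up)
OUTPUT = "/etc/squid/robots_deny.conf"      # file included from squid.conf

with open(OUTPUT, "w") as conf:
    for site in SITES:
        host = urlsplit(site).netloc
        robots = urlopen(site + "/robots.txt").read().decode("utf-8", "replace")
        # Naively collect every Disallow path, ignoring which User-Agent
        # section it belongs to.
        paths = [line.split(":", 1)[1].strip()
                 for line in robots.splitlines()
                 if line.lower().startswith("disallow:")
                 and line.split(":", 1)[1].strip()]
        if not paths:
            continue
        acl = "robots_" + host.replace(".", "_")
        conf.write("acl %s_dom dstdomain %s\n" % (acl, host))
        conf.write("acl %s_path urlpath_regex %s\n"
                   % (acl, " ".join("^" + re.escape(p) for p in paths)))
        conf.write("http_access deny %s_dom %s_path\n" % (acl, acl))
```

Reload Squid after each run, and note that even then you haven't handled Crawl-delay or staleness, which is why I'd still rather fix the crawler.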

voretaq7
  • Thanks for answering! My thought is that the proxy can act as a central, single point of obedience to any `robots.txt` rules encountered - it means that **multiple** crawlers can start from one network, go through a proxy, and all, in **aggregate** obey any `robots.txt` rules encountered. – wodow Jan 03 '12 at 21:49
  • A program to continually update a Squid config could be a good solution if nothing already does this - thanks. – wodow Jan 03 '12 at 21:51
  • Again, the *right* thing ("good solution") is for the *crawler* to respect robots.txt -- Trying to make a proxy do this is adding a layer of unneeded complexity & maintenance, doing something the crawlers should (by the RFC) be doing anyway. Ultimately it's your network so implement the solution that you think makes sense, but as an admin I would not want the technical and bureaucratic responsibility of the proxy server / management software / periodic updates on my plate if I could avoid it :) – voretaq7 Jan 03 '12 at 22:13
  • I agree when you look at it like that, but arguably it's an issue of terminology. So, as a different model, define "crawler" to be a system of components - a forward web proxy and numerous agents behind it. The whole thing can be dubbed "a crawler" because that is what it looks like to the outside world. But now we can have multiple, probably heterogeneous requesting agents at the back and a single point at the front responsible for obeying remote policy (i.e. `robots.txt`). It's interesting! – wodow Jan 03 '12 at 22:44