1

I want to scrape our university's learning platform website so that I get a notification whenever a new entry is added to any lesson.

But I'm scared that they'll add a robots.txt afterwards and sue me or something, I don't know. I just don't have any experience with this. All I know is that I should look at robots.txt before scraping any website.

And I think they've just forgotten to put one up for now.

Anyway, how can I make sure beforehand, and keep proof, that it didn't exist when I was scraping? Any kind of proof that would be considered valid.

Kenan
  • 13
  • 2

1 Answer

2

robots.txt means nothing

The Simpsons explain it pretty well:

Simpsons joke "Keep out! - Or enter. I'm a sign, not a cop"

robots.txt is not an "access restriction", but instead merely a polite request to a complying web crawler not to index something. A web crawler can simply disregard this file and index whatever it wants anyways.
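To make "a complying web crawler" concrete, here is a minimal sketch of the check such a crawler would run voluntarily before fetching a page. It uses Python's standard `urllib.robotparser`; the robots.txt content, the `my-notifier` user agent, and the `lms.example.edu` URLs are made-up placeholders, not anything from your university's site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /grades/
"""

if __name__ == "__main__":
    for path in ("/grades/1", "/course/1"):
        url = "https://lms.example.edu" + path
        print(path, is_allowed(EXAMPLE_ROBOTS, "my-notifier", url))
```

Nothing enforces this check. A crawler that skips it gets exactly the same responses from the server, which is why the file is a request and not a restriction.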

If you want to be sure, simply send them an e-mail and ask for permission. Or you know, just do it. A web-crawler that runs once an hour and does a few hundred requests with one request per 500ms won't disturb any server.
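A rough sketch of that kind of throttling, assuming you fetch pages one at a time: the function below waits 500 ms between requests and only starts the next one after the previous response has come back. `crawl_politely`, the `fetch` callback, and the `sleep` parameter are names I made up for illustration; in practice `fetch` could wrap something like `urllib.request.urlopen`.

```python
import time
from typing import Callable, Dict, Iterable


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in order, pausing delay_seconds between requests.

    Requests are sequential, so a slow response delays the next request
    instead of piling more load onto the server.
    """
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without touching a real server.
    pages = crawl_politely(
        ["https://lms.example.edu/course/1", "https://lms.example.edu/course/2"],
        fetch=lambda url: "contents of " + url,
    )
    print(len(pages), "pages fetched")
```

If the server turns out to be slower than expected, raising `delay_seconds` is the simplest knob to turn.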

  • re: "_one request per 500ms won't disturb any server._"; depends on the server and LMS platform. For example, Moodle running a class with 500 students and many grade entries can definitely take longer than 500ms to respond to some queries. – dandavis Mar 01 '21 at 18:01
  • @dandavis I'm assuming a usual web server with usual response times. –  Mar 01 '21 at 19:59