
These days we have the problem of malicious/spam crawlers and similar concerns.

My suggestion would be to implement cookie support for crawling. By that I mean handing out specific cookies containing a crawler ID (ideally refreshed using refresh tokens), which let you identify the crawler, while a crawler presenting the cookie can access all the features of the website (a rough sketch of what I mean follows the list of benefits below). This whole topic only comes up because GDPR and similar legal frameworks exist; previously it would have been redundant.

Doing this would give the following benefits:

  • One could identify which crawler crawled the site at a specific point in time
  • One could identify which crawler gave its cookie away to someone else, or got hacked and had its cookie stolen
  • Only the crawlers you have given permission to would be able to crawl your websites (the biggest advantage in my opinion)
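
Here is a rough sketch of what I have in mind, purely for illustration: an HMAC-signed crawler-ID cookie verified server-side (the key, token format and approved-crawler list are made up; as far as I can tell, nothing like this exists in Search Console):

```python
import base64
import hashlib
import hmac
import time

SECRET_KEY = b"site-specific-secret"      # hypothetical signing key held by the site
ALLOWED_CRAWLERS = {"examplebot"}         # crawlers the site owner has approved

def issue_crawler_cookie(crawler_id: str, ttl: int = 3600) -> str:
    """Create a signed, expiring crawler-ID token to hand out via some dashboard."""
    payload = f"{crawler_id}:{int(time.time()) + ttl}"
    sig = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(f"{payload}:{sig}".encode()).decode()

def verify_crawler_cookie(token: str):
    """Return the crawler ID if the token is valid, unexpired and approved, else None."""
    try:
        decoded = base64.urlsafe_b64decode(token.encode()).decode()
        crawler_id, expiry, sig = decoded.rsplit(":", 2)
        expected = hmac.new(SECRET_KEY, f"{crawler_id}:{expiry}".encode(),
                            hashlib.sha256).hexdigest()
        if (hmac.compare_digest(sig, expected)
                and int(expiry) > time.time()
                and crawler_id in ALLOWED_CRAWLERS):
            return crawler_id
    except (ValueError, UnicodeDecodeError):
        pass
    return None

# The token would be set as a cookie that the approved crawler replays on every request.
token = issue_crawler_cookie("examplebot")
print(verify_crawler_cookie(token))   # -> "examplebot"
```

Logging the verified crawler ID per request would give the audit trail described in the first two points above.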

But I can't seem to find, for example in Google Search Console, any such functionality for handing a cookie to Googlebot. This also makes websites behind a cookie captive portal unreachable for crawlers.

  • Why is it so?
  • Is this a flaw in the general thinking of the masterminds behind the Internet or simply a design choice?

Yes, it might be necessary to know some of the Internet RFCs, or to look into them :/

  • @SteffenUllrich you would need to hand out the cookie via, for example, Google Search Console or a similar administrative crawler dashboard. We already have tons of similar configuration options for Googlebot there; we can specify the sitemap etc. – Munchkin Nov 15 '21 at 11:31
  • The benefits you outline don't look like benefits at all. Why do you want to know what crawler looked at ***your public pages***? How do you block "unauthorised" crawlers? What's the big threat you are trying to prevent? You claim that "malicious" crawlers are a problem, but I don't think that's actually a threat that needs mitigating. How does GDPR come into any of this? – schroeder Nov 15 '21 at 15:15
  • If you're trying to keep malicious/spam bots off of a site, then how do you tell them apart from random users? Some bots are quite easy to identify, but even Google uses bots that intentionally obfuscate themselves to look like users. If you require bots to authenticate themselves, then you'll need to have every user authenticate themselves as well. – Ghedipunk Nov 15 '21 at 15:20
  • @schroeder I found it as a nice side effect. Well, in my scenario the malicious crawlers don't have the cookie to crawl, so they would get stuck in the cookie captive portal or similar. The threat I want to prevent is the farming of e-mail addresses, phone numbers and similar information from my website by malicious parties. GDPR protects PII. – Munchkin Nov 15 '21 at 15:22
  • @Ghedipunk by the way they accept cookies. Crawlers, even malicious ones, can't do that properly AFAIK. Please let me know if it's otherwise – Munchkin Nov 15 '21 at 15:23
  • Uh, if it is on your page, then a crawler doesn't need to access it. Any user can... your mitigation will do nothing in terms of protecting anything. – schroeder Nov 15 '21 at 15:23
  • @Munchkin If you've published it, you can't protect it any more. The game's essentially up at that point. What you suggest is [snake oil](https://en.wikipedia.org/wiki/Snake_oil) that might confuse a few stupid attackers, but provide no real protection. – vidarlo Nov 15 '21 at 15:25
  • Crawlers can't accept cookies? Huh? Where did you get this idea? – schroeder Nov 15 '21 at 15:26
  • @Munchkin: Crawlers can easily deal with cookies, this is simple. And at least Googlebot can also deal with Javascript, which is way more complex. – Steffen Ullrich Nov 15 '21 at 15:29
  • I'd suggest using the [evil bit](https://datatracker.ietf.org/doc/html/rfc3514) to indicate malicious crawlers. That way you don't have to bother google with cookies. – vidarlo Nov 15 '21 at 15:30
  • Crawlers most definitely can accept cookies. It's such a useful feature for some tasks that turning it on is 1 line in PHP. If a crawler doesn't use cookies, it's only because they're not useful for that particular crawler, not because it's difficult. If sites started requiring cookies for any access, you'll find the bots will very quickly use them correctly. – Ghedipunk Nov 15 '21 at 15:41
  • @Ghedipunk if it can accept cookies, why doesn't Googlebot crawl past my cookie captive portal then? – Munchkin Nov 16 '21 at 08:47
  • You'd have to ask them for a definitive answer, but I suspect that accepting cookies is done through a POST request, which googlebot won't do. Spambots have no such reservations. – Ghedipunk Nov 16 '21 at 18:20
  • @Ghedipunk the cookie is set on the frontend using JavaScript by clicking a button "accept" – Munchkin Nov 17 '21 at 08:02
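
As the comments above point out, accepting and replaying cookies is trivial for a crawler. A minimal sketch, assuming Python's requests library and placeholder URLs and cookie names:

```python
import requests

# requests.Session stores any Set-Cookie headers it receives and replays them
# on later requests, so a crawler "accepts cookies" with no extra work.
session = requests.Session()

resp = session.get("https://example.com/")          # placeholder URL
print("Cookies received:", session.cookies.get_dict())

# If the consent cookie is only set client-side by a JavaScript "accept"
# button, a crawler can simply set the same cookie itself:
session.cookies.set("cookie_consent", "accepted", domain="example.com")

# Subsequent requests now carry the cookie past the captive portal.
resp = session.get("https://example.com/behind-the-consent-banner")
print(resp.status_code)
```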

1 Answer


This doesn't seem like the sort of thing you should want to do. If you have information that requires authentication to view, then that information is probably sensitive.

Let's use social media as an example. If I have a Facebook account set to private, I don't want anyone to be able to view it unless they are my friend. The way Facebook handles this is that unauthenticated users (including any crawlers) see a version of my page that is limited in scope: profile pictures are all public, as well as my name and maybe a little identifying information, depending on customized settings.

Now I don't need to give any special privileges to Google to view that page. Any unauthenticated user can view it, so Google can crawl and index that limited version of the page. This is most important because Google will preview some text on the page. Unauthenticated users can't accidentally get sensitive info like private posts by looking at the Google preview. But also, Google employees who have access to any caches the crawler creates cannot view that sensitive information.

However, if I gave Google special credentials to view this page and index it, it's possible that attackers could gain sensitive information, like partial or possibly even full content of private posts, by manipulating Google. Say I have a stalker. He wants to know all the times I've been to a certain location, and he knows I post a lot about where I go on facebook. My facebook is set to private, though, because I only want friends to see my activity, and I've blocked him because he's acted creepy towards me before. He could search my name as well as the name of that place and possibly get access to private information by looking at the Google preview for my posts, if Google has indexed this private version of the page. If they haven't, all he will get is my sanitized public version of my private profile.

So the solution here is: if information should be publicly indexed, you make it publicly accessible; if it should not, you do not give credentials to web crawlers. If you want to implement something like what Facebook has, where authenticated users can see more information, you do what Facebook does: you send different responses from the backend to the frontend based on whether the client provides legitimate credentials that allow them to access the sensitive material.
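
A minimal sketch of that last point, using Flask purely for illustration (the route, the token check, and the data are hypothetical placeholders, not Facebook's actual implementation):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative data: a profile with public fields and friends-only fields.
PROFILE = {
    "public": {"name": "Alice", "profile_picture": "/img/alice.jpg"},
    "private": {"posts": ["at the coffee shop", "concert tonight"]},
}

def is_authorized_friend(req) -> bool:
    """Placeholder check; a real app would validate a session or token and
    confirm the viewer is on the profile owner's friends list."""
    return req.headers.get("Authorization") == "Bearer friend-token"

@app.route("/profile/alice")
def profile():
    # Everyone, crawlers included, gets the limited public view by default.
    data = dict(PROFILE["public"])
    # Only an authenticated friend gets the sensitive fields added.
    if is_authorized_friend(request):
        data.update(PROFILE["private"])
    return jsonify(data)
```

The key point is that the crawler never needs special credentials: it simply receives the same limited response as any other unauthenticated client.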

jaredad7