These days we face the problem of malicious and spam crawlers, along with similar concerns.
My suggestion would be to implement cookie support for crawling: give each crawler a specific cookie carrying a crawler ID (ideally refreshed using refresh tokens) that lets you identify the crawler, while the crawler uses that cookie to access all the features of the website (see the sketch after the list below). This whole topic only floats up because GDPR and similar legal frameworks exist; previously the whole thing would have been redundant.
Doing this would give the following benefits:
- One could identify which crawler crawled the site, and when, at any specific point in time
- Identify which crawler gave away its cookies to someone else, or got hacked and had its cookies stolen
- Only the crawlers you have explicitly given permission would be able to crawl your websites (the biggest advantage, in my opinion)
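
To make the idea concrete, here is a minimal sketch of how such a signed crawler-ID cookie could be issued and verified on the server side. The crawler names, the shared secret, and the TTL are made-up placeholders for illustration; this is not how any existing search engine actually authenticates its bots.

```python
import hmac
import hashlib
import time

# Hypothetical shared secret and registry of crawlers we have approved.
SECRET = b"replace-with-a-real-secret"
APPROVED_CRAWLERS = {"examplebot", "newsbot"}
COOKIE_TTL = 3600  # seconds before the crawler must refresh its token

def issue_crawler_cookie(crawler_id: str) -> str:
    """Return a signed cookie value of the form id.expiry.signature."""
    if crawler_id not in APPROVED_CRAWLERS:
        raise ValueError(f"unknown crawler: {crawler_id}")
    expiry = str(int(time.time()) + COOKIE_TTL)
    payload = f"{crawler_id}.{expiry}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{crawler_id}.{expiry}.{sig}"

def verify_crawler_cookie(cookie: str) -> str | None:
    """Return the crawler ID if the cookie is valid and unexpired, else None."""
    try:
        crawler_id, expiry, sig = cookie.rsplit(".", 2)
    except ValueError:
        return None
    payload = f"{crawler_id}.{expiry}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # forged or tampered cookie
    if int(expiry) < time.time():
        return None  # expired: the crawler must request a fresh token
    if crawler_id not in APPROVED_CRAWLERS:
        return None  # permission has since been revoked
    return crawler_id

# Issue a cookie for an approved crawler and check it on incoming requests.
cookie = issue_crawler_cookie("examplebot")
print(verify_crawler_cookie(cookie))            # -> "examplebot"
print(verify_crawler_cookie("fake.0.deadbeef")) # -> None
```

Logging the result of `verify_crawler_cookie` on every request would give the audit trail described above: you can see which approved crawler visited and when, and a valid signature turning up from an unexpected source would point to leaked or stolen cookies.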
But I can't seem to find similar functionality, for example in Google Search Console, to hand such a cookie to Googlebot. This also makes websites behind a cookie captive portal unreachable for crawlers.
- Why is it so?
- Is this a flaw in the general thinking of the masterminds behind the Internet or simply a design choice?
Yes, answering this might require knowing some of the Internet RFCs, or at least looking into them :/