
How can I block crawlers from accessing anything on GitLab?

There should be a robots.txt or something similar to tell crawlers not to crawl anything. That would be a good first step.
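For illustration, roughly what I have in mind as that first step is a deny-all robots.txt served from the site root:

```
User-agent: *
Disallow: /
```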

But the more important thing: how can I tell GitLab that only authenticated access is allowed? E.g.

https://gitlab.yourdomain.com/ is publicly accessible,

and so is

https://gitlab.yourdomain.com/explore

If both URLs were protected behind authentication, no crawler could fetch anything anymore. But how do I configure that with GitLab CE?

To be even more clear: nothing except the login dialog should be publicly visible. How can I manage this with GitLab CE?

cilap

2 Answers


There is a robots.txt in the repository

https://gitlab.com/gitlab-org/gitlab-foss/blob/master/public/robots.txt

Also, if you set a project's visibility to private, you won't be able to view the project at the URLs in your example.
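If there are more than a handful of projects, the same change can be scripted against the REST API. A rough sketch, assuming Python with the requests package and an admin personal access token with api scope (the token value is a placeholder; the instance URL is the one from your question):

```python
import requests

GITLAB_URL = "https://gitlab.yourdomain.com"
HEADERS = {"PRIVATE-TOKEN": "<admin-personal-access-token>"}  # placeholder token

# Repeatedly fetch the first page of still-public projects and switch each one
# to private; changed projects drop out of the "public" filter, so the loop
# stops once nothing public is left.
while True:
    resp = requests.get(
        f"{GITLAB_URL}/api/v4/projects",
        headers=HEADERS,
        params={"visibility": "public", "per_page": 100},
    )
    resp.raise_for_status()
    projects = resp.json()
    if not projects:
        break
    for project in projects:
        requests.put(
            f"{GITLAB_URL}/api/v4/projects/{project['id']}",
            headers=HEADERS,
            json={"visibility": "private"},
        ).raise_for_status()
```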

Bert
  • Thanks for the suggestion; the projects are all not publicly visible, but the explore and users pages are still accessible. I do not want anything to be publicly visible except the login dialog – cilap Apr 11 '20 at 15:22

As mentioned here, using robots.txt is not enough:

  • robots.txt directives may not be supported by all search engines.
  • Different crawlers interpret syntax differently.
  • A page that's disallowed in robots.txt can still be indexed if linked to from other sites.

So you also need to use noindex.
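One way to do that, assuming GitLab sits behind NGINX (for example the bundled NGINX of an Omnibus install), is to send the X-Robots-Tag response header for the whole site. A minimal sketch of a server-block snippet:

```nginx
# Inside the GitLab server block: ask well-behaved crawlers not to
# index or follow anything served by this instance.
add_header X-Robots-Tag "noindex, nofollow";
```

On an Omnibus install that line could be injected through nginx['custom_gitlab_server_config'] in /etc/gitlab/gitlab.rb followed by gitlab-ctl reconfigure; if you run your own reverse proxy in front of GitLab, add it there instead.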

Vahid