> Question 1: the control machine
At Userify (full disclosure: we offer software to manage SSH keys), we deal with this all the time, since we also run the largest SSH key warehouse. We generally recommend a local installation rather than the cloud, since you gain control, reduce your attack surface, and can lock it down to known, trusted networks.
The important thing to remember is that, in a properly built system like this, there really shouldn't be any significant secrets that can be leaked to an attacker. If someone drives a forklift into your datacenter and walks away with your server, they won't get a whole lot except for some heavily hashed passwords, probably some heavily encrypted files, and some public keys without their corresponding private keys. In other words, not that much.
As you point out, the real threat here is an attacker gaining control of that machine and using it to deploy their own user accounts and (public) keys. This is a risk for virtually every cloud platform (e.g., Linode). Focus most strongly on preventing access to the control plane, which means minimizing attack surface (exposing only a few ports, and locking those ports down as much as possible) and preferably using software that is hardened against privilege escalation and common attacks (SQL injection, XSS, CSRF, etc.). Enable 2FA/MFA for access to the control plane.
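As a rough illustration of "locking it down to known trusted networks", here is a minimal Python sketch of a source-address allowlist check. The network ranges and the `is_trusted` helper are made up for the example; in practice you would normally enforce this at the firewall or security-group level rather than in application code.

```python
import ipaddress

# Hypothetical allowlist: replace these with the networks you actually trust.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),   # assumed office/VPN range
    ipaddress.ip_network("192.0.2.16/28"),  # assumed bastion range (documentation prefix)
]


def is_trusted(source_ip: str) -> bool:
    """Return True only if source_ip parses and falls inside an allowlisted network."""
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        return False  # unparseable input is never trusted
    return any(addr in net for net in TRUSTED_NETWORKS)
```

Anything that fails to parse is rejected rather than allowed, so the check fails closed.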
> So is it better to have a dedicated control machine in the data center or a remote control machine (like my laptop, remotely connected to the data center)?
It's definitely better to have a dedicated control machine in a secure datacenter, because you can isolate it and lock it down to prevent/minimize risk of theft or unauthorized access.
> If the best practice is to use my laptop (which could be stolen, of course, but I could have my public keys securely saved online in the cloud or offline on a portable encrypted device), what if I need to use some web interfaces with Ansible, like Ansible Tower, Semaphore, Rundeck or Foreman, which need to be installed on a centralised machine in the datacenter?
You don't need to run ANY web interface or secondary control plane to manage your keys (even Userify) until you are large enough to run into management issues: a larger number of users, different levels of authorization across servers, or users who need extra hand-holding because they lack the knowledge or access to update keys through Ansible. Userify at first was not much more than a bunch of shell scripts (today they'd be Ansible, probably!) and there's nothing wrong with that at all, until you start needing additional management control and easy ways for people to manage and rotate their own keys. (Of course, please take a look at Userify if you get to that point!)
> How to secure it and avoid it becoming a "single point of attack"?
Well, of course check out all of the resources on the net for locking things down, but most importantly start with a secure foundation:
1. Architect your solution with security in mind from the very beginning. Choose technology (e.g., databases or languages) that has traditionally had fewer problems, and then code with security at front-of-mind. Sanitize all incoming data, even from trusted users. Paranoia is a virtue.
2. Eventually, everything gets broken. Minimize the damage when that occurs: as you pointed out already, try to minimize the handling of secret material.
3. Keep it simple. Don't do the latest exotic stuff unless you're certain it will measurably and provably increase your security. For example, we selected X25519/NaCl (libsodium) over AES for our encryption layer (we encrypt everything, at rest and in motion), because it was originally designed and written by someone we trusted (DJB et al) and was reviewed by world-renowned researchers like Schneier and Google's security team. Use things that tend toward simplicity if they are newer, since simplicity makes it harder to conceal deep bugs.
4. Meet security standards. Even if you don't fall under a security regime like PCI or the HIPAA Security Rule, read through those standards and figure out how to meet them, or at least put very strong compensating controls in place. This will help ensure that you are truly meeting 'best practices'.
5. Bring in outside/independent penetration testing and run bug bounties to make sure you are following those best practices on an on-going basis. Everything looks great until you get some smart and highly motivated people banging on it... once that's finished, you'll have a great deal of confidence in your solution.
> Question 2: the SSH keys
> What is the best choice: let Ansible use the root user (with its public key saved in ~/.ssh/authorized_keys) / let the Ansible user run every command through sudo, specifying a password (which is unique and needs to be known by every sysadmin who uses Ansible to control those servers)
Try to avoid ever using passwords on servers, even for sudo. That means dealing with secrets, and it will ultimately undermine your security: you can't easily vary that sudo password between machines, you have to store it somewhere, and requiring a password means you can't really do server-to-server automation, which is exactly what this is all about. Also, if you leave SSH at its defaults, those passwords can be brute-forced, which makes the keys somewhat meaningless. Finally, avoid using the root user for any purpose, especially remote login.
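To make the point about SSH defaults concrete, here is a small Python sketch that audits the text of an sshd_config for password logins and root logins. The `REQUIRED` policy and the function names are my own assumptions for illustration, not an official checklist. The parser is deliberately simplified: it reads only whitespace-separated global options, stops at the first Match block, and does not follow Include directives. It does honor the sshd rule that the first value found for a keyword wins.

```python
def effective_global_options(config_text: str) -> dict:
    """Collect global sshd_config options; the first value seen for a keyword wins."""
    options = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # keywords are case-insensitive
        if keyword == "match":
            break  # sketch only: conditional Match blocks are not evaluated
        value = parts[1].strip() if len(parts) > 1 else ""
        options.setdefault(keyword, value.lower())
    return options


# Assumed policy for this sketch: no password logins, no root logins.
REQUIRED = {
    "passwordauthentication": "no",
    "permitrootlogin": "no",
}


def audit(config_text: str) -> list:
    """Return human-readable findings; an empty list means the assumed policy is met."""
    options = effective_global_options(config_text)
    findings = []
    for keyword, wanted in REQUIRED.items():
        actual = options.get(keyword)  # None means the option is not set explicitly
        if actual != wanted:
            findings.append(f"{keyword}: expected {wanted!r}, found {actual!r}")
    return findings
```

An option that is missing entirely is reported as a finding, because relying on a compiled-in default is exactly the "leave SSH at its defaults" situation described above.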
> Create an unprivileged user dedicated to Ansible with sudo access / let the Ansible user run every command through sudo without specifying any password
Exactly. An unprivileged user that you can audit back to ansible, with sudo roles. Ideally, create a standard user dedicated to server-to-server/ansible communications with sudo access (without password).
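As a sketch of what "sudo access without password" looks like for such an account, the following Python helper renders a sudoers drop-in line. The helper name and the username pattern are assumptions for illustration; the rule it emits grants full passwordless sudo, which you may well want to narrow to specific commands.

```python
import re

# Assumed conservative pattern for a Unix account name (hypothetical policy).
_NAME_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def sudoers_entry(user: str) -> str:
    """Render a sudoers line giving `user` passwordless sudo on all commands."""
    if not _NAME_RE.fullmatch(user):
        # Refuse anything unusual so the input cannot smuggle in extra sudoers rules.
        raise ValueError(f"refusing suspicious account name: {user!r}")
    return f"{user} ALL=(ALL) NOPASSWD: ALL\n"
```

For example, the output of `sudoers_entry("ansible")` could be written to a file under /etc/sudoers.d/ and checked with `visudo -cf <file>` before it is put in place, since a syntax error in sudoers can lock you out of sudo entirely.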
... N.B., if you were using Userify, the way I would suggest doing it would be to create a Userify user for ansible (you can also break this up by project or even server group if you have multiple ansible control machines), generate an SSH key on the control server, and provide its public key in its Userify profile page. (This textbox essentially becomes /home/ansible/.ssh/authorized_keys.) You should keep the ansible system account separate from other server-to-server system accounts such as a remote backup account, secret management, etc. Then invite your humans and they can create and manage their own keys as well and everything stays separated. But, just like with locking down an Ansible control server, try to lock down your Userify server (or whatever solution you deploy) in the same way.
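Whatever tool ends up writing authorized_keys, it is worth checking that what gets pasted in really is a public key. Below is a minimal Python sketch, assuming a bare OpenSSH public key line of the form `<type> <base64> [comment]` (no leading authorized_keys options). The function name is hypothetical. It relies on the fact that the base64 blob begins with a length-prefixed copy of the key type, and it does not verify the key material itself.

```python
import base64
import struct


def parse_public_key(line: str) -> tuple:
    """Return (key_type, comment) for an OpenSSH public key line, or raise ValueError."""
    if "PRIVATE KEY" in line:
        raise ValueError("this looks like private key material; never upload it")
    fields = line.strip().split(None, 2)
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64> [comment]'")
    key_type, blob_b64 = fields[0], fields[1]
    comment = fields[2] if len(fields) > 2 else ""
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except ValueError as exc:  # binascii.Error is a subclass of ValueError
        raise ValueError("key data is not valid base64") from exc
    if len(blob) < 4:
        raise ValueError("key data too short")
    # The blob starts with a 4-byte big-endian length followed by the key type name.
    (name_len,) = struct.unpack(">I", blob[:4])
    embedded = blob[4:4 + name_len]
    if len(embedded) != name_len or embedded.decode("ascii", "replace") != key_type:
        raise ValueError("declared key type does not match the key data")
    return key_type, comment
```

A check like this catches the two most common accidents: a truncated paste and someone pasting the private half of the key pair.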
> Any other hints?
I think you're definitely going about this the right way and asking the right questions. If you'd like to discuss this sort of thing, email me (first dot last name at userify) and I'd be glad to have a chat no matter what direction you ultimately pursue. Good luck!