What we need: several teams from different companies want to share our GPUs for deep learning tasks (three machines with several GPUs each), so we need to manage multiple GPUs for multiple users.
- Different teams should not have access to the data of other teams.
- Teams themselves should be able to run whatever container they need (with GPU support, e.g. TensorFlow, etc.)
- Each team should get at least 8 GPUs and at most e.g. 15 GPUs, so that the GPUs are utilized most of the time
- Stats about GPU usage would be good to see who is not using them.
- Several containers (per team) need access to the same datasets to train on
- Teams should not be able to escape the container, e.g. mount '/' from the host into the container and delete/remove/edit arbitrary files on the server, which would lead to a data breach.
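One common way to meet the quota and isolation requirements above is Kubernetes with the NVIDIA device plugin: each team gets its own namespace (which, combined with RBAC, keeps teams away from each other's resources), and a ResourceQuota caps that team's total GPU consumption. A minimal sketch, assuming the device plugin is installed so GPUs are schedulable as `nvidia.com/gpu`; the team name `team-a` is a placeholder:

```yaml
# Namespace per team gives access isolation (combined with RBAC and
# NetworkPolicies); the quota caps the team at 15 GPUs cluster-wide.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                      # hypothetical team name
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "15"   # hard upper bound per team
```

Note that a ResourceQuota only enforces the upper bound; the "at least 8 GPUs" guarantee would need a scheduling policy on top (e.g. priority classes or a fair-share scheduler), not a quota.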
Question: What are the best open source tools to achieve this?
e.g. something like Rancher 2.0? Mesosphere? How should we set up storage? NFS? How do Uber, Google, or other DL startups do this?
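For the shared-dataset requirement, plain NFS tends to work well with Kubernetes: export one directory per team and mount it `ReadOnlyMany` into every training pod, so jobs can read the data but not modify it. A sketch, with the server address, export path, and sizes as placeholders:

```yaml
# Shared dataset volume backed by NFS, mounted read-only across pods.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: team-a-datasets
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadOnlyMany            # many pods can mount the same datasets
  nfs:
    server: 10.0.0.10         # hypothetical NFS server address
    path: /exports/team-a     # hypothetical per-team export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: team-a-datasets
  namespace: team-a
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""        # bind to the pre-created PV, not a dynamic class
  resources:
    requests:
      storage: 500Gi
```

Training pods in the team's namespace then reference the claim as a volume; writable scratch space stays on local disk or a separate read-write volume.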