
What we need: several teams from different companies want to share our GPUs for deep learning tasks (three machines with several GPUs each). In short: manage multiple GPUs for multiple users.

  • Different teams should not have access to each other's data.
  • Teams themselves should be able to run whatever container they need (with GPU support, e.g. TensorFlow).
  • Each team should get at least 8 GPUs and at most e.g. 15 GPUs, so the GPUs are utilized most of the time (see the quota sketch after this list).
  • Stats about GPU usage would be good, to see who is not using them.
  • Several containers (per team) need access to the same datasets to train on.
  • Teams should not be able to escape the container, e.g. mount '/' from the host into the Docker container and delete/remove/edit arbitrary files on the server, which would lead to a data breach (see the pod sketch below).

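For the per-team isolation and the 8–15 GPU range, a Kubernetes namespace per team plus a ResourceQuota is one possible approach. A minimal sketch, assuming a hypothetical team namespace "team-a" and the NVIDIA device plugin exposing GPUs as the extended resource nvidia.com/gpu (quota on extended resources needs a recent Kubernetes version). Note that ResourceQuota only enforces the upper bound; the guaranteed minimum of 8 GPUs would have to come from capacity planning or a batch scheduler on top:

```yaml
# Hypothetical per-team namespace; "team-a" is a placeholder name.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
# Caps team-a at 15 GPUs total. ResourceQuota enforces maximums only;
# the "at least 8" guarantee has to come from capacity planning or a
# batch scheduler, not from this object.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "15"
```
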
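For running arbitrary GPU containers against shared datasets without letting them touch the host, the pod spec can request GPUs, mount the team's dataset volume read-only, and drop privileges. A sketch, assuming the hypothetical claim name "team-a-datasets"; blocking hostPath mounts outright is done cluster-side with admission control (e.g. PodSecurityPolicy), not in the pod spec itself:

```yaml
# Sketch of a team-owned training pod; image and names are examples.
apiVersion: v1
kind: Pod
metadata:
  name: tf-train
  namespace: team-a
spec:
  securityContext:
    runAsNonRoot: true           # refuse containers that run as root
  containers:
  - name: train
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2        # counted against the team's quota
    securityContext:
      privileged: false          # no direct host device/filesystem access
      allowPrivilegeEscalation: false
    volumeMounts:
    - name: datasets
      mountPath: /data
      readOnly: true             # shared training data, read-only
  volumes:
  - name: datasets
    persistentVolumeClaim:
      claimName: team-a-datasets # hypothetical per-team dataset claim
```
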
Question: What are the best open source tools to achieve this?

E.g. something like Rancher 2.0? Mesosphere? How should we set up storage? NFS? How do Uber, Google, or other DL startups do this?
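
On the storage question: if the datasets live on an NFS export, one way to share them across a team's containers is a ReadOnlyMany PersistentVolume bound to a per-team claim. A sketch with placeholder server address, export path, and size:

```yaml
# NFS-backed volume shared by all of team-a's training pods.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: team-a-datasets-pv
spec:
  capacity:
    storage: 500Gi              # placeholder size
  accessModes:
  - ReadOnlyMany                # many pods may read the same data
  nfs:
    server: 10.0.0.10           # hypothetical NFS server
    path: /exports/team-a       # hypothetical export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: team-a-datasets
  namespace: team-a
spec:
  accessModes:
  - ReadOnlyMany
  storageClassName: ""          # bind to the statically provisioned PV
  resources:
    requests:
      storage: 500Gi
```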

Comments:

  • Welcome to Server Fault! **Requests for product, service, or learning material recommendations** are considered [**off-topic**](http://serverfault.com/help/on-topic) on serverfault.com. Potentially your question can be reworded or made suitable for the [Software Recommendations](http://softwarerecs.stackexchange.com/help/on-topic) Stack Exchange community, but before posting, please read their guidelines. Alternatively, Wikipedia often has lists of available products. – HBruijn Dec 08 '17 at 10:56
  • If this seems to be off-topic, where can I ask or post this question? Which sites are suited for this type of question? – andi Dec 08 '17 at 11:05

0 Answers