I'm using SGE with a resource complex called 'gpu.q' that allows resource management of GPU devices (these are all NVIDIA devices). However, the systems have multiple GPU devices (in exclusive mode), and if two jobs are allocated on the same node there is no way for the user to transparently create a context on the correct GPU.
Has anyone run into this problem? I was thinking of somehow managing specific GPU resources and mapping host and device IDs, something like:
hostA -> gpu0:in_use
hostA -> gpu1:free
hostB -> gpu0:free
hostB -> gpu1:in_use
etc. Then, upon a resource request, the allocated GPU on each host would be revealed to the job through the CUDA_VISIBLE_DEVICES environment variable.
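To make the idea concrete, here is a rough sketch of a per-job wrapper that claims a free device via lock files and then runs the actual job with only that device visible. This is just an illustration of the scheme, not anything SGE provides: the lock directory path, the GPU count, and the wrapper approach itself are all assumptions.

```python
#!/usr/bin/env python3
"""Sketch: claim a free GPU on this host via lock files and expose it to
the job through CUDA_VISIBLE_DEVICES.

Assumptions (not part of SGE): /var/lock/gpu exists and is writable by job
users, and NUM_GPUS matches the GPUs on the node. A real setup would also
need an epilog or cleanup step to remove stale locks from killed jobs.
"""
import os
import subprocess
import sys

LOCK_DIR = "/var/lock/gpu"   # hypothetical per-host lock directory
NUM_GPUS = 2                 # GPUs per node in this example

def claim_free_gpu():
    """Try to atomically create a lock file for each device id in turn."""
    for dev in range(NUM_GPUS):
        lock_path = os.path.join(LOCK_DIR, f"gpu{dev}.lock")
        try:
            # O_CREAT | O_EXCL makes the claim atomic: only one job wins.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            return dev, lock_path
        except FileExistsError:
            continue  # this device is already in use, try the next one
    return None, None

def main():
    dev, lock_path = claim_free_gpu()
    if dev is None:
        sys.exit("no free GPU on this host")
    # Run the real job command with only the claimed GPU visible;
    # inside the job, that device always appears as device 0.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(dev))
    try:
        ret = subprocess.call(sys.argv[1:], env=env)
    finally:
        os.unlink(lock_path)  # release the GPU when the job exits
    sys.exit(ret)

if __name__ == "__main__":
    main()
```

The job script would then invoke its CUDA binary through the wrapper, e.g. `./gpu_wrapper.py ./my_cuda_job args...`, so the application never has to know which physical device it was given.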
This seems like a fairly common issue - it must have been solved by someone by now, given the prevalence of GPUs in compute clusters.