I'm currently facing the problem of integrating GPU servers into an existing SGE environment. Using Google I found some examples of clusters where this has been set up, but no information on how it was actually done.
Is there a how-to or tutorial on this anywhere? It doesn't have to be ultra-verbose, but it should contain enough information to get a "cuda queue" up and running...
Thanks in advance...
Edit: To set up a load sensor that reports how many GPUs on a node are free, I've done the following:
- set the compute mode of the GPUs to exclusive (the nvidia-smi commands for this and the next step are sketched after the script)
- set the GPUs to persistence mode
- add the following script to the cluster configuration as a load sensor (with the load report interval set to 1 second)
#!/bin/sh
# SGE load sensor: reports how many NVIDIA GPUs on this host are currently free.
# SGE writes a line to the sensor's stdin whenever it wants a load report and
# expects a begin/end block with "host:complex:value" lines in between.
hostname=`uname -n`

while true; do
    # SGE sends "quit" when the sensor should shut down; a read error also means exit.
    read input
    if [ $? -ne 0 ]; then
        exit 1
    fi
    if [ "$input" = "quit" ]; then
        exit 0
    fi

    # If nvidia-smi is not available, report zero free GPUs.
    smitool=`which nvidia-smi`
    if [ $? -ne 0 ]; then
        gpusavail=0
    else
        # total number of GPUs on the host ...
        gpustotal=`nvidia-smi -L | wc -l`
        # ... minus GPUs that currently have a compute process attached
        # (this works because the cards run in exclusive compute mode)
        gpusused=`nvidia-smi | grep "Process name" -A 6 | grep -v "+-" | grep -v "|=" | grep -v Usage | grep -v "No running" | wc -l`
        gpusavail=$((gpustotal - gpusused))
    fi

    # load sensor output format expected by SGE
    echo begin
    echo "$hostname:gpu:$gpusavail"
    echo end
done

exit 0
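For reference, the first two steps above can be done with nvidia-smi. This is only a sketch and assumes a driver recent enough to support these flags (on my drivers the numeric equivalent of EXCLUSIVE_PROCESS is 3); check your nvidia-smi version before copying it:

# put every GPU into exclusive compute mode (one compute process per GPU)
nvidia-smi -c EXCLUSIVE_PROCESS
# enable persistence mode so the driver stays loaded between jobs
nvidia-smi -pm 1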
Note: This obviously works only for NVIDIA GPUs
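To make SGE actually use the reported value, I also had to define a "gpu" complex and register the sensor in the host configuration. This is a rough sketch of what I did; the complex name "gpu" (which must match the name echoed by the script) and the sensor path are my own choices, and since the value comes straight from the load sensor, the complex is not set up as a consumable:

# 1. add the complex via "qconf -mc" (one line in the editor):
#name  shortcut  type  relop  requestable  consumable  default  urgency
gpu    gpu       INT   <=     YES          NO          0        0

# 2. register the load sensor on each GPU host via "qconf -mconf <hostname>":
load_sensor       /path/to/gpu_load_sensor.sh
load_report_time  00:00:01

# 3. jobs can then request a free GPU with e.g.:
qsub -l gpu=1 my_cuda_job.sh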