
I've got a GPU workstation with a 48-core CPU and 4 NVIDIA GPUs. I want to turn this machine into a small cluster containing:

4 nodes, 12 cores + 1 GPU per node

I've installed Torque on this machine with the command:

./configure --without-tcl --enable-nvidia-gpus --prefix=/soft/torque-5.1.1 --with-nvml-include=/usr/local/cuda/gpukit/usr/include/nvidia/gdk --with-nvml-lib=/usr/local/cuda/lib64

Then I set /etc/hosts as:

127.0.0.1       localhost cudaC
127.0.0.1       localhost cudaC1
127.0.0.1       localhost cudaC2
127.0.0.1       localhost cudaC3
xxx.xxx.xxx.x   torqueserver

After that, I added the following to /var/spool/torque/server_priv/nodes:

cudaC np=12 gpus=4
cudaC1 np=12 gpus=1
cudaC2 np=12 gpus=1
cudaC3 np=12 gpus=1

Then I started the Torque daemons:

#cd /soft/torque-5.1.1/sbin
#./pbs_server
#./pbs_sched
#./pbs_mom
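(Aside: since all four node names resolve to the same physical host, each virtual node needs its own pbs_mom listening on its own ports; the pbsnodes output below shows every node on the same default ports 15002/15003, with only one MOM actually running. Torque has a multi-MOM feature for exactly this setup. A hedged sketch, where the port numbers are illustrative and the exact `-m`/`-M`/`-R`/`-A` flags should be checked against the pbs_mom man page for your Torque version:

```shell
# server_priv/nodes: give each virtual node its own MOM ports
cudaC  np=12 gpus=1
cudaC1 np=12 gpus=1 mom_service_port=30001 mom_manager_port=30002
cudaC2 np=12 gpus=1 mom_service_port=30003 mom_manager_port=30004
cudaC3 np=12 gpus=1 mom_service_port=30005 mom_manager_port=30006

# start one pbs_mom per virtual node; -m enables multi-MOM mode,
# -M/-R set the service/manager ports, -A sets the alias name
./pbs_mom
./pbs_mom -m -M 30001 -R 30002 -A cudaC1
./pbs_mom -m -M 30003 -R 30004 -A cudaC2
./pbs_mom -m -M 30005 -R 30006 -A cudaC3
```
)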

Then I checked the status with pbsnodes:

cudaC                                                                                                                                                         
     state = free                                                                                                                                             
     power_state = Running                                                                                                                                    
     np = 12                                                                                                                                                  
     ntype = cluster                                                                                                                                          
     status = rectime=1435734456,cpuclock=Fixed,varattr=,jobs=,state=free,netload=136578103,gres=,loadave=0.00,ncpus=48,physmem=65982324kb,availmem=86084596kb,totmem=86954864kb,idletime=72,nusers=2,nsessions=5,sessions=1519 2350 6570 6781 11017,uname=Linux cudaC 3.16.7-21-desktop #1 SMP PREEMPT Tue Apr 14 07:11:37 UTC 2015 (93c1539) x86_64,opsys=linux                                                                                                                         
     mom_service_port = 15002                                                                                                                                 
     mom_manager_port = 15003                                                                                                                                 
     gpus = 4                                                                                                                                                 
     gpu_status = gpu[3]=gpu_id=0000:83:00.0;gpu_pci_device_id=398594270;gpu_pci_location_id=0000:83:00.0;gpu_product_name=Graphics Device;gpu_display=Enabled;gpu_fan_speed=22%;gpu_memory_total=12287 MB;gpu_memory_used=23 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=0%;gpu_memory_utilization=0%;gpu_temperature=43 C,gpu[2]=gpu_id=0000:82:00.0;gpu_pci_device_id=398594270;gpu_pci_location_id=0000:82:00.0;gpu_product_name=Graphics Device;gpu_display=Enabled;gpu_fan_speed=22%;gpu_memory_total=12287 MB;gpu_memory_used=23 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=0%;gpu_memory_utilization=0%;gpu_temperature=43 C,gpu[1]=gpu_id=0000:03:00.0;gpu_pci_device_id=398594270;gpu_pci_location_id=0000:03:00.0;gpu_product_name=Graphics Device;gpu_display=Enabled;gpu_fan_speed=22%;gpu_memory_total=12287 MB;gpu_memory_used=23 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=0%;gpu_memory_utilization=0%;gpu_temperature=45 C,gpu[0]=gpu_id=0000:02:00.0;gpu_pci_device_id=398594270;gpu_pci_location_id=0000:02:00.0;gpu_product_name=Graphics Device;gpu_display=Enabled;gpu_fan_speed=22%;gpu_memory_total=12287 MB;gpu_memory_used=45 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=0%;gpu_memory_utilization=1%;gpu_temperature=39 C,driver_ver=346.46,timestamp=Wed Jul  1 09:07:36 2015                                                                                                        

cudaC1                                                                                                                                                        
     state = down                                                                                                                                             
     power_state = Running
     np = 12
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

cudaC2
     state = down
     power_state = Running
     np = 12
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

cudaC3
     state = down
     power_state = Running
     np = 12
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

It seems that only one node works fine, and all 4 GPUs were assigned to that node.

How can I solve this problem?

peterh

1 Answer

My answer may not address your question directly, but I went through this whole topic a couple of years ago, and I suggest you use Slurm instead of Torque. As far as I remember, Torque doesn't use the CUDA_VISIBLE_DEVICES environment variable when scheduling processes without additional patches, but that's the way NVIDIA intended it to work (so most applications look for CUDA_VISIBLE_DEVICES).
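To illustrate the CUDA_VISIBLE_DEVICES point: a scheduler that supports it simply exports the variable in the job's environment before launch, and CUDA then renumbers only the listed devices starting from 0. A minimal sketch of the environment handling (the job launch itself is omitted; device IDs are illustrative):

```python
import os

# A GPU scheduler would build the job's environment like this,
# e.g. to hand the job physical GPUs 2 and 3.
job_env = dict(os.environ, CUDA_VISIBLE_DEVICES="2,3")

# Inside the job, CUDA sees only the listed devices, renumbered
# from 0: physical GPU 2 becomes device 0, GPU 3 becomes device 1.
visible = job_env["CUDA_VISIBLE_DEVICES"].split(",")
print(visible)       # ['2', '3']
print(len(visible))  # the job sees 2 devices
```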

Slurm, by contrast, comes with built-in GPU support via generic resources (GRES). In a mixed environment you can even define multiple card types and specify which one should be used for your job.
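For illustration, GPU generic resources in Slurm are declared in slurm.conf and gres.conf. A hedged sketch for a single 4-GPU node (the node name, core count, and device paths are assumptions for this example):

```
# slurm.conf (excerpt)
GresTypes=gpu
NodeName=cudaC CPUs=48 Gres=gpu:4 State=UNKNOWN

# gres.conf on cudaC (one line per GPU device)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
```

Jobs then request GPUs with e.g. `srun --gres=gpu:2`, and Slurm sets CUDA_VISIBLE_DEVICES for the job accordingly.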

Besides our 20-card setup, I've seen a couple of bigger GPU clusters, and they were all using Slurm.

Henrik
  • I see, thanks a lot for the comments. I just googled Slurm, but I could hardly find anything on configuring multiple nodes on a GPU workstation either... Do you have any idea how to make it work? I know some people use SGE or GE for this purpose, but I don't know how they make it work... – user2689449 Jul 01 '15 at 09:37
  • As far as I know, you can't manage GPUs in Torque. You are supposed to be able to add that functionality by using Moab with Torque. Moab is made by the same company that distributes Torque and is their flagship product. I've seen quotes for Moab, and I doubt that is a road you want to go down. I would do as @Henrik suggested and look at either Slurm or * Grid Engine. If it makes the transition easier, Slurm has a Torque compatibility component that will give you the familiar qsub and qstat commands and the ability to run your existing job files. – chuck Jul 01 '15 at 12:25
  • @user2689449 https://computing.llnl.gov/linux/slurm/gres.html - AFAIK SGE and GE also struggle with GPU handling. The initial effort for setting up and understanding Slurm is (or was, at least for me) higher than for Torque, but once you figure it out, it's pretty good. I would suggest first setting up your cluster as CPU-only, testing scheduling and Slurm in general, then adding the GPUs as generic resources once it's running smoothly. – Henrik Jul 01 '15 at 12:34
  • @user2689449 Some additions: Slurm config file generator https://computing.llnl.gov/linux/slurm/configurator.html ; our Slurm config http://paste.ubuntu.com/11804379/ ; the gres config has to be node-specific. – Henrik Jul 01 '15 at 12:45
  • OK. that's pretty helpful. I will play around these days. Many thanks again. – user2689449 Jul 01 '15 at 14:12
  • You're welcome. If this solves your problem, I'd love it if you marked my answer as accepted :) – Henrik Jul 01 '15 at 14:19