6

How do you use SGE to reserve complete nodes on a cluster?

I don't want 2 processors from one machine, 3 processors from another, and so on. I have a quad-core cluster and I want to reserve 4 complete machines, each having 4 slots. I cannot just specify that I want 16 slots, because that does not guarantee I will get 4 slots on each of 4 machines.

Changing the allocation rule to FILL_UP isn't enough because if there are no machines that are completely idle, SGE will simply "fill up" the least loaded machines as much as possible instead of waiting for 4 idle machines and then scheduling the task.

Is there any way I can do this? Is there a better place to ask this question?

Ben Pilbrow
artif

7 Answers

6

I think I found a way, but it probably doesn't work on older SGE versions like mine. It seems newer versions of SGE have exclusive scheduling built in.

https://web.archive.org/web/20101027190030/http://wikis.sun.com/display/gridengine62u3/Configuring+Exclusive+Scheduling

Another possibility I've considered, though it is quite error-prone, is to use qlogin instead of qsub and manually reserve 4 slots on each desired quad-core machine. Understandably, automating this is not particularly easy or fun.

Lastly, maybe this is a situation where hostgroups can be used: for example, creating a hostgroup with 4 quad-core machines in it, then qsubbing to this specific subset of a queue and requesting a number of processors equal to the total number in the group. Unfortunately this is a kind of hardcoding and has a lot of drawbacks, e.g. having to wait for people to vacate that particular hostgroup, and requiring changes if you want to switch to 8 machines instead of 4.

Martin M.
artif
3

It seems like there is this hidden command-line request to add:

-l excl=true

But you first have to configure it in your SGE or Open Grid Scheduler installation by adding it to the list of complexes (qconf -mc) and enabling it on each individual host (qconf -me hostname).

See this link for details: http://web.archive.org/web/20130706011021/http://docs.oracle.com/cd/E24901_01/doc.62/e21978/management.htm#autoId61

In summary:

type:

qconf -mc

and add the line:

exclusive    excl      BOOL      EXCL   YES          YES          0        1000

then:

qconf -me <host_name>

and edit the complex_values line to read:

complex_values        exclusive=true

If you have any host-specific complex_values already in there, then just comma separate them.
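With the complex defined and enabled on the hosts, a job can then request it at submission time. The following is only a minimal sketch of a job script, not a tested recipe: the PE name `mpi`, the job name, and the slot count of 16 (4 quad-core nodes) are assumptions, and `describe_allocation` is a hypothetical helper that just echoes what SGE granted.

```shell
#!/bin/bash
# Job name (assumed) and working directory.
#$ -N whole_nodes
#$ -cwd
# Request the exclusive complex configured above.
#$ -l excl=true
# "mpi" is an assumed PE name; 16 slots = 4 nodes x 4 slots.
#$ -pe mpi 16

# Hypothetical helper: report the slot count SGE exported to the job.
# NSLOTS is set by SGE for the job; outside SGE it is unset.
describe_allocation() {
    echo "slots granted: ${NSLOTS:-unknown}"
}

describe_allocation
```

The comments are kept on their own lines rather than after the `#$` options, so that nothing extra ends up being parsed as part of a directive.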

Martin M.
OttoV
2

SGE is weird with this, and I haven't found a good way to do this in the general case. One thing that you can do, if you know the memory size of the node you want, is to qsub while reserving an amount of memory almost equal to the full capacity of the node. This will ensure it grabs a system with nothing else running on it.
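As a rough sketch of this idea, the request can be computed from the node's memory size. The resource name `mem_free`, the 16 GB node size, and the 512 MB margin are all assumptions; which memory resource is requestable depends on how the cluster is configured, and `job.sh` is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: build a memory request just under a node's capacity.
# Arguments: node memory in MB, safety margin in MB.
near_full_mem_request() {
    local node_mb=$1 margin_mb=$2
    echo "mem_free=$(( node_mb - margin_mb ))M"
}

# Dry run: print the qsub command instead of executing it.
echo qsub -l "$(near_full_mem_request 16384 512)" job.sh
```

If the resource is only a reported load value rather than a consumable, the request is checked when the job is scheduled, so it may not keep later jobs off the same node.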

jgoldschrafe
  • Thanks, I considered something similar earlier except it was with load_avg instead of memory. Basically, like you say, it should be possible to specify a hard limit on a resource, that would probably only be satisfied if the machine is idle. – artif May 06 '11 at 08:40
  • After thinking about it some more, I think using resource limits is probably the best solution if your SGE doesn't support exclusive scheduling. Otherwise, the person should refer to my link to Sun's wiki. Anyways, thanks for the input! – artif May 06 '11 at 08:51
1

I'm trying to do almost exactly the same thing and am looking for ideas. I think a pe_hostfile is the best option, but I'm not a manager of our SGE system and there are no host files configured, so I need a quick workaround. I just checked out the Configuring Exclusive Scheduling link and see that it also requires manager rights...

I think a wrapper script could do it. I wrote a bash one-liner to figure out the number of available cores left on a machine (below). Our grid is heterogeneous, with one node having 24 cores, some 8, and the majority only 4, which makes things a little awkward.

Here's that bash one-liner anyway.

n_processors=$(qhost | awk -v name="$(hostname)" '$1 == name { print int($3) - int($4 + 0.99) }')

The problem now is how to get this bash variable into an SGE startup script preprocessing directive. Maybe I'll just provide the argument below in my shell script, as the pvm environment ships with SGE. That doesn't mean it's configured, though...

#$ -pe pvm 24-4
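One possible way around the directive problem is to pass the value on the qsub command line instead, where options take precedence over the `#$` lines embedded in the script. This is a sketch under assumptions: `free_cores` and `submit_command` are hypothetical helpers, the input is expected to look like `qhost` output (host name in column 1, NCPU in column 3, load in column 4), the `pvm` PE may not be configured, and `job.sh` is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: given qhost-style output on stdin, print NCPU minus
# the load average rounded up (via +0.99) for the named host.
free_cores() {
    awk -v name="$1" '$1 == name { print int($3) - int($4 + 0.99) }'
}

# Build the submission command for a host from qhost-style output on stdin.
# The command is printed rather than executed, so it can be inspected first.
submit_command() {
    local n
    n=$(free_cores "$1")
    echo "qsub -pe pvm $n job.sh"
}
```

Usage might look like `qhost | submit_command "$(hostname)"`, running the printed command once it looks right.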

Sun's page on Managing Parallel Environments is pretty helpful, although again, the instructions are mostly aimed for administrators.

Alex Leach
1

We set the allocation rule to the number of slots available on the node (in this case, 4). This means you can only start jobs with n*4 CPUs, but it will achieve the desired result: 16 CPUs will be allocated as 4 nodes with 4 CPUs each.
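To make the n*4 constraint concrete, here is a small sketch that converts a node count into the slot count to request and rejects totals that are not a multiple of the per-node value. The function names, the PE name `mpi_4`, and `job.sh` are hypothetical.

```shell
#!/bin/bash
# Slots to request for a number of whole nodes: nodes * slots_per_node.
slots_for_nodes() {
    echo $(( $1 * $2 ))
}

# Succeeds only if total_slots is a multiple of slots_per_node.
is_whole_node_request() {
    [ $(( $1 % $2 )) -eq 0 ]
}

total=$(slots_for_nodes 4 4)
if is_whole_node_request "$total" 4; then
    # Dry run: the command is printed, not executed.
    echo "qsub -pe mpi_4 $total job.sh"
fi
```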

Ansgar
1

Specify the allocation rule in the PE configuration as $pe_slots.

This will cause all slots to be allocated on a single host.

0

I finally found the answer to this. At first I used the -l excl=True setup described above. However, this does not quite solve the problem.

To fully solve the problem I had to set up an additional parallel environment. My cluster has a number of 12-core nodes, so I will use this as my example.

I created an additional environment called mpich_12, pasted below.

pe_name            mpich_12
slots              999
user_lists         sge_user
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    12
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

Note that the allocation_rule is set to 12. This means that the job MUST use 12 cores on each node. If you submit a job requesting 48 CPUs, it will wait and grab 4 FULL nodes when they are available.
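Inside a running job, the allocation can be sanity-checked against the file SGE passes as `$PE_HOSTFILE`, where each line starts with a host name followed by the number of slots granted on that host. A hedged sketch: `check_full_nodes` is a hypothetical helper, and 12 matches the example allocation_rule above.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every non-empty line of a PE hostfile
# shows exactly the expected number of slots in column 2.
check_full_nodes() {
    local hostfile=$1 expected=$2
    awk -v want="$expected" 'NF && $2 != want { bad = 1 } END { exit bad }' "$hostfile"
}

# Inside a job, SGE sets PE_HOSTFILE; skip the check when it is absent.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    if check_full_nodes "$PE_HOSTFILE" 12; then
        echo "all hosts granted 12 slots"
    else
        echo "partial node allocation detected" >&2
    fi
fi
```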

I still use the -l excl=True option, but I suspect this is irrelevant now.

If I have jobs that require only one CPU (and I do), I submit them to the same queue, but without the -l excl=True option, and I use my original parallel environment, which has allocation_rule set to $fill_up. Any job submitted with the mpich_12 environment will wait until there are complete nodes free. My cluster works so much better now.

sebix
Simon