I have SLURM set up on a single CentOS 7 node with 64 cores (128 CPUs). I have been submitting jobs successfully with both srun and sbatch, but only as long as I don't request memory: I can allocate CPUs, but not memory.
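For instance, a CPU-only srun request goes through, while the same request with --mem is refused (a minimal illustration on my part; hostname is just a stand-in command):

# works: CPUs only
srun --cpus-per-task=10 hostname

# refused as soon as memory is requested
srun --cpus-per-task=10 --mem=2000M hostname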
When I try to allocate memory I get:
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
So this will run:
#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --time=6-59:00
But this will not run:
#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --mem=2000M
#SBATCH --time=6-59:00
Similarly, this won't run:
#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=6-59:00
Both give the above error message.
This is a pain because, now that I am starting to max out the CPU usage, jobs clash and fail, and I believe it is because memory isn't being allocated properly, so programs crash with bad alloc error messages or just stop running. I have used SLURM quite a bit on Compute Canada clusters, and assigning memory there was no issue. Is the problem that I am running SLURM on a single node which is also the login node, or that I am essentially using default settings and need to do some admin work?
I have tried different units for memory, such as 2G rather than 2000M, and I have also tried 1024M, but to no avail.
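Since the units don't seem to matter, I suspect SLURM may simply not know how much memory the node has. I assume this can be checked with something like the following (dummyname is the node name from my config):

scontrol show node dummyname | grep -i RealMemory
sinfo -N -o "%N %m"

As far as I understand, scontrol show node reports the RealMemory that SLURM has configured for the node, and %m in the sinfo format string is the memory per node in megabytes.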
The slurm.conf file is:
ClusterName=linux
ControlMachine=dummyname
ControlAddr=dummyaddress
#BackupController=
#BackupAddr=
#
#SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=dummyport
SlurmdPort=dummyport+1
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/lib/slurm
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
#FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
#DebugFlags=gres
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
GresTypes=gpu
NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 State=IDLE Gres=gpu:2
#NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 State=IDLE
PartitionName=all Nodes=dummyname Default=YES Shared=Yes MaxTime=INFINITE State=UP
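Looking at the config, the NodeName line has no RealMemory and SelectTypeParameters only tracks cores, so my guess is that making memory a schedulable resource needs something like the lines below (this is just my assumption; RealMemory=250000 is a placeholder for the node's actual RAM in MB):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 RealMemory=250000 State=IDLE Gres=gpu:2

Is that the right direction, or does a single-node setup need more than that?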