I have SLURM set up on a single CentOS 7 node with 64 cores (128 CPUs). I have been submitting jobs successfully with both srun and sbatch, but only as long as I don't request memory: I can allocate CPUs, but not memory.
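For instance, a CPU-only srun request goes through, while the same request with --mem is refused (a minimal illustration on my part; hostname is just a stand-in command):

# works: CPUs only
srun --cpus-per-task=10 hostname

# refused as soon as memory is requested
srun --cpus-per-task=10 --mem=2000M hostname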
When I try to allocate memory I get:
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
So this will run:
#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --time=6-59:00
But this will not run:
#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --mem=2000M
#SBATCH --time=6-59:00
Similarly, this won't run:
#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=6-59:00
Both give the above error message.
This is a pain because, now that I am starting to max out the CPU usage, jobs clash and fail, and I believe it is because memory isn't being allocated properly, so programs crash with bad alloc error messages or just stop running. I have used SLURM quite a bit on Compute Canada clusters, and assigning memory there was no issue. Is the problem that I am running SLURM on a single node which is also the login node, or that I am essentially using default settings and need to do some admin work?
I have tried different units for memory, such as 2G rather than 2000M, and I have also tried 1024M, but to no avail.
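Since the units don't seem to matter, I suspect SLURM may simply not know how much memory the node has. I assume this can be checked with something like the following (dummyname is the node name from my config):

scontrol show node dummyname | grep -i RealMemory
sinfo -N -o "%N %m"

As far as I understand, scontrol show node reports the RealMemory that SLURM has configured for the node, and %m in the sinfo format string is the memory per node in megabytes.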
The slurm.conf file is:
ClusterName=linux
ControlMachine=dummyname
ControlAddr=dummyaddress
#BackupController=
#BackupAddr=
#
#SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=dummyport
SlurmdPort=dummyport+1
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/lib/slurm
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
#FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
#DebugFlags=gres
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
GresTypes=gpu
NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 State=IDLE Gres=gpu:2
#NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 State=IDLE
PartitionName=all Nodes=dummyname Default=YES Shared=Yes MaxTime=INFINITE State=UP
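Looking at the config, the NodeName line has no RealMemory and SelectTypeParameters only tracks cores, so my guess is that making memory a schedulable resource needs something like the lines below (this is just my assumption; RealMemory=250000 is a placeholder for the node's actual RAM in MB):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 RealMemory=250000 State=IDLE Gres=gpu:2

Is that the right direction, or does a single-node setup need more than that?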