
I have a Slurm-managed cluster of 10 worker nodes and 1 master node. I previously got the cluster working, after some teething problems, and put all my scripts and instructions on my GitHub repo:

https://brettchapman.github.io/Nimbus_Cluster

I recently had to start over to increase hard drive space, and now I can't get Slurm installed and configured correctly, no matter what I try.

slurmctld and slurmdbd install and start correctly (systemctl status shows both as active and running), but slurmd remains in a failed/inactive state on the compute nodes.

The following is my slurm.conf file:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=node-0
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0

# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0

# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=

# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=password
#JobCompPort=
JobCompType=jobcomp/mysql
JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=

# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=

# COMPUTE NODES
NodeName=node-[1-10] NodeAddr=node-[1-10] CPUs=16 RealMemory=64323 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[1-10] Default=YES MaxTime=INFINITE State=UP
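As a quick sanity check on the node definition above, the following is a minimal sketch that verifies CPUs equals Sockets x CoresPerSocket x ThreadsPerCore on every NodeName line. The function name check_node_cpus is hypothetical (not part of Slurm), and the sketch assumes each node definition sits on a single line starting with NodeName=.

```shell
# Sketch: report whether CPUs matches Sockets*CoresPerSocket*ThreadsPerCore
# for each NodeName line in the slurm.conf given as $1.
# check_node_cpus is an illustrative helper, not a Slurm command.
check_node_cpus() {
    awk '/^NodeName=/ {
        cpus = sockets = cores = threads = ""
        # Walk the space-separated Key=Value tokens on the line.
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "CPUs")           cpus = kv[2]
            if (kv[1] == "Sockets")        sockets = kv[2]
            if (kv[1] == "CoresPerSocket") cores = kv[2]
            if (kv[1] == "ThreadsPerCore") threads = kv[2]
        }
        # Skip lines that do not spell out the full topology.
        if (cpus == "" || sockets == "" || cores == "" || threads == "") {
            print $1 ": incomplete topology, skipped"
            next
        }
        expected = sockets * cores * threads
        verdict = (cpus + 0 == expected) ? "consistent" : "MISMATCH"
        printf "%s: CPUs=%d, Sockets*CoresPerSocket*ThreadsPerCore=%d (%s)\n", $1, cpus, expected, verdict
    }' "$1"
}
```

For the node line above this gives 1 x 8 x 2 = 16, which matches CPUs=16, so the definition is at least internally consistent. Whether it matches the real hardware is a separate question; running slurmd -C on a compute node prints the configuration Slurm actually detects.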


And the following is my slurmdbd.conf file:

AuthType=auth/munge
AuthInfo=/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=info
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=password
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm

Running pdsh -a sudo systemctl status slurmd on my compute nodes gives me the following error:

pdsh@node-0: node-5: ssh exited with exit code 3
node-6: ● slurmd.service - Slurm node daemon
node-6:      Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
node-6:      Active: inactive (dead) since Tue 2020-08-11 03:52:58 UTC; 2min 45s ago
node-6:        Docs: man:slurmd(8)
node-6:     Process: 9068 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
node-6:    Main PID: 8983
node-6: 
node-6: Aug 11 03:34:09 node-6 systemd[1]: Starting Slurm node daemon...
node-6: Aug 11 03:34:09 node-6 systemd[1]: slurmd.service: Supervising process 8983 which is not our child. We'll most likely not notice when it exits.
node-6: Aug 11 03:34:09 node-6 systemd[1]: Started Slurm node daemon.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Killing process 8983 (n/a) with signal SIGKILL.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Killing process 8983 (n/a) with signal SIGKILL.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Succeeded.
pdsh@node-0: node-6: ssh exited with exit code 3

I did not get this error when I previously had the cluster up and running, so I'm unsure what I did differently this time. My guess is that it has something to do with file/folder permissions, which I have found to be quite critical during setup, and I may have missed documenting a step I did previously. This is my second attempt at setting up a Slurm-managed cluster.
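One way to test the permissions guess is to list the owner and mode of every path that slurm.conf points Slurm at. The following is a minimal sketch, not a definitive diagnostic: the function names slurm_paths and check_slurm_paths are hypothetical helpers, and the set of directives it looks at is simply the path-valued ones that are active in the slurm.conf above.

```shell
# Sketch: print the value of each active (uncommented) path-valued
# directive in the slurm.conf given as $1. Helper names are illustrative.
slurm_paths() {
    grep -E '^(SlurmdSpoolDir|StateSaveLocation|SlurmdPidFile|SlurmctldPidFile|SlurmdLogFile|SlurmctldLogFile)=' "$1" |
        cut -d= -f2- |      # keep everything after the first "="
        awk '{print $1}'    # drop any trailing comment
}

# Show owner and permissions for each path, or flag it as missing.
check_slurm_paths() {
    slurm_paths "$1" | while read -r path; do
        ls -ld "$path" 2>/dev/null || echo "missing: $path"
    done
}
```

Assuming the config lives at /etc/slurm-llnl/slurm.conf (the Debian/Ubuntu package location, which is an assumption here), calling check_slurm_paths /etc/slurm-llnl/slurm.conf on a compute node shows at a glance whether the spool, log, and PID-file locations exist and who owns them, which can then be compared against SlurmUser=slurm.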

My entire workflow and scripts can be followed from my GitHub repo. If you need any other error outputs, please ask.

Thank you for any help you can provide.

Brett
