Questions tagged [slurm]

Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management or SLURM), or Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.

38 questions
8
votes
1 answer

Slurm node daemon error: Can't open PID file

I run systemctl start slurmd.service, and it times out: Job for slurmd.service failed because a timeout was exceeded. The relevant lines from running systemctl status slurmd.service: Mar 23 17:13:42 fedora1 systemd[1]: Starting Slurm node…
user3273814
  • 213
  • 3
  • 8
4
votes
1 answer

Why does Slurm fail to start with systemd but work when starting manually?

I've just set up slurm where one physical machine will be the only system in the cluster (so far). This is on Ubuntu 18.04. I have slurmdbd running, but when I attempt to start up slurmd and slurmctld this times out. Why? I'm issuing the following…
deltafft
  • 41
  • 1
  • 2
3
votes
1 answer

Unable to contact slurm controller

I followed the steps to troubleshoot here: https://slurm.schedmd.com/troubleshoot.html. When running scontrol show slurmd, I get: Active Steps = NONE Actual CPUs = 1 Actual Boards = 1 Actual sockets =…
user3273814
  • 213
  • 3
  • 8
3
votes
3 answers

Randomize Slurm Node Allocation

Has anyone had luck randomizing Slurm node allocations? We have a small cluster of 12 nodes that could be used by anywhere from 1 to 8 people at a time, with jobs of various sizes and lengths. When testing our new Slurm setup, jobs always go to the first node…
tnallen
  • 31
  • 1
2
votes
0 answers

Slurm - Does it maintain ccNUMA?

Does a SLURM cluster control, maintain or enforce Cache Coherence across the Nodes? Is it a configuration property, or does something like this not exist? I can't find anything inside the docs.
Semo
  • 271
  • 2
  • 9
2
votes
1 answer

slurmdbd fails to start (initial installation)

I tried to install slurmdbd for accounting on Ubuntu 16.04 from the standard repositories (version: 15.08.7-1build1). Here are the commands: $ sudo apt-get install mysql-server $ sudo mysql > create user 'slurm'@'localhost' identified by…
Sethos II
  • 497
  • 4
  • 7
  • 18
2
votes
1 answer

How can I set up interactive-job-only or batch-job-only partition on a SLURM cluster?

I'm managing a PBS/torque HPC cluster, and now I'm setting up another cluster with SLURM. On the PBS cluster, I can set a queue to accept only interactive jobs by qmgr -c "set queue interactive_q disallowed_types = batch" and to accept only batch…
wdg
  • 143
  • 1
  • 5
2
votes
0 answers

Slurm srun cannot allocate resources for GPUs - Invalid generic resource specification

I am able to launch a job on a GPU server the traditional way (using CPU and MEM as consumables): ~ srun -c 1 --mem 1M -w serverGpu1 hostname serverGpu1 but trying to use the GPUs will give an error: ~ srun -c 1 --mem 1M --gres=gpu:1 hostname srun:…
user324810
  • 121
  • 3
2
votes
0 answers

Managing SLURM memory on single node installation (issues)

I have SLURM set up on a single CentOS 7 node with 64 cores (128 CPUs). I have been using SLURM to submit jobs successfully using both srun and sbatch. However, it is with the caveat that I don't allocate memory. I can allocate CPUs, but not…
Wesley
  • 71
  • 4
2
votes
1 answer

SLURM with "partial" head node

I am trying to install SLURM with NFS on a small Ubuntu 18.04 HPC cluster in the typical fashion, i.e. configuring a controller (slurmctld), clients (slurmd), a shared directory, etc. What I am curious about is whether there is a way to set it up such that…
rage_man
  • 123
  • 3
2
votes
1 answer

Slurm: "Connection refused" for certain sacctmgr commands

I have an existing Slurm cluster up and running, but as of today, without any configuration change, I get an error when I run certain sacctmgr commands, and slurmdbd crashes: $ sacctmgr list associations sacctmgr: error:…
Sethos II
  • 497
  • 4
  • 7
  • 18
2
votes
2 answers

Query peak GPU memory used by finished job

I have a SLURM job I submit with sbatch, such as sbatch --gres gpu:Tesla-V100:1 job.sh. The script job.sh trains a model on a V100 GPU. The code itself does not log GPU memory usage. Is there a SLURM command to query peak GPU memory usage once the job is…
1
vote
0 answers

srun /bin/hostname fails: slurmctld does not respond

I have a master node (Ubuntu 18.04) and two compute nodes (Ubuntu 18.04). There is no problem with the connection, and Munge works. Commands such as sinfo and scontrol show nodes run without problems. I tried to find the error, but found nothing wrong. I…
NAMENAME KANG
  • 21
  • 1
  • 4
1
vote
0 answers

Slurm service failed to start again, and I don't know why

I have one master node and two slave nodes. One slave node connects successfully, but the other fails to connect. Each node runs Ubuntu 18.04 and Slurm 17.11. When I run systemctl status slurmd.service, I receive this error: slurmd.service - Slurm…
NAMENAME KANG
  • 21
  • 1
  • 4
1
vote
1 answer

How do I prevent additional jobs from a given user from starting?

With the Slurm workload manager, how can I prevent more jobs from user bob from starting? Existing jobs should continue to run. The user should be able to submit more jobs, but those jobs shouldn't be able to start.