
I have a SLURM job I submit with sbatch, such as

sbatch --gres gpu:Tesla-V100:1 job.sh

job.sh trains a model on a V100 GPU. The code itself does not log GPU memory usage.
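For context, job.sh is essentially a plain batch script along these lines (simplified; the training command is a placeholder):

#!/bin/bash
#SBATCH --time=24:00:00

# Placeholder training command; the real script sets up its own environment.
python train.py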

Is there a SLURM command to query peak GPU memory usage once the job is finished?

2 Answers


I am not sure it is possible to isolate the load caused by the sbatch job itself, but you can check the general utilization metrics of your card. As I understand it, for NVIDIA cards there is the nvidia-smi tool; other tools are mentioned in this question.

So I would suggest installing nvidia-smi and running it in a separate terminal window with a command like:

watch nvidia-smi

And then run your job. You should see the load on your card change in real time.

One more possibility is to trace your job with other profilers. Unfortunately I don't have an NVIDIA card and cannot check any of these tools, but I hope this helps your investigation.

user2986553
  • Thanks for your answer! That is what I would do in a non-batch/queue scenario. But if jobs are submitted with SLURM, it does not help to watch `nvidia-smi` on the SLURM host, since the job will be executed on a different machine. (My login node does not have any GPUs). I'm interested to know whether GPU memory used by a finished job can be queried, presumably with a SLURM command. – Mathias Müller Mar 11 '20 at 12:40
  • As I understand it, GPU performance counters are not standard Linux metrics and will vary based on hardware and driver version. Based on this, I suppose this data won't be available as a simple metric from SLURM. I googled the [scontrol show job](https://www.brightcomputing.com/blog/blog/bid/172545/how-to-submit-a-simple-slurm-gpu-job-to-your-linux-cluster) command; it may provide some data about a specific job and its node designation. After figuring out the node, you can log in there and try to gather statistics with nvidia-smi (see the sketch after these comments). – user2986553 Mar 11 '20 at 13:13
  • Or you can [assign your job to a specific node](https://unix.stackexchange.com/questions/443438/how-to-submit-a-job-to-a-specific-node-using-slurms-sbatch-command). But you still need direct access to the nodes. – user2986553 Mar 11 '20 at 13:14
  • Thanks for your suggestions, +1! I will probably accept my own answer, because in my opinion it answers my question more directly. – Mathias Müller Mar 11 '20 at 20:12
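A sketch of the approach suggested in the comments above: find the node the job runs on, then poll nvidia-smi there. This assumes you are allowed to SSH into compute nodes; the job id is a placeholder.

JOBID=123456
NODE=$(squeue -j "$JOBID" -h -o %N)   # or: scontrol show job "$JOBID" | grep NodeList
ssh "$NODE" nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5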

After talking to staff from our HPC team: it seems that

SLURM does not log GPU memory usage of running jobs submitted with sbatch.

Hence, this information cannot be recovered with any SLURM command. For instance, a command like

sacct -j [job id]

does show general memory usage, but not GPU memory usage.
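For example, the following shows CPU-side memory statistics such as MaxRSS, but no GPU memory field (the format fields are standard sacct options; the job id is a placeholder):

sacct -j 123456 --format=JobID,JobName,MaxRSS,Elapsed,State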

  • Did you find any other way of monitoring GPU memory usage, for a running (not finished) job? – GoodDeeds Sep 24 '21 at 17:23
  • @GoodDeeds I don't know whether you can monitor usage while the job is still running. But if you are using SLURM, you could find out on which machine your job is being executed, request a shell login on exactly that machine and then use a tool like `nvidia-smi` for live monitoring. Or the job itself can of course query and log GPU usage while it runs (see the sketch below). – Mathias Müller Sep 24 '21 at 18:25
  • I cannot do the first, but the second suggestion seems doable. Thanks! – GoodDeeds Sep 24 '21 at 18:26
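A minimal sketch of that second suggestion, logging GPU memory from inside job.sh itself. The nvidia-smi query flags are standard; the file name, interval and training command are placeholders.

#!/bin/bash
#SBATCH --gres=gpu:Tesla-V100:1

# Log GPU memory every 10 seconds in the background while training runs.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 10 > gpu_mem_${SLURM_JOB_ID}.csv &
MONITOR_PID=$!

python train.py

# Stop the logger; peak usage is the maximum of the memory.used column in the CSV.
kill "$MONITOR_PID"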