Is there a reasonable/safe level of CPU load in the context of high performance computing?
I understand what load average means for servers in general, but I don't know what to expect on servers built and used for high-performance computing.
Does the usual rule of thumb, load <= # of cores, still apply in this environment?
I ask because on my system load >> # of cores is typical. Each node has:
- 24 physical cores, hyperthreaded to 48 logical cores (relatively new hardware)
- load averages: typically 100-300
The nodes stay up for long stretches under consistently high CPU usage/load. Hardware failures, CPU failures in particular, have been rare so far, but I don't know what to expect over the lifetime of the nodes given the sustained high load.
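For reference, the ratio I'm asking about can be checked with a short Python sketch (assuming a Unix host, since os.getloadavg() is Unix-only; the exact core counts below are from my nodes, yours will differ):

```python
import os

# Load average on Linux counts tasks that are runnable OR in
# uninterruptible (e.g. disk I/O) sleep, so it can legitimately
# exceed the number of cores.
cores = os.cpu_count()            # logical cores (48 on these nodes)
load1, load5, load15 = os.getloadavg()

for label, load in (("1m", load1), ("5m", load5), ("15m", load15)):
    print(f"{label} load {load:8.2f} -> {load / cores:6.2f} per logical core")
```

With the numbers below (1-minute load 313.33 on 48 logical cores), that ratio is roughly 6.5 runnable-or-waiting tasks per logical core.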
Example top output:
top - 14:12:53 up 4 days, 5:45, 1 user, load average: 313.33, 418.36, 522.87
Tasks: 501 total, 5 running, 496 sleeping, 0 stopped, 0 zombie
%Cpu(s): 33.5 us, 50.9 sy, 0.0 ni, 15.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19650371+total, 46456320 free, 43582952 used, 10646443+buff/cache
KiB Swap: 13421772+total, 78065520 free, 56152200 used. 15164291+avail Mem
   PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 85642 user  20   0   36.5g   7.6g 245376 S  1566  4.0   1063:21 python
 97440 user  20   0   33.1g   5.3g  47460 S  1105  2.8 512:10.86 python
 97297 user  20   0   31.0g   4.0g  69828 S 986.4  2.1 430:16.32 python
181854 user  20   0   19.3g   5.0g  19944 R 100.0  2.7   2823:09 python
...
Output of iostat -x 5 3 on the same server:
avg-cpu:  %user  %nice %system %iowait %steal  %idle
          50.48   0.00   12.06    0.38   0.00  37.08

Device:  rrqm/s wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda      350.41 705.68  58.12  22.24 2126.25 3393.61   137.36     6.02  74.93    9.10  246.94   1.19   9.56
dm-0       0.00   0.00   4.87   8.70  511.41  516.65   151.59     0.31  22.55   28.40   19.28   2.62   3.56
dm-1       0.00   0.00 403.67 719.23 1614.71 2876.92     8.00     8.83   7.10    7.38    6.95   0.08   9.05
dm-2       0.00   0.00   0.00   0.00    0.02    0.01    65.03     0.00   3.74    3.82    1.00   2.12   0.00