I upgraded from IBM LSF Suite for Enterprise 10.2.0.10 to version 10.2.0.12, and now, on only one of our GPU cluster servers (1 out of 8), I can't get the LIM service to stay running. It keeps crashing with a segmentation fault:

lim[42062]: segfault at 0 ip 00007f63476c07f7 sp 00007f6345218958 error 4 in libc-2.27.so[7f6347607000+1e7000]

The process generally segfaults after a job has been submitted to the server or has finished there. If there is a running job on the server, the LIM and its child processes fail a minute or so after starting.

Since we are using the IBM Academic Initiative at a university Bioinformatics chair, we have no access to support or Fix Packs other than major releases.

nvidia-smi currently shows the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   40C    P8    25W / 260W |   3968MiB / 48601MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:3E:00.0 Off |                  Off |
| 33%   25C    P8    12W / 260W |      1MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     On   | 00000000:89:00.0 Off |                  Off |
| 33%   24C    P8    21W / 260W |      1MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 8000     On   | 00000000:B1:00.0 Off |                  Off |
| 33%   24C    P8    15W / 260W |      1MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I managed to get a core dump of the segmentation fault and ran it through gdb. Here is the backtrace, followed by some further inspection:

(gdb) bt
#0  __strcat_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S:298
#1  0x00000000004efa5c in getNvidiaGpu (index=-1408930708, dev=0x7f7dac056810, allDevices=0xbdd9, errorGPU=0x0, errorCount=0, warningGPU=0x7f7dac011730, warningCnt=2) at lim.gpu.c:580
#2  0x00000000004f074b in getGpuReportFullThreadFunc () at lim.gpu.c:858
#3  0x00000000004f11ad in collectGpuInfoThread (arg=0x7f7dac056c6d) at lim.gpu.c:949
#4  0x00007f7db92756db in start_thread (arg=0x7f7db5ec8700) at pthread_create.c:463
#5  0x00007f7db83d771f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Here is the assembly where it fails:

=> 0x00007f7db836f7f7 <+1255>:  movdqu (%rsi),%xmm1

And here we can see that rsi, the source pointer being read, is 0, i.e. a NULL pointer. In other words, strcat appears to have been called with a NULL source string inside getNvidiaGpu, so the first 16-byte read (movdqu (%rsi),%xmm1) dereferences address 0, which matches the "segfault at 0" in the kernel log:

rsi            0x0      0

Here is the full backtrace with local variables (bt full):
#0  __strcat_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S:298
No locals.
#1  0x00000000004efa5c in getNvidiaGpu (index=-1408930708, dev=0x7f7dac056810, allDevices=0xbdd9, errorGPU=0x0, errorCount=0, warningGPU=0x7f7dac011730, warningCnt=2) at lim.gpu.c:580
fname = 0x7d6878 "getNvidiaGpu"
modelname = "QuadroRTX8000", '\000' <repeats 242 times>
device = 0x7f7db79b3e58
memory = {total = 50962169856, free = 42197254144, used = 8764915712}
pState = NVML_PSTATE_2
utilization = {gpu = 100, memory = 49}
computeMode = NVML_COMPUTEMODE_DEFAULT
temperature = 83
vsbecc = 0
vdbecc = 0
power = 249652
i = 0
j = 0
#2  0x00000000004f074b in getGpuReportFullThreadFunc () at lim.gpu.c:858
dev = 0x7f7dac056810
fname = "getGpuReportFullThreadFunc"
dGlobal = 0x7f7dac001c70
errorGPU = 0x0
warningGPU = 0x7f7dac011730
allDevices = 0x7f7dac00a850
ret = 2886036588
ret1 = 2886036588
ver = {major = 2885721120, minor = 32637, patch = 4294967168, build = 0x11 <error: Cannot access memory at address 0x11>}
rsmi_cnt = 0
nvml_cnt = 4
majorTmp = "11\000\000\000\000\000"
compMajorV = <optimized out>
compMinorV = <optimized out>
majorVer = <optimized out>
majorV = 470
minorV = 57
errorCount = 0
warningCnt = 2
i = 0
gpu_lib = -1408931824
nvmlOpened = 1
#3  0x00000000004f11ad in collectGpuInfoThread (arg=0x7f7dac056c6d) at lim.gpu.c:949
fname = "collectGpuInfoThread"
gpuinfo = 0x7f7dac001c70
gpuinfoError = 0
sampleInterval = 5
#4  0x00007f7db92756db in start_thread (arg=0x7f7db5ec8700) at pthread_create.c:463
pd = 0x7f7db5ec8700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140177899816704, -4327163297919163674, 140177899814848, 0, 0, 10252544, 4398249031032873702, 4398224247775797990}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
#5  0x00007f7db83d771f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
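
For reference, the inspection above can be reproduced non-interactively from the core dump, roughly like this (the core file name here is a placeholder):

# Dump the full backtrace, the faulting frame's disassembly ("=>" marks the
# crashing movdqu) and the rsi register (the NULL source pointer) from the core.
gdb -batch \
    -ex 'bt full' \
    -ex 'frame 0' \
    -ex 'disassemble' \
    -ex 'info registers rsi' \
    /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lim ./core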

With all that being said, we have another server with the exact same specifications that does not have this problem. The NVIDIA driver and CUDA versions are the same, and it runs the same version of Ubuntu, 18.04.6 LTS.

The LSF installation uses a shared configuration over NFS, meaning each server accesses the same configuration files and scripts.
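
For what it's worth, that both hosts really see identical shared configuration can be sanity-checked with something like the following (the hostnames are placeholders, and I'm assuming lsf.conf sits directly under the conf directory visible in the process listings below):

# Compare the shared LSF config as seen from a healthy host and from the failing one.
for h in healthy-host failing-host; do
    echo "== $h"
    ssh "$h" md5sum /opt/ibm/lsfsuite/lsf/conf/lsf.conf
done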

The only difference I can see between the other servers and the one with the problem is in the command-line options used to start LIM:

On all the other servers:

root     53635  1.8  0.0 277728 18844 ?        S<sl Feb07 472:40 /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lim -d /opt/ibm/lsfsuite/lsf/conf/ego/rost_lsf_cluster_1/kernel
root     53639  0.0  0.0  18652  5976 ?        S<s  Feb07   0:11  \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/melim
root     53645  0.0  0.0 4681288 14400 ?       S<l  Feb07   6:26  |   \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lsfbeat -c /opt/ibm/lsfsuite/lsf/conf/lsfbeats/lsfbeat.yml
root     53640  0.0  0.0  21268  9136 ?        S    Feb07   7:56  \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/pim -d /opt/ibm/lsfsuite/lsf/conf/ego/rost_lsf_cluster_1/kernel
root     53641  0.0  0.0  39576  9604 ?        Sl   Feb07   0:42  \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/pem

On the one with the segmentation fault:

root     44902  1.8  0.0 272472 16680 ?        D<sl 12:17   0:00 /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lim
root     44919  4.4  0.0  18656  6500 ?        S<s  12:17   0:00  \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/melim
root     44924  2.2  0.0 468764 11280 ?        S<l  12:17   0:00  |   \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lsfbeat -c /opt/ibm/lsfsuite/lsf/conf/lsfbeats/lsfbeat.yml
root     44920  5.6  0.0  19276  7364 ?        S    12:17   0:00  \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/pim
root     44921  4.6  0.0  39576 10288 ?        Sl   12:17   0:00  \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/pem

I tried restarting the services with bctrld on both the master and the affected server, as well as via the lsfd.service unit, and even started the LIM daemon manually with the -d /opt/ibm/lsfsuite/lsf/conf/ego/rost_lsf_cluster_1/kernel option. All of these end in a segmentation fault.
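
Concretely, the manual start attempt looked roughly like this (sourcing profile.lsf first is an assumption about how the environment gets set up; the binary and configuration paths come from the process listings above):

# Load the LSF environment (profile.lsf location assumed from the standard layout)
. /opt/ibm/lsfsuite/lsf/conf/profile.lsf
# Start LIM by hand with the same -d option the healthy hosts use
/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lim -d /opt/ibm/lsfsuite/lsf/conf/ego/rost_lsf_cluster_1/kernel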

Does anyone have any idea what the problem is, or how to fix it? I'm going crazy here.

Thank you very much for taking the time to read this and offer your feedback!

T1mK