We have three systems running CentOS with identical hardware and software configurations, and all three are experiencing random system hangs. A hang can occur as soon as 20 minutes after boot, or it may take one to two weeks to appear. We booted an independent live Ubuntu image on the same hardware and ran stress nonstop without any problems. We suspect a driver or other software installed on our systems, but we are not sure how to determine the cause.
How should we proceed if we want to determine what is causing our systems to hang?
KERNEL: /lib/debug/lib/modules/3.10.0-1062.12.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2020-08-28-19:02:49/vmcore [PARTIAL DUMP]
CPUS: 72
DATE: Fri Aug 28 19:02:35 2020
UPTIME: 6 days, 13:03:56
LOAD AVERAGE: 7.87, 7.35, 7.45
TASKS: 5679
NODENAME: zagreb
RELEASE: 3.10.0-1062.12.1.el7.x86_64
VERSION: #1 SMP Tue Feb 4 23:02:59 UTC 2020
MACHINE: x86_64 (3000 Mhz)
MEMORY: 1023.4 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
PID: 19718
COMMAND: "9_scheduler"
TASK: ffff8a8bc9ab1070 [THREAD_INFO: ffff8a8be0618000]
CPU: 34
STATE: TASK_RUNNING (PANIC)
crash>
Here is a log of the backtrace:
crash> bt
PID: 19718 TASK: ffff8a8bc9ab1070 CPU: 34 COMMAND: "9_scheduler"
#0 [ffff8a8be061ba90] machine_kexec at ffffffff90665b34
#1 [ffff8a8be061baf0] __crash_kexec at ffffffff90722352
#2 [ffff8a8be061bbc0] crash_kexec at ffffffff90722440
#3 [ffff8a8be061bbd8] oops_end at ffffffff90d85798
#4 [ffff8a8be061bc00] no_context at ffffffff90675bb4
#5 [ffff8a8be061bc50] __bad_area_nosemaphore at ffffffff90675e82
#6 [ffff8a8be061bca0] bad_area_nosemaphore at ffffffff90675fa4
#7 [ffff8a8be061bcb0] __do_page_fault at ffffffff90d88750
#8 [ffff8a8be061bd20] do_page_fault at ffffffff90d88975
#9 [ffff8a8be061bd50] page_fault at ffffffff90d84778
[exception RIP: anon_vma_clone+117]
RIP: ffffffff908008e5 RSP: ffff8a8be061be08 RFLAGS: 00010286
RAX: ffff8a90d42e95f0 RBX: 0000000000000000 RCX: 0000000000ea39f5
RDX: 0000000000000040 RSI: 0000000000000200 RDI: ffff8a0f7fc07b00
RBP: ffff8a8be061be48 R8: 000000000001f0a0 R9: ffffffff908008d4
R10: ffff8ad35135e0c0 R11: 0000000000000000 R12: ffff8a90d42e9d18
R13: ffff8b0bea29d410 R14: ffff8a90d42e9cb0 R15: ffff8a90d42e95f0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8a8be061be50] __split_vma at ffffffff907f962e
#11 [ffff8a8be061be90] do_munmap at ffffffff907f992a
#12 [ffff8a8be061bee0] vm_munmap at ffffffff907f9cb5
#13 [ffff8a8be061bf30] sys_munmap at ffffffff907faf52
#14 [ffff8a8be061bf50] system_call_fastpath at ffffffff90d8dede
RIP: 00007f1ef3f82dd7 RSP: 00007f1e53ffebc0 RFLAGS: 00000246
RAX: 000000000000000b RBX: 0000000000040000 RCX: 00007f1ef3f6d727
RDX: 0000000000000003 RSI: 0000000000040000 RDI: 00007f1d2af40000
RBP: 0000000000922a40 R8: ffffffffffffffff R9: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 00007f1e53ffea58
R13: 00007f1d2af00000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 000000000000000b CS: 0033 SS: 002b
crash>
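Since we suspect a driver, one check we could run on each machine is to decode the kernel taint mask, which records whether proprietary or out-of-tree modules have been loaded. The following is only a minimal sketch: the bit meanings are the upstream ones for bits 0 to 12, and we are assuming they match this 3.10 el7 kernel. Vendor kernels may define additional bits, which the script reports as unknown rather than guessing. The function names are ours, not part of any tool.

```python
import os

# Upstream taint bits 0-12. ASSUMPTION: these match the 3.10.0 el7 kernel;
# vendor kernels may define further bits that are not listed here.
TAINT_FLAGS = {
    0: ("P", "proprietary module loaded"),
    1: ("F", "module force-loaded"),
    2: ("S", "SMP-unsafe or out-of-spec hardware"),
    3: ("R", "module force-unloaded"),
    4: ("M", "machine check exception occurred"),
    5: ("B", "bad page referenced"),
    6: ("U", "taint requested by userspace"),
    7: ("D", "kernel oops or BUG occurred"),
    8: ("A", "ACPI table overridden"),
    9: ("W", "kernel warning issued"),
    10: ("C", "staging driver loaded"),
    11: ("I", "firmware bug workaround applied"),
    12: ("O", "out-of-tree module loaded"),
}

TAINT_PATH = "/proc/sys/kernel/tainted"


def decode_taint(mask):
    """Return (letter, description) pairs for every known bit set in mask."""
    return [TAINT_FLAGS[bit] for bit in sorted(TAINT_FLAGS) if mask & (1 << bit)]


def unknown_bits(mask):
    """Return the positions of set bits that this table does not describe."""
    return [
        bit
        for bit in range(mask.bit_length())
        if mask & (1 << bit) and bit not in TAINT_FLAGS
    ]


def main():
    if not os.path.exists(TAINT_PATH):
        print("taint file not found; is this a Linux system?")
        return
    with open(TAINT_PATH) as f:
        mask = int(f.read().strip())
    print("taint mask:", mask)
    for letter, description in decode_taint(mask):
        print(" ", letter, description)
    for bit in unknown_bits(mask):
        print("  bit", bit, "is set but not described by this table")


if __name__ == "__main__":
    main()
```

If the P or O flag is set on the CentOS installs, comparing the loaded module list against what the live Ubuntu image loads might narrow down which driver to look at first.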