
I recently acquired a used Dell R320 with a Xeon E5-2450 v1. All firmware has been updated to the most recent versions using the Lifecycle Controller. On boot, dmesg reports:

microcode: microcode updated early to revision 0x71a, date = 2020-03-24
[   12.384040] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
[   12.395572] clocksource:                       'hpet' wd_now: 3b1bb82 wd_last: 2e247ff mask: ffffffff
[   12.413476] clocksource:                       'tsc' cs_now: 1c62267fd4b cs_last: 1c30b8dcf7f mask: ffffffffffffffff
[   12.425567] tsc: Marking TSC unstable due to clocksource watchdog
[   12.431666] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
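
To get a feel for how large the reported skew is, the counter values in the dmesg lines above can be turned into elapsed time. This is only a rough sketch: the HPET rate (14.318180 MHz, a common value on Intel chipsets) and the TSC rate (2.1 GHz, the nominal base clock of the E5-2450) are assumptions, not values taken from the log, and the kernel uses its own calibrated conversion factors.

```python
# Counter masks copied from the dmesg lines above.
HPET_MASK = 0xFFFFFFFF
TSC_MASK = 0xFFFFFFFFFFFFFFFF

# Assumptions, NOT taken from the log:
HPET_HZ = 14_318_180      # common HPET rate on Intel chipsets
TSC_HZ = 2_100_000_000    # nominal base clock of the E5-2450


def masked_delta(now: int, last: int, mask: int) -> int:
    """Ticks elapsed between two counter reads, tolerating one wraparound."""
    return (now - last) & mask


# wd_now / wd_last and cs_now / cs_last from the log.
hpet_ticks = masked_delta(0x3B1BB82, 0x2E247FF, HPET_MASK)
tsc_ticks = masked_delta(0x1C62267FD4B, 0x1C30B8DCF7F, TSC_MASK)

hpet_seconds = hpet_ticks / HPET_HZ
tsc_seconds = tsc_ticks / TSC_HZ

print(f"hpet: {hpet_ticks} ticks = {hpet_seconds:.3f} s")
print(f"tsc:  {tsc_ticks} ticks = {tsc_seconds:.3f} s")
print(f"tsc/hpet interval ratio: {tsc_seconds / hpet_seconds:.2f}")
```

Under those assumed frequencies, the interval measured by the TSC comes out several times longer than the one measured by the HPET, which is far beyond ordinary clock drift.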

Then, if I run `phoronix-test-suite stress-run stress-ng`, the system becomes unresponsive after approximately one minute.

During the test I see watchdog events from the network adapter:

[  705.412997] NETDEV WATCHDOG: eno1 (tg3): transmit queue 0 timed out
[  705.412997] WARNING: CPU: 9 PID: 6812 at net/sched/sch_generic.c:473 dev_watchdog+0x27d/0x281
[  705.412997] Modules linked in: xt_CHECKSUM ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set tun rfkill scsi_transport_iscsi ip_set xt_conntrack xt_multiport xt_nat xt_addrtype xt_mark xt_MASQUERADE nft_counter xt_comment nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth sunrpc iTCO_wdt intel_rapl_msr iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel vfat fat kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel drm_vram_helper aesni_intel ttm crypto_simd cryptd glue_helper drm_kms_helper pcspkr drm syscopyarea sysfillrect sysimgblt fb_sys_fops lpc_ich i2c_algo_bit zfs(POE) joydev zunicode(POE) zzstd(OE) zlua(OE) mei_me zavl(POE) mei icp(POE) zcommon(POE) znvpair(POE) ipmi_ssif spl(OE) ioatdma dca ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter
[  705.412997]  sch_fq_codel ip_tables xfs libcrc32c sd_mod sg ahci libahci libata mpt3sas tg3 raid_class scsi_transport_sas wmi fuse
[  705.412997] CPU: 9 PID: 6812 Comm: stress-ng Kdump: loaded Tainted: P           OE     5.4.17-2136.300.7.el8uek.x86_64 #2
[  705.412997] Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
[  705.412997] RIP: 0010:dev_watchdog+0x27d/0x281
[  705.412997] Code: 48 85 c0 75 e6 eb a0 4c 89 e7 c6 05 9b 59 17 01 01 e8 c7 a9 fa ff 89 d9 4c 89 e6 48 c7 c7 68 3b 53 ac 48 89 c2 e8 be f1 82 ff <0f> 0b eb 82 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66
[  705.412997] RSP: 0000:ffffac6d003d0e50 EFLAGS: 00010282
[  705.412997] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[  705.412997] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9e853f457d00
[  705.412997] RBP: ffffac6d003d0e80 R08: 0000000000000514 R09: 00000000ffffffff
[  705.412997] R10: 0000000000000000 R11: ffff9e851d84f3d0 R12: ffff9e850d8e4000
[  705.412997] R13: 0000000000000005 R14: ffff9e850d8e4480 R15: ffff9e8537d377c0
[  705.412997] FS:  00007fa4baba5740(0000) GS:ffff9e853f440000(0000) knlGS:0000000000000000
[  705.412997] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  705.412997] CR2: 00007f54983fad0c CR3: 0000000b99992006 CR4: 00000000000606e0
[  705.412997] Call Trace:
[  705.412997]  <IRQ>
[  705.412997]  ? pfifo_fast_enqueue+0x160/0x151
[  705.412997]  call_timer_fn+0x32/0x12c
[  705.412997]  run_timer_softirq+0x1a5/0x42e
[  705.412997]  __do_softirq+0xe1/0x2e7
[  705.412997]  ? hrtimer_interrupt+0x12a/0x222
[  705.412997]  irq_exit+0xf3/0xf8
[  705.412997]  smp_apic_timer_interrupt+0x79/0x130
[  705.412997]  apic_timer_interrupt+0xf/0x14
[  705.412997]  </IRQ>

If I add `mitigations=off` to the kernel command-line parameters at boot, phoronix lasts 4 to 7 minutes before the system becomes unresponsive again. The same thing happens with KVM guests: I tried to install Debian 11 five times, and each time the install froze during either the initial package install or the kernel unpack.
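
When experimenting with boot parameters like this, it can help to confirm that they actually reached the running kernel. The following is a minimal sketch that parses `/proc/cmdline` and prints the clocksource currently in use. The helper names `parse_cmdline` and `read_text` are made up for illustration, and `shlex` only approximates the kernel's own quoting rules.

```python
import shlex


def parse_cmdline(text: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(text):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_text(path: str):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    params = parse_cmdline(read_text("/proc/cmdline") or "")
    print("mitigations =", params.get("mitigations"))
    print("tsc         =", params.get("tsc"))
    print("clocksource in use:",
          read_text("/sys/devices/system/clocksource/clocksource0/current_clocksource"))
```

A value of `None` for `mitigations` means the parameter was not passed at all, which usually points at an edit to the bootloader configuration that was not applied.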

Screenshot of the freeze messages: https://ibb.co/k2Jk4QG

Has anyone had similar issues? Thanks!

P.S.: The current kernel is 5.4.17-2136.300.7.el8uek.x86_64. I also tried 4.18.0-305.19.1.el8_4.x86_64, without any difference.

valc
  • Did you add the Intel microcode package as well? – John Greene Nov 14 '21 at 10:22
  • Yes, I did. Moreover, I checked with all previous microcode versions found at the [win-raid forum](https://www.win-raid.com/t5709f47-OFFER-Intel-CPU-Microcode-Archives.html). By the way, I have now switched to Debian 11 and the system has become a little more stable; the phoronix test can still crash the system, but only after 15 minutes... I ordered a Xeon E5-2470v2 and hope it will resolve the issue. I'll add results later. – valc Nov 15 '21 at 05:32
  • I see a spinlock issue at the scheduler level during interrupt state. Is the crash point consistent between failed attempts? – John Greene Nov 16 '21 at 14:00
  • Also, I noticed a sysvec_acpi in the crash output, and the Dell BIOS is circa 2015, so I would try removing some ACPI functionality on the kernel command line. – John Greene Nov 16 '21 at 14:11
  • Thank you for replying. Yes, the crash point was consistent between tests. Which ACPI tables would you recommend dropping? – valc Nov 17 '21 at 05:05
  • Can you run `dmidecode` to get the mobo’s BIOS version and check whether the Dell mobo firmware is the latest? – John Greene Nov 17 '21 at 16:09
  • Historically, memtest would uncover any strange bitflips, and that is my current thinking. I would do the following: boot up an older CD distro and see how that goes. If it fails, then it’s a hardware issue. At any rate, the first HW swap-out would be reducing the memory DIMMs: depopulate to the bare minimum and try again. If it fails, swap DIMMs until it passes. – John Greene Nov 17 '21 at 16:18
  • Hi, please find the [dmidecode](https://gist.github.com/ValentinChirikov/f5c3d3fc2cee63c240dcddda4cc50d6a#file-gistfile1-txt) output here. – valc Nov 18 '21 at 10:20
  • Currently I am waiting for the parcel with the E5-2470v2. I'll certainly run memtest before the CPU swap and will post the results here, thanks! – valc Nov 18 '21 at 10:22
  • I still think you should depopulate the memory chips and get a passing result before the CPU swap. – John Greene Nov 18 '21 at 10:57
  • I finally received the E5-2470v2 and swapped the CPU, and all the problems are gone: no freezes, no problem with the TSC, and phoronix stress-run stress-ng passes without issues. Thanks for your commitment, I am closing the problem. – valc Nov 25 '21 at 19:33
  • Congratulations! You are the second person I know of with a Xeon CPU issue. Sounds like a popped capacitor inside the CPU die. – John Greene Nov 26 '21 at 11:54
  • Thanks! Initially the reason for switching CPUs was performance, but in fact it seems the CPU die really was damaged. – valc Nov 27 '21 at 12:39

1 Answer


Switching the CPU to an E5-2470v2 solved the problem; it seems the previous CPU was somehow broken.

valc