0

I am unfortunately having a problem with my HP ProLiant DL380e G8 server running Fedora Server 34. I suspect these are memory errors or a DIMM being/going bad, however I'm not sure.

Feedback is very welcome!

I've ran journalctl -r, which returns the following output in the PasteBin link (a snippet of what looks out of the ordinary): https://pastebin.com/KPUZHceD

All help and ideas are appreciated!

Kind regards

Edit: In response to the comment of @Michael Hampton: The output posted here:

<27>Sep  7 17:03:51 mcelog: Location: SOCKET:0 CHANNEL:3 DIMM:1 []
Sep 07 17:03:51 turbo mcelog[1304]: Location: SOCKET:0 CHANNEL:3 DIMM:1 []
Sep 07 17:03:51 turbo mcelog[1303]: <27>Sep  7 17:03:51 mcelog: corrected DIMM memory error count exceeded threshold: 10 in 24h
Sep 07 17:03:51 turbo mcelog[1303]: corrected DIMM memory error count exceeded threshold: 10 in 24h
Sep 07 17:03:51 turbo mcelog[1304]: <27>Sep  7 17:03:51 mcelog: Location: SOCKET:0 CHANNEL:3 DIMM:1 []
Sep 07 17:03:51 turbo mcelog[1304]: Location: SOCKET:0 CHANNEL:3 DIMM:1 []
Sep 07 17:03:51 turbo mcelog[1303]: <27>Sep  7 17:03:51 mcelog: corrected DIMM memory error count exceeded threshold: 10 in 24h
Sep 07 17:03:51 turbo mcelog[1303]: corrected DIMM memory error count exceeded threshold: 10 in 24h
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 2 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c80000c400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d22131295c834800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 1 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 7
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 3 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c80000c400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d22131295c834800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 13 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 6
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 0 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c80000c400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d22131295c834800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 0 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 5
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: Running trigger `dimm-error-trigger' (reporter: memdb)
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 6 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c80000c400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d22131295c834800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 3 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 4
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID a SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c801c00400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d2213fa689118800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 5 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 3
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 5 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c801bd8400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d2213f0649118800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 14 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 2
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 1 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c801bec400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d221196e09118800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 12 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 1
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 0 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c0107b4000010093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c0107b4000010093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: CPU 0 BANK 5
Sep 07 17:03:51 turbo mcelog[1067]: MCE 0
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: mcelog: mcelog read: Input/output error
Sep 07 17:03:51 turbo kernel: ERST: [Firmware Warn]: Firmware does not respond in time.
Sep 07 17:03:51 turbo kernel: mce: [Hardware Error]: Machine check events logged
Sep 07 17:03:51 turbo kernel: mce: [Hardware Error]: Machine check events logged
Sep 07 17:03:51 turbo kernel: mce_notify_irq: 6 callbacks suppressed
jonasclaes
  • 101
  • 7
  • 1
    I can't see whatever you put on pastebin. Maybe their web site is acting up. In any case, is it really too long to be posted here? We prefer that everything relevant to the question be posted in the question whenever possible. – Michael Hampton Sep 07 '21 at 15:38
  • That's not a supported confuguration. – Chopper3 Sep 07 '21 at 15:47
  • @Chopper3 can you explain to me why? – jonasclaes Sep 07 '21 at 16:03
  • 3
    I think the first thing I would do is remove the faulty memory. The log does clearly identify it. – Michael Hampton Sep 07 '21 at 16:13
  • 1
    `Location: SOCKET:0 CHANNEL:3 DIMM:1` remove this ram never seen such a clear defective and @chopper3 please explain I don't see an issue on this question even if the server is quite old – djdomi Sep 07 '21 at 17:03
  • 1
    @jonasclaes because all servers have a list of supported operating systems, and the supported version, and you have an unsupported configuration (https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiN8NmwrO3yAhUMgVwKHTRiDxoQFnoECAsQAQ&url=https%3A%2F%2Fh20195.www2.hpe.com%2Fv2%2Fgetdocument.aspx%3Fdocname%3Dc04128166&usg=AOvVaw2CoUn_gMG6ueVHh8DAmkZE - page 8) – Chopper3 Sep 07 '21 at 17:05
  • 1
    @Chopper3 I see. However RHEL is enterprise and Fedora is community driven. But thanks for pointing that out. – jonasclaes Sep 14 '21 at 18:32

1 Answers1

0

This post has been fixed by removing 2 faulty RAM sticks from the server and reseating the CPU, since that was not making good contact either.

Thanks for all the help!

jonasclaes
  • 101
  • 7