Finding the source of a (memory read) hardware error

Question

When logging into my server, I'm seeing lots of these errors:

Message from syslogd@****** at May 31 20:06:59 ...
 kernel:[500570.908383] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1622484419 SOCKET 0 APIC 0 microcode 71a

Message from syslogd@****** at May 31 20:10:11 ...
 kernel:[500762.908155] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: c01d8a8000010091

Message from syslogd@****** at May 31 20:10:11 ...
 kernel:[500762.908278] mce: [Hardware Error]: TSC 0 

Message from syslogd@****** at May 31 20:10:11 ...
 kernel:[500762.908299] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1622484611 SOCKET 0 APIC 0 microcode 71a

Message from syslogd@****** at May 31 20:11:10 ...
 kernel:[500821.884806] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: c01ec00000010091

Message from syslogd@****** at May 31 20:11:10 ...
 kernel:[500821.885130] mce: [Hardware Error]: TSC 0

And the syslog shows some memory read errors:

May 31 20:35:18 ****** kernel: [502269.884160] EDAC sbridge MC0: MISC 20403aba86 
May 31 20:35:18 ****** kernel: [502269.884166] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1622486118 SOCKET 0 APIC 0
May 31 20:35:18 ****** kernel: [502269.884228] EDAC MC0: 16682 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x170c7a offset:0xa00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:1)
May 31 20:35:19 ****** kernel: [502270.908292] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 31 20:35:19 ****** kernel: [502270.908349] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: cc12b44000010091
May 31 20:35:19 ****** kernel: [502270.908356] EDAC sbridge MC0: TSC 0 
May 31 20:35:19 ****** kernel: [502270.908359] EDAC sbridge MC0: ADDR 3ef245d00 
May 31 20:35:19 ****** kernel: [502270.908363] EDAC sbridge MC0: MISC 20404c4c86 
May 31 20:35:19 ****** kernel: [502270.908366] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1622486119 SOCKET 0 APIC 0
May 31 20:35:19 ****** kernel: [502270.908567] EDAC MC0: 19153 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x3ef245 offset:0xd00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:4)

It seems I could have a faulty RAM module, but memtest86 shows everything OK. Could this be my CPU's fault?

Does your server have ECC RAM modules? If so, check the BIOS event log for memory errors. If you don't have ECC memory, memtest may or may not find errors. Memory errors are often hard to reproduce. – Robert, May 31 '21 at 19:05

score 4 · Answer 1 · answered May 31 '21 at 19:32

but memtest86 shows everything OK. Could this be my CPU's fault?

Yes, but here is what is more likely: You have ECC memory and it works.

Basically, ECC corrects single-bit errors transparently. It signals each correction, and the OS is smart enough to intercept and log that signal.

Memtest is too primitive for this and does not intercept the notification. All it sees is that the test passes, because ECC fixes the errors.
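To see that the logged events really are corrected errors, the status word after "Bank 5:" can be decoded by hand. The following is a minimal sketch, assuming the IA32_MCi_STATUS bit layout from the Intel SDM (bit 61 = uncorrected, bits 52:38 = corrected error count, bits 15:0 = MCA error code). The function name `decode_mci_status` is made up for illustration, and the count field is only meaningful on CPUs that report corrected error counts.

```python
def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into a few of its fields."""
    code = status & 0xFFFF  # bits 15:0, MCA error code
    fields = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER: events were lost
        "uncorrected": bool((status >> 61) & 1),  # UC: False means ECC fixed it
        "addr_valid": bool((status >> 58) & 1),   # ADDRV: an ADDR value was logged
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_error_code": code,
    }
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC,
    # where MMM is the transaction type and CCCC is the channel.
    if (code & 0xFF80) == 0x0080:
        kinds = {0: "generic", 1: "read", 2: "write",
                 3: "address/command", 4: "scrub"}
        fields["memory_op"] = kinds.get((code >> 4) & 0x7, "reserved")
        fields["channel"] = code & 0xF  # 15 means "channel not specified"
    return fields


# Two status words copied from the log output in the question.
for word in (0xC01D8A8000010091, 0xCC12B44000010091):
    print(hex(word), decode_mci_status(word))
```

For `cc12b44000010091` this gives `uncorrected` False, a memory read on channel 1, and a corrected count of 19153, which is the same number the EDAC line prints as "19153 CE".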

TomTom

Yes, I have REG ECC. Should I be worried about these errors? Is there a way to prevent them from bloating my command line randomly? – dvilela May 31 '21 at 19:36
Yes. Generally these errors should be VERY few at most, ideally zero. They indicate some issue. At a minimum I would take the server offline, remove all RAM, clean everything properly with compressed air, and reseat the RAM. If the errors are not few and far between, it is a sign that the system is not working properly. You do not rely on your safety net saving your ass; that is like having a RAID array with one failed disc and keeping it that way. – TomTom May 31 '21 at 19:38
Yes, I already cleaned the modules and reseated them, but I still see 1-2 errors per second. I was thinking about buying 2 new modules (it seems all errors come from the same channel), but I wanted to rule out the CPU or other faulty hardware first. – dvilela May 31 '21 at 19:46
IIRC there is little chance this is the CPU. And you can rule out the mobo by swapping the memory around and seeing whether the defect indications move. If yes -> RAM. If no -> mobo. – TomTom May 31 '21 at 19:53
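To compare the picture before and after such a swap, the EDAC lines in the syslog can be tallied per DIMM label and per channel. This is only a sketch: the regular expression is fitted to the two "CE memory read error" lines shown in the question, other kernels or EDAC drivers may format the line differently, and the names `CE_LINE` and `tally_ce_events` are made up for illustration. It counts log lines rather than summing the number before "CE", since that number appears to be a running counter taken from the status register.

```python
import re
from collections import Counter

# Matches lines such as:
#   EDAC MC0: 16682 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 ...
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def tally_ce_events(lines):
    """Count corrected-error log lines per DIMM label and per (MC, channel)."""
    per_dimm = Counter()
    per_channel = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match is None:
            continue  # not an EDAC corrected-error summary line
        per_dimm[match["label"]] += 1
        per_channel[(int(match["mc"]), int(match["channel"]))] += 1
    return per_dimm, per_channel
```

Usage would be along the lines of `tally_ce_events(open("/var/log/syslog"))`, where the log path is an assumption that depends on the distribution. If the per-DIMM counts follow a module to its new slot, that points at the RAM. If they stay on the same channel, that points at the board or the CPU.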
I swapped the RAM modules across channels, and after a day without errors I'm starting to see those messages again, from the same channel. So, mobo or CPU? How can I find out? – dvilela Jun 04 '21 at 10:18
Just replace one, then the next. I mean, seriously, have 2 spare servers for this reason. Double redundancy. – TomTom Jun 04 '21 at 12:01
