
I woke up this morning to a first for me: one of my systems had logged DRAM ECC error notifications. Three of them, in fact, all for what is, as far as I can tell, the exact same memory location (obviously, the system isn't actually named localhost):

Aug 31 05:00:46 localhost kernel: [719099.816034] [Hardware Error]: CPU:0   MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c6c40006b080a13
Aug 31 05:00:46 localhost kernel: [719099.816046] [Hardware Error]:         MC4_ADDR: 0x0000000641f49d20
Aug 31 05:00:46 localhost kernel: [719099.816051] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
Aug 31 05:00:46 localhost kernel: [719099.816059] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20
Aug 31 05:00:46 localhost kernel: [719099.816070] EDAC MC0: CE page 0x641f49, offset 0xd20, grain 0, syndrome 0x6bd8, row 2, channel 0, label "": amd64_edac
Aug 31 05:00:46 localhost kernel: [719099.816075] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

The above was followed by an identical notification at system time 05:10:46 (719699.8160) and then one more at 05:20:46 (720299.8160), which also had Over set on the CPU:0 MC4_STATUS line (status 0xdc6c40006b080813). The system has been stable since, with no further errors logged. System activity is normal, and the system in question has been running with ECC RAM since 2014 without ever logging an ECC error before.
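The difference between the first and third status words can be checked by decoding their top bits. This is a minimal sketch in Python, assuming the architectural x86 machine-check status layout (VAL in bit 63, OVER in bit 62, UC in bit 61, EN in bit 60, MISCV in bit 59, ADDRV in bit 58, PCC in bit 57); the function names are made up for illustration and the vendor-specific lower bits are not decoded.

```python
# Top bits of a 64-bit x86 machine-check status register.
# Assumption: the architectural MCA layout applies to MC4_STATUS here.
MCA_STATUS_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # another error arrived before this one was cleared
    "UC": 61,     # uncorrected error (clear means corrected, "CE")
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_status(status):
    """Return a dict of flag name -> bool for the top status bits."""
    return {name: bool((status >> bit) & 1) for name, bit in MCA_STATUS_BITS.items()}


def flags_changed(a, b):
    """Return the sorted names of the flags that differ between two status words."""
    da, db = decode_status(a), decode_status(b)
    return sorted(name for name in da if da[name] != db[name])


first = 0x9C6C40006B080A13  # status logged at 05:00:46
third = 0xDC6C40006B080813  # status logged at 05:20:46

print(decode_status(first))
print(flags_changed(first, third))
```

Among the flags decoded here, only OVER differs between the two words, and UC is clear in both, which is consistent with the errors being reported as corrected.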

I wouldn't be too worried about a single correctable ECC error. The almost exactly ten minutes (down to a few microseconds, in fact) between the errors being logged could simply be due to RAM scrubbing happening every ten minutes; unfortunately, on this particular system, the scrub interval is not exposed as a setting. However, the three consecutive errors in the same memory location (same value for CE ERROR_ADDRESS) do have me a little bit concerned.

Update: The host in question has logged several more since I originally posted this question, all with the same value for CE ERROR_ADDRESS.
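To check whether repeated notifications really share an address and a fixed spacing, the EDAC lines can be grouped by CE ERROR_ADDRESS. This is a rough sketch in Python, not a polished tool. The first sample line is copied from the log above; the second and third are reconstructions that use the truncated timestamps quoted in the text, not verbatim log output. The helper names and the 4 KiB page size are assumptions.

```python
import re
from collections import defaultdict

# Matches kernel log lines of the form "... [uptime] EDAC amd64 MCn: CE ERROR_ADDRESS= 0x...".
CE_LINE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+EDAC amd64 MC\d+: "
    r"CE ERROR_ADDRESS=\s*(?P<addr>0x[0-9a-fA-F]+)"
)


def group_ce_events(lines):
    """Map each reported error address to its list of uptime stamps (seconds)."""
    events = defaultdict(list)
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            events[int(match.group("addr"), 16)].append(float(match.group("uptime")))
    return dict(events)


def intervals(timestamps):
    """Seconds between consecutive events, rounded to milliseconds."""
    ordered = sorted(timestamps)
    return [round(b - a, 3) for a, b in zip(ordered, ordered[1:])]


def page_and_offset(addr, page_shift=12):
    """Split a physical address into page number and offset (assumes 4 KiB pages)."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


sample = [
    # Verbatim from the log above.
    "Aug 31 05:00:46 localhost kernel: [719099.816059] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20",
    # Reconstructed from the timestamps quoted in the text (hypothetical lines).
    "Aug 31 05:10:46 localhost kernel: [719699.8160] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20",
    "Aug 31 05:20:46 localhost kernel: [720299.8160] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20",
]

events = group_ce_events(sample)
for addr, stamps in events.items():
    page, offset = page_and_offset(addr)
    print(hex(addr), hex(page), hex(offset), intervals(stamps))
```

The page and offset computed this way match the `page 0x641f49, offset 0xd20` fields that EDAC prints for the same address.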

How seriously should I take this? What's a good response: order replacement RAM right away and schedule to install it ASAP, treat this as just a momentary glitch, or be on my toes to replace the RAM if it happens again but take no specific action right now?

user

3 Answers


ECC RAM tends to be used on critical servers. The system is reporting a hardware failure. If it's not a critical system and you don't mind everything going through it potentially being corrupted, sure, wait and see what happens. But if you care about your data more than the cost of the RAM, replace the faulty RAM ASAP.

Tim
  • In all fairness, the system is reporting a *corrected* hardware failure. I'm not saying that these should be ignored; I'm asking how urgent they are. Obviously, since more such errors have been logged since I originally asked the question, that does increase the urgency somewhat, but I am still curious what to make of the fact that the errors seem to consistently occur at the exact same memory address. – user Sep 04 '17 at 07:47
  • If an address is consistently reported, I have to assume there's a fault at that address. ECC can correct a single-bit error. If this error gets worse, it could become uncorrectable at any time. – Tim Sep 04 '17 at 08:44

I'd suggest running memtest86+:

http://www.memtest.org

It's also included in some distributions as a standard package.

It may confirm your suspicion of a faulty memory module.

Jaroslav Kucera
  • I already know that there is an (as yet correctable) problem with the RAM -- the system is flat out saying so in the logs, repeatedly even -- so I don't really need to run a testing tool (memtest86+ or otherwise) to confirm that. Doing so might be a good idea if the system is behaving *erratically*, but the whole point of having ECC RAM in the first place is to ensure that the system can gracefully handle memory errors. – user Sep 04 '17 at 07:49
  • Is it an HP/HPE system? In that case hpasmcli -s "show dimm" may also tell you which module is faulty. Generally, when the error happens at the same address repeatedly, I'd consider the module close to failing. If the errors were at different memory locations and less frequent, that could simply happen, especially with large amounts of RAM. However, this doesn't seem to be the case. – Jaroslav Kucera Sep 04 '17 at 14:52
  • From your output I guess module 2 of channel 0 of the first socket is the culprit. Check your RAM layout in the motherboard documentation. – Jaroslav Kucera Sep 04 '17 at 15:03

I woke up this morning to what's a first for me; one of my systems had logged DRAM ECC error notifications. Three of them, in fact, for ... I wouldn't be too worried about a single correctable ECC error. The almost exactly ten minutes (down to a few microseconds, in fact) in between the errors being logged could be simply for RAM scrubbing happening every ten minutes; unfortunately, on this particular system, the scrub interval is not exposed as a setting.

Wikipedia's webpage on Memory Scrubbing says:

"Over 8% of DIMM modules experience at least one correctable error per year. This can be a problem for DRAM and SRAM based memories. The probability of a soft error at any individual memory bit is very small."

"In order to not disturb regular memory requests from the CPU and thus prevent decreasing performance, scrubbing is usually only done during idle periods. As the scrubbing consists of normal read and write operations, it may increase power consumption for the memory compared to non-scrubbing operation. Therefore, scrubbing is not performed continuously but periodically. For many servers, the scrub period can be configured in the BIOS setup program."

That webpage contains a link to the SuperMicro X9SRA motherboard manual which explains the scrubbing interval:

"Patrol Scrub
Patrol Scrubbing is a process that allows the CPU to correct correctable memory errors detected on a memory module and send the correction to the requestor (the original source). When this item is set to Enabled, the North Bridge will read and write back one cache line every 16K cycles, if there is no delay caused by internal processing. By using this method, roughly 64 GB of memory behind the North Bridge will be scrubbed every day. The options are Enabled and Disabled.".

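Thus, scrubbing is unlikely to be the cause: at a rate like the one quoted, any given address would be revisited hours apart, not every ten minutes. It's possible that there is a faulty bit. While a fault might occur suddenly, it seems odd that it goes away and comes back, especially when it occurs so frequently.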
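To put rough numbers on the quoted patrol-scrub rate, here is a back-of-the-envelope sketch in Python. The 64 GB/day figure comes from the X9SRA manual quoted above, which describes a different board than the one in the question; the memory sizes are hypothetical, since the question does not state how much RAM is installed.

```python
SCRUB_RATE_GB_PER_DAY = 64  # figure quoted from the X9SRA manual above


def hours_between_visits(memory_gb, rate_gb_per_day=SCRUB_RATE_GB_PER_DAY):
    """Hours a patrol scrubber needs to cycle once through memory_gb of RAM."""
    return memory_gb / rate_gb_per_day * 24


# Hypothetical installed sizes; the real size is not given in the question.
for size_gb in (16, 32, 64):
    print(size_gb, "GB ->", hours_between_visits(size_gb), "hours per full pass")
```

Even for the smallest size in this sketch, a full pass takes hours, so a scrubber running at this rate would not touch the same address every ten minutes.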

"How seriously should I take this? What's a good response; order replacement RAM right away and schedule to install it ASAP, treat this as just a momentary glitch, or be on toes to replace RAM if it happens again but no specific action right now?"

Pavel Machek, who invented the nohammer kernel module, says:

"It is fairly hard to do rowhammer by accident, so if you are hitting it, someone is probably doing it on purpose. ... Well, there's more than three orders of magnitude difference between cosmic rays and rowhammer. IIRC cosmic rays are expected to cause 2 bit flips a year... rowhammer can do bitflip in 10 minutes, and that is old version, not one of the optimized ones.".

You can exchange the RAM modules and see if the error report follows the chip, sticks with the memory location, or occurs elsewhere.
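One way to see where the errors land before and after swapping modules is to compare the per-row, per-channel corrected-error counters that the Linux EDAC driver exposes. This is a hedged sketch in Python: it assumes the legacy EDAC sysfs layout (`/sys/devices/system/edac/mc/mcN/csrowM/chK_ce_count`), which may not be present on every kernel, and the helper names are made up for illustration.

```python
from pathlib import Path

# Assumption: legacy EDAC sysfs layout; adjust if your kernel exposes a different tree.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Return {(controller, csrow, channel): corrected-error count}."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or a different layout
    for counter in sorted(root.glob("mc*/csrow*/ch*_ce_count")):
        csrow = counter.parent.name          # e.g. "csrow2"
        controller = counter.parent.parent.name  # e.g. "mc0"
        channel = counter.name.split("_")[0]     # e.g. "ch0"
        counts[(controller, csrow, channel)] = int(counter.read_text().strip())
    return counts


def new_errors(before, after):
    """Return only the locations whose count changed, with the difference."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] != before.get(key, 0)
    }


print(read_ce_counts())
```

Taking one snapshot before the swap and another some time after it, then passing both to `new_errors`, shows whether the counts keep rising at the same row and channel or move with the module.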

HPE recommends (for a faulty memory module):

"SYMPTOM: The below error message is found in the OS logs:

host1 kernel: Northbridge Error (node X): DRAM ECC error detected on the NB.

FIX:
1. Identify the Memory module number that has failed (if mentioned in the error)
2. Check IML for Error relating to Memory module. Ex Proc x slot x
3. Update System BIOS
4. If no errors are found run diagnostics and replace the memory module (5-6 loops of Memory Diagnostics to isolate the memory module)"

Suggested course of action:

  • Switching the RAM between its sockets will tell you if it's a specific RAM module or if the fault is in other circuitry.

  • As long as you don't get more than one bit error every few days, there's no need to panic or rush.

  • If you're getting hit every 10 minutes you might be getting hammered.

See also: "Defending against RowHammer in the kernel" and "ECCploit: ECC Memory Vulnerable to Rowhammer Attacks After All". For ARM processors there's: "Android GuardION patches to mitigate DMA-based Rowhammer attacks on ARM".

Rob