Evaluating uncorrectable ECC errors and fallback methods

Question

I run a server that has just experienced an error I have not encountered before. It emitted a few beeps, rebooted, and got stuck at the startup screen (the part where the BIOS shows its logo and begins listing information) with the error:

Node0: DRAM uncorrectable ECC Error

Node1: HT Link SYNC Error

After a hard reset the system booted fine, and edac-util has yet to report anything.

My research tells me that even with ECC memory and a system in ideal conditions, an uncorrectable error is still possible and will likely occur at some point during the lifespan of the system; some reports suggest at least once a year.

The server runs CentOS 6.5 with several ECC modules. I am already trying to diagnose which module threw the error so I can assess whether this is a hardware fault or the result of something unavoidable, such as a cosmic ray.

My research also suggests that when the system halts like this, there is nowhere for a log to be written, and that the only reliable way to capture the error is to attach the system to another machine and write the log out through a serial port.

Besides the usual edac-util, memtest, stress testing, and precautionary replacement, is there anything else I should take into consideration when addressing this error?

I was unable to find any record of this crash in any of the CentOS logs I searched, which fits my belief that it is not possible to log this error to a local disk. The error was only reported to me by the BIOS after an automatic reboot. Is it advisable to write system logs out to serial at all times to capture these kinds of errors?

Is this kind of failure avoidable using a single system or is this only possible using an expensive enterprise solution?

What can I do to provide fallback measures in these failure cases for a single production server? That is, the production server itself does not span multiple machines, but a fallback server can exist.

Supermicro H8SGL-F motherboard, Opteron 6376, 64GB (16x4) Hitachi, 32GB (16x2) Viking. The server has been stable for over two years and has only now started having problems. The system locked up again today with a blank screen. I'm still trying to diagnose which stick it is, as I don't have another one of these boards or another set of ram to swap in. — Zhro, Aug 26 '14 at 12:50
@Zhro, did you fix this issue? I also recently started having this issue. I was running Windows 7 with no problems. I switched to Ubuntu 14.04 and started getting the same error you got. And when I run Memtest, it passes 100%. — Jackson Hart, Sep 23 '14 at 15:02
I haven't been able to come up with a solution since I have only the one production system. I'm currently building a replacement (new mb/cpu/ram/psu) to expand capacity with a second processor; once I have it completed I will be performing much more rigorous testing on the old system and will report my findings in this thread. — Zhro, Sep 23 '14 at 19:11
Jackson, ssh in and leave "watch edac-util -v" running to see if it reports any soft ecc errors at runtime. Also see my answer below which addressed the out-of-spec layout I was using. Also, please share any of your own findings as you encounter them. — Zhro, Sep 23 '14 at 19:43

score 1 · Answer 1 · edited Apr 13 '17 at 12:14

Well, this isn't a fully-integrated system like an HP, Dell or IBM server, so the monitoring and reporting of such a failure isn't going to be present or consistent.

With the systems I've managed, disks fail the most often, followed by RAM, power supplies, fans, system boards and CPUs.

Memory can fail... There isn't much you can do about it.

See: Is it necessary to burn-in RAM for server-class hardware?

Since you can't really prevent ECC errors and RAM failure, just be prepared for it. Keep spares. Have physical access to your systems and maintain the warranty of your components. I definitely wouldn't introduce "precautionary replacement" into an environment. Some of this is a function of your hardware... Do you have IPMI? Sometimes hardware logs will end up there.

This is one of the value-adds of better server hardware. Here's a snippet from an HP ProLiant DL580 G4 server where the ECC threshold on the RAM was exceeded, then progressed to the DIMM being disabled... then finally the server crashing (ASR) and rebooting itself with the bad DIMM deactivated.

0004 Repaired       22:21  12/01/2008 22:21  12/01/2008 0001
LOG: Corrected Memory Error threshold exceeded (Slot 1, Memory Module 1)

0005 Repaired       20:41  12/06/2008 20:43  12/06/2008 0002
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.

0006 Repaired       21:37  12/06/2008 21:41  12/06/2008 0002
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.

0007 Repaired       02:58  12/07/2008 02:58  12/07/2008 0001
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.

0008 Repaired       19:31  12/08/2009 19:31  12/08/2009 0001
LOG: ASR Detected by System ROM

score 1 · Answer 2 · answered Aug 27 '14 at 18:56

If the DIMM has an uncorrectable error, I'd recommend replacing it. If it only produces correctable errors at a low rate, you can probably live with it; in any case, it will be harder to get a refund for correctable errors.

If you want to see whether there is a record, check the IPMI SEL records with ipmitool sel elist or an equivalent tool.
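
As a rough illustration of that SEL check, here is a minimal Python sketch that pulls the memory-related entries out of the ipmitool sel elist output. The function names and the keyword list are my own assumptions, not anything ipmitool defines, and the wording of SEL entries varies between BMCs, so treat the filter as a starting point.

```python
import subprocess
from typing import List

# Keywords are an assumption about how memory events are worded in the SEL;
# adjust them to match what your BMC actually records.
MEMORY_KEYWORDS = ("memory", "ecc", "dimm")


def memory_events(sel_text: str) -> List[str]:
    """Return SEL lines that mention memory, ECC, or a DIMM (case-insensitive)."""
    hits = []
    for line in sel_text.splitlines():
        lowered = line.lower()
        if any(word in lowered for word in MEMORY_KEYWORDS):
            hits.append(line.strip())
    return hits


def read_sel() -> str:
    """Run `ipmitool sel elist` and return its output as text.

    Needs ipmitool installed and a BMC reachable through the local IPMI driver.
    """
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

On the server itself, memory_events(read_sel()) would list whatever the BMC recorded around the time of the crash, if it recorded anything at all.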

The other alternative is to set up a Linux crash kernel (kdump) to boot into and save the dmesg output; this can also catch information about the hardware failure.

The third alternative is to log the server's serial console somewhere persistent; it will also capture clues about a crash, whether the cause is software or hardware.

score 1 · Accepted Answer · answered Sep 23 '14 at 19:39

This answer shares how I stopped the system from crashing, but it does not address the original question. I'm still researching solutions and will share any new information as I learn it.

The system is a white box with a Supermicro H8SGL-F motherboard and 64GB (16x4) Hynix plus 32GB (16x2) Viking ram. The motherboard specification says that ram modules must be installed in sets of four because the processor uses a quad-channel memory controller. I threw the extra two Viking modules in to see if it worked, and it did. This configuration worked for months, but it was my first mistake.

My second mistake was that I installed the ram incorrectly. The memory slots are color-coded and interleaved for the quad-channel setup. I had the ram installed like this:

[ ========== ] 16GB Hynix
[ ---------- ] 16GB Hynix
[ ========== ] 16GB Hynix
[ ---------- ] 16GB Hynix
[ ========== ] 16GB Viking
[ ---------- ] 16GB Viking
[ ========== ]
[ ---------- ]

While this setup did work for several months and only started producing a problem recently, I could not determine whether the fault was due to increased capacity causing a problem with my out-of-spec layout or whether a module actually had an issue.

As I only had one production system, I removed all of the modules and started rotating them in as pairs (still out of spec), running the system at reduced capacity for several weeks. I received no crashes and there were no reports of ECC errors from edac-util. However, it's possible that a faulty module was in the second slot and simply was not accessed in a way that would cause a fault.
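
For anyone wanting to keep an eye on the counters during a rotation like this without leaving a terminal open, here is a minimal Python sketch that reads the EDAC counters from sysfs, the same kernel interface edac-util reports from. It assumes the EDAC driver for the platform is loaded and that the counters live under /sys/devices/system/edac/mc; the function names are hypothetical.

```python
import glob
import os
from typing import Dict


def read_edac_counts(base: str = "/sys/devices/system/edac/mc") -> Dict[str, Dict[str, int]]:
    """Map each memory controller (mc0, mc1, ...) to its ce/ue counters."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(base, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as handle:
                    entry[name] = int(handle.read().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable; skip it rather than guess.
                continue
        counts[os.path.basename(mc_dir)] = entry
    return counts


def new_errors(before: Dict[str, Dict[str, int]],
               after: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Return the counters that increased between two snapshots."""
    changed = {}
    for mc, entry in after.items():
        for name, value in entry.items():
            delta = value - before.get(mc, {}).get(name, 0)
            if delta > 0:
                changed[mc + "/" + name] = delta
    return changed
```

Taking a snapshot with read_edac_counts() before and after each rotation and passing both to new_errors() shows whether any correctable (ce) or uncorrectable (ue) count moved. Note that the counters reset on reboot, so an uncorrectable error that halts the machine will not show up here afterwards.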

After rotating through the ram failed to reproduce the error, I realized that I had set up the ram incorrectly. I removed the Viking modules and set up the new layout like this:

[ ========== ] 16GB Hynix
[ ---------- ]
[ ========== ] 16GB Hynix
[ ---------- ]
[ ========== ] 16GB Hynix
[ ---------- ] 
[ ========== ] 16GB Hynix
[ ---------- ]

Since I made this change the system has remained stable. Even though the layout now matches the spec, this does not confirm whether the fault was with the layout, a Viking module (since they were removed), or one of the Hynix modules further down in the layout that is accessed infrequently enough not to fault.
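
To make the sets-of-four rule concrete, here is a small Python sketch that checks a layout written top to bottom like the diagrams above. It encodes only my reading of the rule (modules in multiples of four, and each color group of slots either full or empty); the color names "A" and "B" and the function name are placeholders, and the motherboard manual remains the authority on which slots belong together.

```python
from typing import List, Optional, Sequence

# Slot colors from top to bottom, mirroring the alternating "=" and "-" rows
# in the diagrams. "A" and "B" are placeholder names, not the board's labels.
SLOT_COLORS = ("A", "B", "A", "B", "A", "B", "A", "B")


def layout_problems(slots: Sequence[Optional[str]]) -> List[str]:
    """Return the reasons a layout breaks the sets-of-four rule (empty if none).

    `slots` lists the module in each slot from top to bottom, None if empty.
    """
    if len(slots) != len(SLOT_COLORS):
        raise ValueError("expected %d slots" % len(SLOT_COLORS))

    problems = []
    populated = [module for module in slots if module is not None]
    if len(populated) % 4 != 0:
        problems.append("%d modules installed, not a multiple of four" % len(populated))

    for color in sorted(set(SLOT_COLORS)):
        group = [slots[i] for i, c in enumerate(SLOT_COLORS) if c == color]
        filled = [module for module in group if module is not None]
        # A color group must be either completely empty or completely full.
        if filled and len(filled) != len(group):
            problems.append(
                "color %s slots only partly populated (%d of %d)"
                % (color, len(filled), len(group))
            )
    return problems


original = ["Hynix", "Hynix", "Hynix", "Hynix", "Viking", "Viking", None, None]
corrected = ["Hynix", None, "Hynix", None, "Hynix", None, "Hynix", None]
```

Under these assumptions layout_problems(original) flags both the module count and the partly filled color groups, while layout_problems(corrected) returns an empty list.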

Please see this answer not as a conclusion to the problem but as a step I have taken to address the overall issue. I am not finished and will continue to report as I keep looking for solutions.

Also of note, the system power cycled yesterday for the first time since I set the memory to the new layout. I cannot confirm whether this was caused by the memory issue I have been addressing or by a separate issue with the power supply, so take this single incident with a grain of salt.
