I run a server which has just experienced an error I've not encountered before. It emitted a few beeps, rebooted, and got stuck at the startup screen (the part where the bios shows its logo and begins listing information) with the error:
Node0: DRAM uncorrectable ECC Error
Node1: HT Link SYNC Error
After a hard reset the system booted fine and has yet to report anything on edac-util.
My research tells me that even with ECC memory and a system in ideal conditions, an uncorrectable error is still possible and probably will likely occur during the lifespan of the system at some point; some reports suggest at least once a year or sooner.
The server runs CentOS 6.5 with several ECC modules. I am already in the process of trying to diagnose which module threw the error to make an assessment whether this is a fault or the result of something as unavoidable such as a cosmic ray.
My research also suggests that when the system halts like this, there is nowhere for a log to be written and that the only reliable way to do this is to have the system attached to another with the log being written out through a serial port.
Besides the usual edac-util, memtest, stress testing, and precautionary replacement, is there anything else I should take into consideration when addressing this error?
I was unable to find any record of this crash in any of the CentOS logs I searched, which goes along with my belief that it is not possible to log this error to a local disk. The error was only reported to me by the bios after an automatic reboot. Is it advisable to be writing system logs out to serial at all times to log these kinds of errors?
Is this kind of failure avoidable using a single system or is this only possible using an expensive enterprise solution?
What can I do to provide fallback measures in these failure cases for a single production server; as in, the production server itself does not span multiple machines but a fallback server can exist.