I've seen a discussion about ECC RAM use on servers. Why is it better?
-
Question answered in another question: http://serverfault.com/questions/5817/would-you-use-ecc-ram-in-a-workstation – sh-beta May 07 '09 at 16:52
-
Is there any evidence that ECC memory is necessary or beneficial to use? The benefits and mechanism of action are easy to understand, but I've never heard evidence to justify its use. – Drew Stephens Aug 13 '09 at 15:25
-
And what are the possible consequences of such memory (bit) errors? For example, I have just switched off a server that was online for 5 years non-stop (with ECC RAM), and overall all went fine: I never had any complaints from the clients hosted there, and it never experienced a major fault. Same with my desktop computer experience: a BSOD here and there, quite rarely, but is that all? :) – Denis Volovik Nov 05 '11 at 21:02
-
@Denis, I think if you want people to answer your question you may need to ask it as a separate question rather than a comment. – Toby Allen Nov 06 '11 at 17:58
4 Answers
Excellent real-world study:
DRAM Errors in the Wild: A Large-Scale Field Study (pdf)
This paper provides the first large-scale study of DRAM memory errors in the field. It is based on data collected from Google’s server fleet over a period of more than two years making up many millions of DIMM days. The DRAM in our study covers multiple vendors, DRAM densities and technologies (DDR1, DDR2, and FBDIMM).
The paper addresses the following questions: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature, and system utilization? And how do they vary with chip-specific factors, such as chip density, memory technology and DIMM age?
We find that in many aspects DRAM errors in the field behave very differently than commonly assumed. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with FIT rates (failures in time per billion device hours) of 25,000 to 70,000 per Mbit and more than 8% of DIMMs affected per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which most previous work focuses on. We find that, out of all the factors that impact a DIMM’s error behavior in the field, temperature has a surprisingly small effect. Finally, unlike commonly feared, we don’t observe any indication that per-DIMM error rates increase with newer generations of DIMMs.
Interesting that most memory errors were hard -- hard memory errors are unrecoverable, meaning the memory has to be physically replaced as failed, whereas soft memory errors can be fixed by overwriting the memory with the correct value. This indicates to me the value of ECC is fairly limited.
There are two kinds of errors that can typically occur in a memory system. The first is called a repeatable or hard error. In this situation, a piece of hardware is broken and will consistently return incorrect results. A bit may be stuck so that it always returns "0" for example, no matter what is written to it. Hard errors usually indicate loose memory modules, blown chips, motherboard defects or other physical problems. They are relatively easy to diagnose and correct because they are consistent and repeatable.
It sounds like all the servers in the study used ECC, though, so we can't know ECC vs. non-ECC error rates.
This paper studied the incidence and characteristics of DRAM errors in a large fleet of commodity servers. Our study is based on data collected over more than 2 years and covers DIMMs of multiple vendors, generations, technologies, and capacities. All DIMMs were equipped with error correcting logic (ECC) to correct at least single bit errors.
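To put the FIT figures quoted in the abstract into more familiar units, here is a rough back-of-the-envelope conversion. It is only a sketch: the function name `errors_per_year` is made up for illustration, and it assumes 1 GB = 8192 Mbit and that the per-Mbit rate scales linearly with capacity.

```python
# FIT = failures per 10**9 device-hours; the abstract quotes the rate per Mbit.
HOURS_PER_YEAR = 24 * 365
MBIT_PER_GB = 8 * 1024  # assumption: binary gigabyte, 1 GB = 8192 Mbit


def errors_per_year(fit_per_mbit: float, capacity_gb: float) -> float:
    """Expected error events per year for `capacity_gb` of DRAM (hypothetical helper)."""
    per_hour = fit_per_mbit * MBIT_PER_GB * capacity_gb / 1e9
    return per_hour * HOURS_PER_YEAR


for fit in (25_000, 70_000):
    rate = errors_per_year(fit, 1)
    print(f"{fit} FIT/Mbit -> about {rate:.0f} errors per GB per year")
```

Under those assumptions the quoted range works out to a few thousand error events per GB per year as a fleet-wide average. Since the same abstract says only a bit more than 8% of DIMMs are affected per year, that average must be concentrated on a minority of DIMMs rather than spread evenly.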
-
+1 nice report. While I don't *know* non-ECC error rates, I *estimate* that non-ECC error rates are roughly the same as ECC error rates per GB. The same RAM chips are used in both ECC and non-ECC DIMMs (the ECC DIMMs simply use 9/8 as many chips -- 72 raw memory bits to store a 64-bit data word, and 8/9 the error rate is roughly the same error rate), and I see no reason that a RAM chip would have a significantly different error rate when placed on an ECC DIMM vs. when placed on a non-ECC DIMM. – David Cary Mar 25 '11 at 21:40
-
Even if the errors are hard, you wouldn't be aware of them unless you had ECC RAM, right? – nhooyr Nov 09 '20 at 18:58
ECC RAM can recover from small errors in bits by using parity bits. Since servers are a shared resource where up-time and reliability are important, ECC RAM is generally used with only a modest difference in price. ECC RAM is also used in CAD/CAM workstations, where small bit errors could cause calculation mistakes which become more significant problems when a design goes to manufacturing.
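As an illustration of how much a single flipped bit can change a calculation, the sketch below flips one bit of a 64-bit IEEE 754 double. The helper name `flip_bit` and the sample value are made up for this example; it only simulates the effect in software and says nothing about how often real flips occur.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double flipped (0 = least significant)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


amount = 1000.0
# Bit 0 is the lowest mantissa bit, bit 51 the highest mantissa bit,
# bit 52 the lowest exponent bit, and bit 63 the sign bit.
for bit in (0, 51, 52, 63):
    print(f"bit {bit:2d}: {amount} -> {flip_bit(amount, bit)!r}")
```

A flip in the low mantissa bits is numerically invisible for most purposes, while a flip in the exponent or sign bit changes the value drastically, so the damage depends entirely on which bit is hit.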
-
A bit error in a number anywhere, including someone's small business finance package, can be very small **or** very large. It all depends on what bit. – Zan Lynx Jul 18 '09 at 21:39
-
Add to that the fact that the wrong error in the wrong place could bring down a lot more than one machine when you've virtualized to consolidate. – MikeyB Nov 07 '11 at 02:47
-
I'm just waiting for an unscrupulous company to claim their accounting fraud was actually just a bit error. – Eloff Jul 25 '13 at 16:30
ECC has several advantages over parity. For one, it can detect and repair single-bit errors and do so without having to stop the whole system. Multiple-bit errors will still return a parity error, but the odds of this happening are astronomically low during the lifetime of a PC unless the memory itself is defective. ECC is like auto insurance: It covers you for the majority of things that can go wrong, but it can't prevent a multi-car pileup.
more detail here: ECC memory: A must for servers, not for desktop PCs
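The "correct single-bit errors, detect multi-bit errors" behaviour described above can be sketched with an extended Hamming (SECDED) code over a 64-bit word, which yields a 72-bit codeword. This is a software illustration only: the bit layout and the names `encode` and `decode` are assumptions made for this example, and real memory controllers do this in hardware and may use a different code.

```python
from typing import Optional, Tuple

CODE_BITS = 72                                   # 64 data + 7 Hamming + 1 overall parity
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]            # Hamming check-bit positions
DATA_POS = [p for p in range(1, CODE_BITS) if p not in PARITY_POS]  # 64 positions


def _syndrome(code: int) -> int:
    """XOR of the positions (1..71) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, CODE_BITS):
        if (code >> pos) & 1:
            s ^= pos
    return s


def encode(word: int) -> int:
    """Encode a 64-bit word; bit i of the returned int is codeword position i."""
    code = 0
    for i, pos in enumerate(DATA_POS):
        if (word >> i) & 1:
            code |= 1 << pos
    # Set the check bits so that the syndrome of the codeword becomes zero.
    s = _syndrome(code)
    for p in PARITY_POS:
        if s & p:
            code |= 1 << p
    # Position 0 holds overall parity, giving the codeword an even number of 1s.
    if bin(code).count("1") % 2:
        code |= 1
    return code


def decode(code: int) -> Tuple[Optional[int], str]:
    """Return (word, status); word is None when the error cannot be corrected."""
    s = _syndrome(code)
    odd = bin(code).count("1") % 2 == 1
    if odd:
        if s >= CODE_BITS:
            return None, "uncorrectable"         # syndrome points outside the codeword
        code ^= 1 << s                           # s == 0 means the parity bit itself flipped
        status = "corrected"
    elif s != 0:
        return None, "uncorrectable"             # even number of flips: detect only
    else:
        status = "ok"
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (code >> pos) & 1:
            word |= 1 << i
    return word, status
```

For example, `decode(encode(w) ^ (1 << 10))` returns `(w, "corrected")` for any 64-bit `w`, while flipping two different bits yields the status `"uncorrectable"`, which corresponds to the parity error mentioned above.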
-
I disagree with the article. I think everyone should be using ECC. I wasn't going to give in, but I wanted a new Core i7 enough that I finally did. However, I am sure my 6GB of RAM is picking up errors all over the place. – Zan Lynx Jul 18 '09 at 21:41
-
@zan and these errors you are "sure" about, what consequence do they have? – Jeff Atwood Nov 07 '11 at 02:03
-
Don't be guessing; correctable errors ought to generate MCEs which can be logged in the OS (System Log in Windows, /var/log/mcelog in Linux) – MikeyB Nov 07 '11 at 02:46
-
@JeffAtwood: Nothing usually, but I have had the occasional blue-screen for no apparent reason. On systems I have which *do* have ECC I will see a couple of single bit errors each month. – Zan Lynx Nov 07 '11 at 23:12
-
@JeffAtwood: And, like everyone I am sure, I've occasionally had to reinstall an application (Office, Visual Studio) because it has apparently gone insane. App bug or memory error causing a corrupt disk file? Who knows if you don't have ECC? – Zan Lynx Nov 07 '11 at 23:20
To make things simple, quoting from Wikipedia:
Electrical or magnetic interference inside a computer system can cause a single bit of DRAM to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research [5] has shown that the majority of one-off ("soft") errors in DRAM chips occur as a result of background radiation
...
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an error-correcting code