Why am I seeing Zero errors in non-ECC RAM?

4

3

According to sources, memory errors are a very probable event:

  • Some say the probability of a DRAM error is 95% in just 3 days of operation of a computer with just 4 GB of RAM,
  • others say 32% of servers experience at least one error in a month with 8% of DIMMs being at fault.

Contrary to those horror stories, in more than 10 years of using personal computers I have seen exactly zero memory errors.

I admit I never paid special attention to the subject. However, I have done multi-hour memtest86 runs a couple of times and never saw an error either.

Some factors that IMO should aggravate memory problems:

  • I build my computers out of the most "bulk commodity" parts: mainstream budget motherboards and the next-to-cheapest memory.
  • I also usually max out the technology available, e.g. in the days of 32-bit OSes I used 4 GB of RAM, and with current desktop CPUs and newer 64-bit OSes I use 32 GB of RAM.
  • Memory usage is moderately heavy, with lots of virtual machines running small and big tasks 24/7/365.

Nevertheless, I have never found any memory-related problems!

How's that?

Alexander Shcheblikin

Posted 2013-11-07T21:37:10.310

Reputation: 620

Question was closed 2013-11-08T21:08:05.940

First, if there were a 95% failure rate on memory, the industry would go out of business. The source you quoted is simply wrong – Ramhound – 2013-11-07T22:29:09.140

@Ramhound second…? – Synetech – 2013-11-07T22:29:42.730

@Synetech Just saying that if memory had a 95% error rate, it wouldn't be used. The paper is also 4 years old and only looks at DDR and DDR2. Because it was written in 2009, it's basically inaccurate: that's more or less 4 decades in technology time (yes, technology time is 10 times faster than normal time). As to the final question, it's simple: the paper and blog post (shocker) aren't 100% accurate – Ramhound – 2013-11-07T22:33:35.087

How do you know you've seen no memory errors? You've never had a crash or freeze you weren't able to explain? – David Schwartz – 2013-11-07T22:37:22.717

@DavidSchwartz If you're asking whether I personally had a crash I couldn't explain, that would be a negative. I strive to understand every crash that happens, and I normally have a general idea of the reason – Ramhound – 2013-11-07T22:40:32.567

@Ramhound Then it sounds like there's a good chance you misdiagnosed some crashes or freezes that were due to memory errors. Or you just got absurdly lucky and never had a memory error hit a vital spot. – David Schwartz – 2013-11-07T22:42:07.750

@Ramhound, you said “First…”, then nothing more. Was there a second point? – Synetech – 2013-11-07T22:53:44.300

Memory errors and memory failure are 2 different things. A flipped or missing bit here or there is generally OK in most applications that don't require ultra high fidelity / precision... As in most applications... – Austin T French – 2013-11-07T23:37:06.600

Answers

9

Blerg!

Darn, I tried to stuff this into a comment, but the formatting wasn’t sufficient, so I had to resort to putting it in an answer.

Statistics

The reason you have not seen it is that the odds of you seeing it are low and, more importantly, you were not looking. The odds of you noticing a memory error are calculated from the odds of:

  • a cosmic ray hitting the Earth
  • the ray hitting at your location
  • the ray not getting obstructed or absorbed by anything else
  • the ray hitting your computer
  • the ray hitting your RAM
  • the ray flipping a bit in the RAM
  • the bit being in a block of currently allocated memory
  • the used memory being either:
    • tested by a program like memtest86+
      • tested at just the right moment to detect the error (e.g., in the microseconds between the program writing the memory and then reading it back to compare)
    • allocated to a block of executable code, in which case also:
      • the changed bit modifies the code enough to have a drastic effect on its behavior
      • the drastic effect causes it to crash
      • the crashing program actually crashes visibly instead of simply disappearing
      • the program being something that you notice and care about
      • you don’t simply discard it as a buggy program

Of course this is if we are talking about transient, intermittent errors like from cosmic rays and interference from other electronics. If the RAM module is actually defective, then you almost certainly will see problems at some point (though even then, it is conceivable that if you never use up all the physical RAM at any given time and the defect happens to be small and localized entirely in a part that never gets used by you, then you might not see an error).

The odds of a transient error can indeed be surprisingly high, but you probably have seen memory errors over the years and simply did not notice them because of two of the above list items: the flip has to land in executable code, and you have to not dismiss the resulting crash as a buggy program.

Examples

If the changed bit happens to fall in a piece of data, then you may not even notice it because it could easily get drowned out.

For example, if a bit got flipped in a block of text data, then you might notice that "The end." turned into "Tje end.", but instead of noticing that the "h" had been replaced by a "j" because a single bit had gotten flipped (feel free to confirm if you like), you would more likely just assume that your finger hit the wrong key, because the two keys happen to be right next to each other, and just fix the error.
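To confirm the single-bit claim, here is a minimal Python sketch. The helper name `flip_bit` is made up for illustration, and it assumes plain ASCII text: "h" is 0x68 and "j" is 0x6A, so the two differ only in bit 1.

```python
def flip_bit(text: str, index: int, bit: int) -> str:
    """Return a copy of an ASCII string with one bit of one character flipped.

    `bit` should be in 0..6 so the result stays valid ASCII.
    """
    data = bytearray(text.encode("ascii"))
    data[index] ^= 1 << bit          # XOR with a one-bit mask toggles that bit
    return data.decode("ascii")

original = "The end."
corrupted = flip_bit(original, index=1, bit=1)   # character 1 is the 'h'

print(corrupted)                                  # Tje end.
print(f"{ord('h'):08b} -> {ord('j'):08b}")        # 01101000 -> 01101010
```

Flipping the same bit a second time restores the original text, which is all a "bit flip" is: one XOR.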

Worse, if the flipped bit happened to be part of a picture, audio, or video file, you may not notice anything at all. If it just happened to be in just the right place, then it might cause a noticeable change like the width or height of the picture being wrong, or a slight popping sound in the song or a bit of corruption in the video causing a momentary blockiness during decoding. However, given the sheer size of media files, the chances of a single bit being in just the right location are extremely low. It is much more likely that it will slightly change the color of a single pixel (e.g., dark red to slightly darker red) and you would probably never notice. It might change a single peak of the song’s waveform so that it has a slightly lower amplitude and you would likely never notice. It might change a single pixel in a single frame of the video and you probably could not notice.
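As a rough illustration of the pixel case, the sketch below flips one bit in the red channel of an uncompressed 8-bit RGB pixel. The helper `flip_channel_bit` and the particular dark red value are assumptions chosen for the example; compressed formats would behave differently, since a flip there can affect more than one pixel.

```python
def flip_channel_bit(pixel, channel, bit):
    """Return a copy of an (R, G, B) tuple with one bit of one channel flipped."""
    values = list(pixel)
    values[channel] ^= 1 << bit      # toggle a single bit in the chosen channel
    return tuple(values)

dark_red = (139, 0, 0)               # hypothetical pixel value for illustration

# Flipping the lowest bit changes the red channel by 1 out of 255.
print(flip_channel_bit(dark_red, channel=0, bit=0))   # (138, 0, 0)

# Flipping the highest bit is a much larger change, but still only one pixel.
print(flip_channel_bit(dark_red, channel=0, bit=7))   # (11, 0, 0)
```

How visible the change is depends on which bit gets hit: the low-order bits are effectively invisible, while a high-order bit alters one pixel noticeably if you happen to look at it closely.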

Caveat

The terrifying fact is that this sort of undetected, transient error can indeed creep in and go unnoticed. That is why I have been really concerned about using flash media for backups: sometimes they get corrupted, and if you don’t notice, then the corruption could sneak into your backup and end up permanent. Moreover, testing for corruption can be difficult because changes are expected, so you would have to manually examine every single change, which for binary files would be a nightmare.

Take away

I suppose the bright side, if there is one, is that as I said in the list, the change has to happen to land in a part of the data that is actually important. For most people, the odds of it landing in a piece of important, irreplaceable data that is to be saved tend to be really low.

You can use a program like memtest to check your RAM for defects. If it passes muster, then you only have to worry about the “one-in-a-billion chances” (I’ll leave the exact calculation to someone else if desired) of a bit of important data getting corrupted. Otherwise, a bit of “bit-rot” here or there will usually not do much other than perhaps crash a program and cause you to swear at the devs (though even then, if it doesn’t do it again…)

Synetech

Posted 2013-11-07T21:37:10.310

Reputation: 63 242

This is an extremely informative, well-written, and well-formatted answer. – sudo – 2015-01-13T08:15:16.187

Good point about the memory needing to be "tested at just the right moment to detect the error"! Indeed, memtest86 doesn't seem to test for even moderately longer term data storage scenarios. – Alexander Shcheblikin – 2013-11-08T22:59:17.987

1

While early personal computers like the IBM PC included a parity bit to detect memory errors, most modern systems don't. The result is that the errors are not caught as memory errors and instead sometimes cause other problems, like data corruption and odd crashes.

  • Memory with parity - detect errors
  • Memory with ECC - detect and correct errors
  • Memory with neither - errors go undetected

Brian

Posted 2013-11-07T21:37:10.310

Reputation: 8 439