
Are ECC memory modules important to have on a non-critical server?

I was thinking about getting myself a toy dedicated server for lots of random, non-critical stuff. Sporadic reboots are no big deal. I'm looking at one provider whose prices are insanely cheap. Their hardware sounds like a joke for any serious server box: desktop processors, non-ECC RAM, no-name chassis, no hot-swap SATA HDDs, etc. (well, the price justifies it, I guess).

I take ECC memory for granted on any "serious" server, so I'm wondering if it's a big deal or not for "toy" appliances.

masegaloeh
PJK
  • You question ECC memory yet appear happy to use SATA drives. Very strange. – John Gardeniers Feb 05 '12 at 20:52
  • @JohnGardeniers You see, even if that means a dead HDD once a year, I don't mind a few hours of downtime and RAID recovery. But having daily/weekly trouble would be annoying. Yes, I'm actually more concerned about my leisure than my uptime in this case... – PJK Feb 05 '12 at 21:02
  • @JohnGardeniers: SATA drives aren't any less reliable than SCSI/SAS HDDs: http://www.usenix.org/event/fast07/tech/schroeder/schroeder.pdf – Hubert Kario Feb 06 '12 at 01:02

5 Answers


Data published by CERN IT staff (Data Integrity) suggests that the number of errors originating in RAM is quite low. You still have to weigh the value of your data against the cost of the hardware.

You can read a bit more about this at StorageMojo.

Hubert Kario

ECC RAM basically detects and corrects errors that occur when reading from and writing to RAM. The chance of an error actually occurring is quite small, but non-zero. I would say that if you aren't doing mission-critical stuff you can get away without ECC RAM - like I said, the chances of encountering an error that ECC would prevent are really, really small.
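
To give a feel for what "correct" means here, below is a toy sketch of the single-error-correcting idea ECC modules are built on, using a Hamming(7,4) code. Real DIMMs use wider SECDED codes (e.g. 64 data bits plus 8 check bits), so treat this as an illustration of the principle, not the actual hardware scheme:

    # Toy Hamming(7,4) code: 4 data bits stored with 3 parity bits, so any
    # single flipped bit can be located and corrected on read.
    def encode(d1, d2, d3, d4):
        p1 = d1 ^ d2 ^ d4               # parity over codeword positions 1,3,5,7
        p2 = d1 ^ d3 ^ d4               # parity over positions 2,3,6,7
        p3 = d2 ^ d3 ^ d4               # parity over positions 4,5,6,7
        return [p1, p2, d1, p3, d2, d3, d4]

    def correct(c):
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3  # non-zero => position of the bad bit
        if syndrome:
            c[syndrome - 1] ^= 1         # flip it back
        return [c[2], c[4], c[5], c[6]]  # the recovered data bits

    word = encode(1, 0, 1, 1)
    word[5] ^= 1                         # simulate a single-bit flip in storage
    assert correct(word) == [1, 0, 1, 1]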

BenGC

What is a non-critical server? One that can fail?

ECC RAM is fundamental when memory reliability is fundamental.

Two things grow as memory sizes grow:

  • the reliance of software on memory, especially server software (take caching, for example)
  • the probability of at least one memory error (roughly p ≈ num_bits * p_bit_failure for small per-bit failure rates; see the sketch after this list)
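
As a rough back-of-the-envelope (the per-bit rate below is purely illustrative, not a measured value), you can see how quickly "at least one flipped bit" becomes likely as memory grows; for small per-bit rates this matches the p ≈ num_bits * p_bit_failure approximation above:

    # Back-of-the-envelope: chance of at least one bit error per year of uptime.
    # p_bit is an assumed, illustrative per-bit flip probability per hour.
    p_bit = 1e-15
    hours = 24 * 365

    for gib in (4, 16, 64):
        bits = gib * 2**30 * 8
        # P(>= 1 error) = 1 - P(no error) = 1 - (1 - p_bit)^(bits * hours)
        p_any = 1 - (1 - p_bit) ** (bits * hours)
        print(f"{gib:3d} GiB: ~{p_any:.0%} chance of at least one bit error per year")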

This Intel presentation on ECC reports these figures:

  • The average rate of memory errors for a server with 4 GB of memory running 24x7 is 150 errors per year
  • ~4,000 correctable errors per memory module per year
  • Overclocking and system age greatly increase failure rates
  • Recurrent failures are common and happen quickly (97% occur within 10 days of the first failure) => an avalanche effect
  • For an ECC server with a lifespan of 3 to 5 years, the chance of a system failure due to an uncorrectable memory error is less than 0.001%

Another recent study from the University of Wisconsin shows ECC to be essential for ZFS systems in particular:

ZFS has no precautions for memory corruptions: bad data blocks are returned to the user or written to disk, file system operations fail, and many times the whole system crashes.

It is important to note that other filesystems are just as sensitive to this form of data corruption as ZFS is.

ECC is what saves you from these problems when it can, and in the disastrous cases, what warns you that they are happening before it's too late.

michele

It's simply not that important. If you needed 99.999% uptime, you'd worry about it. Otherwise, you'll reboot more often than you'll get memory errors.

Jim B

This study by Google from 2009 found an error rate between 25,000 and 70,000 errors per billion device hours per megabit. For 8 GiB of (used) RAM, that works out to roughly 1.7 to 4.8 errors per hour.
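
If you want to sanity-check that conversion (assuming 8 GiB and decimal megabits, which is what makes the figures above come out), the arithmetic is:

    # Convert the study's rate (errors per 10^9 device-hours per Mbit) into
    # expected errors per hour for 8 GiB of RAM.
    mbits = 8 * 2**30 * 8 / 1e6          # 8 GiB expressed in (decimal) megabits
    for rate in (25_000, 70_000):
        print(f"{rate}: ~{rate / 1e9 * mbits:.1f} errors per hour")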

Bit flips are real and shouldn't be ignored whenever data integrity matters.

In your case (random, non-critical stuff) it would probably be overkill.

bl4x1