What is a non-critical server? One that can fail?
ECC RAM is fundamental whenever memory reliability matters.
Two things grow as memory sizes grow:
- the reliance of software on memory, especially server software (caching, for example)
- the probability of a memory error (p ≈ num_bits * p_bit_failure, a good approximation while p is small)
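To make the second point concrete, here is a minimal sketch that compares the linear approximation above with the exact probability that at least one bit fails, assuming bits fail independently. The per-bit failure probability `P_BIT` is a made-up number for illustration only, not a measured rate, and `p_any_error` is a hypothetical helper name.

```python
import math

def p_any_error(num_bits: int, p_bit_failure: float) -> float:
    """Probability that at least one of num_bits independent bits fails."""
    # Exact form is 1 - (1 - p)^n; log1p/expm1 keep it accurate for tiny p.
    return -math.expm1(num_bits * math.log1p(-p_bit_failure))

# Hypothetical per-bit failure probability, chosen only for illustration.
P_BIT = 1e-15

for gib in (1, 4, 16, 64):
    bits = gib * 2**30 * 8  # number of bits in `gib` GiB
    linear = bits * P_BIT   # the approximation p = num_bits * p_bit_failure
    exact = p_any_error(bits, P_BIT)
    print(f"{gib:>3} GiB  linear={linear:.3e}  exact={exact:.3e}")
```

Under these assumptions both columns grow in step with memory size; the linear form only starts to overestimate once the product approaches 1.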
This Intel presentation on ECC reports these facts:
- The average rate of memory errors for a server with 4GB of memory running 24x7 is 150 per year
- ~4000 correctable errors per memory module per year
- Overclocking and system age greatly increase failure rates
- Recurrent failures are common and happen quickly (97% occur within 10 days of first failure) => avalanche effect
- For an ECC server with a lifespan of 3 to 5 years, the chance of system failure due to an uncorrectable memory error is less than 0.001%
Another recent study by WISC shows ECC to be essential for ZFS systems:
ZFS has no precautions for memory corruptions: bad data blocks are returned to the user or written to disk, file system operations fail, and many times the whole system crashes.
It is important to note that other filesystems are just as sensitive to this form of data corruption as ZFS is.
ECC is what saves you from these problems whenever the error can be corrected and, in the disastrous cases where it cannot, what warns you that this is happening before it's too late.
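The "correct when possible, warn otherwise" behaviour can be sketched with a toy single-error-correcting, double-error-detecting (SECDED) code. This is an illustration only: it protects 4 data bits with a Hamming(7,4) code plus one overall parity bit, whereas real ECC modules commonly protect 64-bit words with 8 check bits. The function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as 8 bits: Hamming(7,4) plus an overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2          # 1 means an odd number of bits flipped
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1             # single flip: syndrome names the bad position
        status = "corrected"
    else:
        return "uncorrectable", None    # two flips: detected, but cannot be fixed
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
one_flip = word.copy(); one_flip[5] ^= 1
two_flips = word.copy(); two_flips[2] ^= 1; two_flips[6] ^= 1
print(decode(word))       # ('ok', 11)
print(decode(one_flip))   # ('corrected', 11)
print(decode(two_flips))  # ('uncorrectable', None)
```

Without the check bits, both corrupted words would simply be handed to the application or the filesystem as if they were valid data.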