
I will be building a medium scale cluster (20 nodes, expanding later) and for various reasons, using commodity hardware should give me a significant cost saving (even allowing for shorter operational cycles / failures). My worry is about persistent memory faults.

The obvious solution here is to run memtest regularly on each node - but this poses 2 issues:

  • while memtest has a run-once then exit mode - how do I configure (in advance) what should happen after it exits (i.e. boot Linux)

  • the run-once mode simply halts if errors occur - how do I project that status out of the host?

symcbean

3 Answers


Practical? Not regularly as part of ongoing operations. Waiting for downtime to burn in memory won't detect transient bit flips, and it introduces significant lag in detecting persistent failures. Further, if you mean the open-source memtest86+, there are integration challenges such as no UEFI support and no easy way to automate the reporting of failures.

Instead, get hardware with sufficient RAS features, namely ECC memory. Then your server can report memory failures to you.

Such errors might not be very common; servers without ECC won't immediately crash and burn, so it is a choice. However, the price premium is often small, if there is even an option for non-ECC RAM on your server model.
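To illustrate the "server can report memory failures to you" point: on Linux, ECC error counters are exposed through the EDAC subsystem under `/sys/devices/system/edac`. A minimal sketch of a poller (the `sum_ce` helper name is made up here, and the directory layout is assumed to follow the standard EDAC sysfs layout):

```shell
# sum_ce DIR: sum the EDAC correctable-error counters under DIR.
# The layout mirrors /sys/devices/system/edac/mc (one mcN dir per
# memory controller, each with a ce_count file).
sum_ce() {
  total=0
  for f in "$1"/mc*/ce_count; do
    [ -r "$f" ] || continue
    total=$(( total + $(cat "$f") ))
  done
  echo "$total"
}

# On a real ECC host:
#   sum_ce /sys/devices/system/edac/mc
```

A nonzero (and especially a growing) correctable-error count on one DIMM is exactly the early-warning signal for an incipient persistent fault that memtest cycling is trying to approximate.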

John Mahowald
  • Memtest seems to be working fine on the UEFI machine I am typing this on. And the cost benefit arises from the (apparent) price fixing applied to ECC-capable hardware (notably CPUs). – symcbean Feb 20 '20 at 14:08
  • To put it bluntly, AMD doesn't segment ECC as a feature the way Intel does. I would be more convinced of the value of this with some insight into your costs, hardware and time. You are buying memtest86 licenses, inventing a continual memory burn-in scheme, and are still under some constraint not to get ECC? Ouch. – John Mahowald Feb 20 '20 at 20:54

I now have an answer to the first part of my question. The grub distribution includes a tool called grubonce. Hence, if Linux is my default in grub, I can ask grub to run memtest once (after which it will revert to the default).
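On systems with GRUB 2, the analogous one-shot mechanism is `grub-reboot`, which selects a menu entry for the next boot only and then reverts to the saved default. A minimal sketch (the menu-entry title is an assumption; it must match your generated grub.cfg, and `GRUB_DEFAULT=saved` must be set):

```shell
# One-shot boot into memtest, then revert to the default entry.
# Prerequisite: GRUB_DEFAULT=saved in /etc/default/grub, followed by
# regenerating grub.cfg (e.g. update-grub on Debian-family systems).
sudo grub-reboot "Memory test (memtest86+)"   # entry title is an assumption
sudo reboot
```

This keeps Linux as the permanent default, so a power cycle after a crashed or hung memtest still brings the node back into the OS.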

So far it seems my only option for the second part is to watch for a machine staying offline (i.e. not running Linux) after a scheduled memtest should have completed.
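That watch can be reduced to a simple timestamp comparison on the monitoring host: if the memtest window has closed and the node has not checked in since before the window closed, flag it. A sketch (the `overdue` helper and its argument convention are invented for illustration; "last seen" would come from whatever heartbeat or monitoring you already run):

```shell
# overdue START DURATION LAST_SEEN NOW   (all UNIX epoch seconds)
# Exit 0 (true) when the scheduled memtest window (START..START+DURATION)
# has closed but the node was last seen before the window closed,
# i.e. it never came back online.
overdue() {
  deadline=$(( $1 + $2 ))
  [ "$4" -gt "$deadline" ] && [ "$3" -lt "$deadline" ]
}
```

A cron job on the monitoring host could call this per node and alert on any that return true; memtest halting on error then shows up as a node that never reappears.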

symcbean

May I know what application you run, and what you mean by a persistent memory fault?

AFAIK a lot of today's applications run really well on non-ECC RAM, and most crashes are not related to ECC issues but rather to out-of-memory conditions or bugs.

And scanning the RAM to identify an error is very inefficient. The first place you could spot a potential error is in the log files; only if you find a symptom should you then run memtest.
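The log-file check can itself be automated. A sketch of a kernel-log filter (the `scan_mem_errors` name and the pattern list are assumptions; extend the patterns for your hardware's reporting):

```shell
# scan_mem_errors: filter kernel-log text (stdin) for memory-related
# messages such as EDAC reports and machine-check exceptions.
scan_mem_errors() {
  grep -iE 'EDAC|machine check|mce:|memory error'
}

# On a real host:
#   journalctl -k | scan_mem_errors
```

Running this periodically across the cluster turns "watch the logs" into a cheap screening step, with memtest reserved for nodes that actually show symptoms.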

I think it would be good to first clarify your reasoning for doing this, so we can identify a better solution. What do you think?

teclinux
  • persistent memory fault = broken and stays broken – obviously this is not the same as a random bit flip; however, the research I've read suggests the latter is an indicator of an incipient form of the former (unless we are specifically talking about rowhammer). In the absence of ECC, how does a system detect a memory fault, let alone report on it? – symcbean Feb 20 '20 at 18:24
  • If it is permanently broken then obviously your application will crash, and you only need to run memtest at that point. – teclinux Feb 21 '20 at 01:14
  • Trying to prevent it before it happens is good, but doing it via memtest is very inefficient and introduces downtime for every machine. Compared to just taking the one problematic machine offline, the latter would definitely be preferred. I have run simulation software (nearly 16 hours+ every day) on non-ECC RAM for over 6 years across 10+ PCs and don't have a single RAM issue. – teclinux Feb 21 '20 at 01:32