I will be building a medium-scale cluster (20 nodes, expanding later). For various reasons, commodity hardware should give me a significant cost saving, even allowing for shorter operational cycles and failures. My worry is persistent memory faults.
The obvious solution is to run memtest regularly on each node, but this poses two issues:
1. Memtest has a run-once-then-exit mode, but how do I configure (in advance) what should happen after it exits (i.e. boot Linux)?
2. The run-once mode simply halts if errors occur, so how do I report that status off the host?
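To make the second issue concrete: since a node that hits errors just halts, one indirect signal would be a node that never checks back in after its test window. The following is only a sketch of that idea, not a known solution. It assumes a separate monitoring host that records when each node was sent off to test and when it was last heard from; the function name `overdue_nodes`, the node names, and the 4-hour limit are all hypothetical.

```python
from datetime import datetime, timedelta


def overdue_nodes(test_started, last_seen, now, max_test_duration):
    """Return nodes that left for a memory test and have not reported back in time.

    test_started: node name -> time the node was rebooted into the memory test
    last_seen:    node name -> time of the node's most recent check-in (may be missing)
    """
    overdue = []
    for node, started in test_started.items():
        seen = last_seen.get(node)
        # A check-in after the test started means the node booted Linux again.
        back = seen is not None and seen > started
        if not back and now - started > max_test_duration:
            overdue.append(node)
    return sorted(overdue)


# Hypothetical data: three nodes, with an assumed 4-hour limit for a full pass.
now = datetime(2024, 1, 1, 6, 0)
started = {
    "node01": datetime(2024, 1, 1, 1, 0),
    "node02": datetime(2024, 1, 1, 1, 0),
    "node03": datetime(2024, 1, 1, 5, 0),
}
seen = {
    "node01": datetime(2024, 1, 1, 3, 30),  # came back after the test
    "node02": datetime(2024, 1, 1, 0, 59),  # last seen before the test began
}
print(overdue_nodes(started, seen, now, timedelta(hours=4)))  # ['node02']
```

This only infers a failure from silence, so it cannot tell a memory error apart from any other reason a node failed to come back, which is why I would still prefer a way to get the actual result off the host.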