Hardware failures while building new cluster

0

Someone at my company is building a high-performance cluster (50 CPU cores, half a dozen machines, 32 memory modules per machine). We have no experience with clusters, and we are concerned that the build is taking far too long (more than 2 months so far). Each time I contact him, he attributes the delay to hardware failures (several CPUs and memory modules failing).

I am looking for some advice: is it normal for several CPUs and memory modules to fail in a brand-new cluster? Or is human error the more likely cause?

draguignan

Posted 2016-07-13T08:35:16.617

Reputation: 11

Are the machines all identical in terms of hardware? What interconnect do the nodes use to communicate with each other? Ethernet? What software are you going to use to make all of these machines act as one large "supercomputer"? Is the hardware used or brand new? Are the machines connected to your LAN, or are they on a separate, isolated network? – Richie086 – 2016-07-25T15:21:54.750

Answers

0

CPUs almost never fail, and RAM fails only rarely. If several parts of both types are reported as failing in a new build, then the real issue is probably that the builder ran into unforeseen compatibility issues.
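To put a rough number on "fairly rarely", here is a minimal sketch that treats each memory module as independently defective with some fixed probability and computes the chance of seeing at least k bad modules. The module count comes from the question (6 machines, 32 modules each); the defect rate `ASSUMED_DEFECT_RATE` is a placeholder I made up for illustration, not a measured or vendor figure, so substitute whatever rate you consider realistic.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    # Sum the probabilities of 0..k-1 failures and take the complement.
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


MODULES = 6 * 32  # 6 machines x 32 modules each, as stated in the question
ASSUMED_DEFECT_RATE = 0.005  # hypothetical per-module rate, NOT a real statistic

for k in (1, 3, 5):
    p_k = prob_at_least(k, MODULES, ASSUMED_DEFECT_RATE)
    print(f"P(at least {k} defective modules) = {p_k:.4f}")
```

Compare the printed probabilities with the number of failures actually being reported. If the reported count is very unlikely under any plausible defect rate, independent hardware faults become a less convincing explanation than a shared cause such as incompatible parts or handling during assembly.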

With that much memory (6 machines × 32 modules = 192 modules of unknown size), it is conceivable that bit-flip errors will start showing up with alarming frequency. I hope the RAM is ECC; otherwise, this may be the source of many delays and false starts.
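If the nodes run Linux and the RAM is ECC, the kernel's EDAC subsystem usually exposes per-memory-controller error counters under sysfs, which gives a quick way to see whether bit errors are actually occurring. The sketch below assumes the standard location `/sys/devices/system/edac/mc` and that an EDAC driver for the platform is loaded; the function name `read_edac_counts` is my own. An empty result only means no counters were found, not that the memory is non-ECC.

```python
from pathlib import Path

# Standard sysfs location for EDAC memory-controller counters on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict[str, dict[str, int]]:
    """Collect corrected (ce_count) and uncorrected (ue_count) error totals.

    Returns a mapping such as {"mc0": {"ce_count": 0, "ue_count": 0}}.
    Returns an empty dict if the EDAC directory does not exist.
    """
    counts: dict[str, dict[str, int]] = {}
    if not root.is_dir():
        return counts
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry: dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        if entry:
            counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    results = read_edac_counts()
    if not results:
        print("No EDAC counters found (driver not loaded, or not a Linux host).")
    for controller, values in results.items():
        print(controller, values)
```

Running this on each node after a burn-in period shows whether corrected errors cluster on particular machines. A steadily rising `ce_count` on one controller points at specific modules or slots, while any non-zero `ue_count` deserves immediate attention.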

Adam Wykes

Posted 2016-07-13T08:35:16.617

Reputation: 381