Hardware failures while building new cluster

Someone at my company is building a high-performance cluster (50 CPU cores, half a dozen machines, 32 memory modules per machine). We have no experience with clusters, and we are concerned that the build is taking far too long (more than 2 months). Each time I contact him, he attributes the delay to hardware failures (several CPUs and memory modules have failed).

I am looking for some advice: is it normal for several CPUs and memory modules to fail in a brand new cluster? Or is it more likely down to human error?

hardware-failure
cluster

draguignan

Posted 2016-07-13T08:35:16.617

Reputation: 11

Are the machines all identical in terms of hardware? What interconnect are you using to make each node communicate? Ethernet? What software are you going to use to make all of these machines act as one large "supercomputer"? Is the hardware used or is it brand new? Are the machines connected to a LAN or are they on a separate network with no LAN connection? – Richie086 – 2016-07-25T15:21:54.750

Answers