
I'm a Computer Engineering student working on a project with a Verari blade cluster, which is a bit outdated by today's standards. I have some Unix experience, but I'm not an expert at all.

This Verari cluster has 30 working blade nodes: 20 with two dual-core AMD Opteron 250 CPUs, 4 GB of DDR RAM and two 250 GB IDE HDDs each, and the other 10 with two quad-core Opteron CPUs and 8 GB of RAM, with the same IDE HDDs. These 30 nodes are attached to a patch panel that ends on two gigabit switches, connected to each other with two Cat 6 cables and with bonding enabled on both switches. I also have an IBM workstation that hosts the DNS, DHCP, HTTP, LDAP, PXE/TFTP and FOG servers for my domain.

My mission is to build a Beowulf cluster with this hardware. It will be used for MPI programs, scientific calculations and geological simulations. My initial plan is CentOS 6.5 with a good kickstart file to ease deployment, software RAID 1 on each node, central user authentication against an OpenLDAP server, OpenMPI, and the SLURM resource manager.
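
For reference, this is roughly what the RAID 1 part of that kickstart file could look like (a minimal sketch; the device names, partition sizes and layout are assumptions and would need to match how Anaconda actually detects the two IDE disks):

    # Assumed disk names; the IDE drives may show up as hda/hdb instead of sda/sdb
    zerombr
    clearpart --all --initlabel --drives=sda,sdb

    # Matching RAID members on both disks
    part raid.01 --size=500  --ondisk=sda
    part raid.02 --size=500  --ondisk=sdb
    part raid.11 --size=4096 --ondisk=sda
    part raid.12 --size=4096 --ondisk=sdb
    part raid.21 --size=1 --grow --ondisk=sda
    part raid.22 --size=1 --grow --ondisk=sdb

    # Mirror /boot, swap and / across the two disks
    raid /boot --level=1 --device=md0 raid.01 raid.02
    raid swap  --level=1 --device=md1 raid.11 raid.12
    raid /     --level=1 --device=md2 raid.21 raid.22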

Since I don't have central storage to use yet, I have to find a way to keep user home directories accessible from every compute node, with minimal performance overhead and a bit of redundancy in case things go wrong (this is 2004-2006 hardware and is more likely to fail). What I have thought of is automounted NFS shares, with each compute node exporting a /home folder and the homeDirectory path stored in the user's LDAP account. That ends up with as many as 30 NFS servers on a gigabit link, mixing storage nodes with compute nodes, which is not good practice, but it's what I've got. Remember that these are IDE HDDs, so we have the good old read/write bottleneck there.
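
To make that concrete, here is a minimal sketch of what I have in mind (the hostnames, subnet and the /net/home mount point are assumptions):

    # /etc/exports on every compute node, exporting its local /home
    /home  10.0.0.0/24(rw,sync,no_subtree_check)

    # /etc/auto.master on every node
    /net/home  /etc/auto.clusterhome

    # /etc/auto.clusterhome -- one entry per node
    node01  -fstype=nfs,rw,hard,intr  node01:/home
    node02  -fstype=nfs,rw,hard,intr  node02:/home
    # ...and so on up to node30

    # The user's LDAP entry then points at the automounted path, e.g.:
    # homeDirectory: /net/home/node07/jdoe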

Another idea that comes to mind is to use a distributed file system, again mixing compute nodes with storage nodes. I have read about GlusterFS, Ceph, AFS, PVFS2, OrangeFS and Lustre. For what I need, I think Lustre is the way to go, but it is meant to run on a group of NAS/SAN servers attached to the compute nodes over InfiniBand, Myrinet or some other high-speed, low-latency link. To use Lustre on my infrastructure, I would need one central node for the MDS/MDT and the other 29 nodes acting as OSS/compute nodes, each hosting an OST. I can recover from failures with either option, but I don't know how Lustre will scale with 30 nodes acting as storage and compute units at the same time.
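
For reference, this is roughly how I imagine that layout being created (a sketch only; the hostname mds01, the /dev/md3 backing device and the mount points are assumptions, and every OST would need its own unique index):

    # On the central node: combined MGS/MDS on top of the metadata target
    mkfs.lustre --fsname=cluster --mgs --mdt --index=0 /dev/md3
    mount -t lustre /dev/md3 /mnt/mdt

    # On each of the other 29 nodes, acting as an OSS with one OST
    # (--index must be unique per OST: 1, 2, ... 29)
    mkfs.lustre --fsname=cluster --ost --index=1 --mgsnode=mds01@tcp0 /dev/md3
    mount -t lustre /dev/md3 /mnt/ost

    # On every compute node, mounting the file system itself
    mount -t lustre mds01@tcp0:/cluster /lustre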

Does anybody have a better idea of what to use for this project? Any experience or feedback from similar setups?

Thanks in advance for your answers.

archector
  • Computing is the main purpose and redundant home-dirs with the same content are a tertiary goal? Why do you bother about home-dir performance? – Nils Apr 19 '14 at 17:56
  • I haven't heard the term "Beowulf Cluster" since *Slashdot* was my browser's home page! – ewwhite Apr 19 '14 at 19:06
  • @ewwhite This is a professional academic cluster setup. (not targeted towards you:) I see no reason to classify this as off-topic. – Nils Apr 21 '14 at 10:56
  • @Nils I agree. I put a reopen vote on it, but am not holding my breath. Even if it's not off topic... it seems too broad or opinion-based to me. But we shall see. – HopelessN00b Apr 21 '14 at 12:59
  • I was asking for professional assistance; I don't have any experience with distributed file systems implemented in this kind of setup. This is not a personal project: I'm working as a student intern for a university, looking forward to graduating, and this really slows me down. I'll try to rewrite my question. – archector Apr 21 '14 at 15:48
  • We don't cater to academia unfortunately. The premise of SF is *"for professionals, by professionals"*. Your question may be appropriate for other SE sites, but it is not appropriate for this particular one. – Andrew B Apr 22 '14 at 07:29

1 Answer


My use of clusters has always had HA as the primary goal and speed as the secondary one.

I have found that a very conservative approach can fulfill both goals if we are talking about fewer than 1,000 concurrent users.

For the home dirs I would go for a simple NFS-based two-node active/passive cluster, with an even number of shares distributed between the two nodes in primary/secondary DRBD roles.
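
A minimal sketch of one such DRBD resource, assuming two dedicated nodes named nfs01/nfs02, an md device as the backing disk, and a cluster manager such as Pacemaker or Heartbeat handling failover and the NFS export:

    # /etc/drbd.d/home.res -- hostnames, addresses and backing devices are assumptions
    resource home {
        protocol C;
        on nfs01 {
            device    /dev/drbd0;
            disk      /dev/md3;
            address   10.0.0.101:7789;
            meta-disk internal;
        }
        on nfs02 {
            device    /dev/drbd0;
            disk      /dev/md3;
            address   10.0.0.102:7789;
            meta-disk internal;
        }
    }

    # The currently active node mounts /dev/drbd0 and exports it, e.g. in /etc/exports:
    # /srv/home  10.0.0.0/24(rw,sync,no_subtree_check)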

Nils