7

We have a small cluster of six Ubuntu servers that we use to run bioinformatics analyses. Each analysis takes about 24 hours to complete, each Core i7 server can handle two at a time, and each takes as input about 5GB of data and outputs about 10-25GB of data. We run dozens of these a week. The software is a hodgepodge of custom Perl scripts and third-party sequence alignment software written in C/C++.

Currently, files are served from two of the compute nodes (yes, we're using compute nodes as file servers). Each of these has five 1TB SATA drives mounted separately (no RAID) and pooled via GlusterFS 2.0.1. They each have 3 bonded Intel PCI gigabit ethernet cards, attached to a D-Link DGS-1224T switch ($300, 24-port, consumer-level). We are not currently using jumbo frames (not sure why, actually). The two file-serving compute nodes are then mirrored via GlusterFS.

Each of the four other nodes mounts the files via GlusterFS.

The files are all large (4GB+) and are stored as bare files (no database etc.), if that matters.

As you can imagine, this is a bit of a mess that grew organically without forethought, and we want to improve it now that we're running out of space. Our analyses are I/O intensive and storage is a bottleneck - we're only getting about 140MB/sec between the two file servers, and maybe 50MB/sec from the clients (which only have single NICs). We have a flexible budget which I can probably get up to $5k or so.

How should we spend our budget?

We need at least 10TB of storage fast enough to serve all nodes. How fast/big does the CPU/memory of such a file server have to be? Should we use NFS, ATA over Ethernet, iSCSI, GlusterFS, or something else? Should we buy two or more servers and create some sort of storage cluster, or is one server enough for such a small number of nodes? Should we invest in faster NICs (say, PCI-Express cards with multiple ports)? The switch? Should we use RAID, and if so, hardware or software? And which RAID level (5, 6, 10, etc.)?

Any ideas appreciated. We're biologists, not IT gurus.

cespinoza
  • 303
  • 1
  • 7

5 Answers

10

I'm in the field of computer science and I do research in bioinformatics. Currently 746 on Biostars :)

I have been operating the bioinformatics compute facilities at a university for 3 years (about 40 Linux servers, 300 CPUs, 100TB of disk space plus backups, roughly 1TB of RAM in total, with servers ranging from 16 to 256GB of RAM). Our cluster has 32 8-core compute nodes and 2 head nodes, and we are expanding it with 2 more 48-core compute nodes. We serve the files to the compute nodes over NFS.

I would recommend switching to NFS for your situation.

We considered switching to Gluster, Lustre, and Samba but decided not to use those.

NFS

I have a few main tips about NFS:

  1. Have a dedicated NFS server. Give it 4 cores and 16GB RAM. A dedicated server is more secure and easier to maintain, and it makes for a much more stable setup. For example, sometimes you need to reboot the NFS server - with a dedicated server your disk-accessing computations will not fail, they will simply freeze and proceed once the NFS server is back.
  2. Serve to your compute and head nodes only. No workstations. No public network.
  3. Use NFS version 3. From my experience NFSv4 was more fragile - more crashes, harder to debug. We switched the cluster from NFSv3 to NFSv4 and back several times before settling. It's a local network, so you don't need the security (integrity and/or privacy) of NFSv4. A minimal export/mount sketch follows this list.
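
To make this concrete, here is a minimal sketch of an NFSv3 export and the matching client mount. The paths, hostname and subnet are placeholders, and the option values are reasonable starting points rather than tuned numbers - the kernel will negotiate rsize/wsize down if your server can't support them.

    # /etc/exports on the NFS server (placeholder path and subnet)
    # async trades a little safety for throughput on large sequential writes
    /export/genomics  192.168.1.0/24(rw,async,no_subtree_check)

    # Apply and verify the export table
    exportfs -ra
    exportfs -v

    # On each compute node: mount with NFSv3, TCP and large block sizes
    mount -t nfs -o vers=3,tcp,hard,rsize=1048576,wsize=1048576 \
        fileserver:/export/genomics /mnt/genomics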

Storage Hardware

Our current cluster was bought 3 years ago, so it's not using SAS but rather has expensive Fibre Channel drives and SAN controllers. This is changing: all the new storage that we are buying is SAS.

I would suggest considering SAS storage. SAS is replacing Fibre Channel as a cheaper, faster and better solution. Recently I did research on the different solutions on offer. Conveniently, the options we looked at are documented on Server Fault: What are SAS external storage options (Promise, Infortrend, SuperMicro, ...)?

We recently ordered a 24TB 6Gb SAS - 6Gb SAS storage system from RAID Incorporated. Just for the storage we paid $12k. The order should arrive in a couple of weeks. This is a no-single-point-of-failure system - all components are redundant and automatically fail over if any component fails. It's attached to 2 servers, each using a different partition of the array. It is a turn-key solution, so once it's shipped we just need to connect it and power it on, and it will work (the RAID6 partitions will be mounted on Linux). The order also included servers, and RAID Incorporated are setting up Debian Linux on them for no extra cost.

Other considerations

Unfortunately, if you do bioinformatics infrastructure operations you probably need to become a storage guru.

For your 10TB partition, pick RAID6 - 2 drives can fail without you losing data. Rebuilding a 2TB drive onto a hot spare takes about 24 hours, and another drive can fail during that time; I have had 2 drives fail simultaneously in a 16-drive array.

Consider dedicating one drive to be a hot spare in the array. Once you have more than 16 drives, I would say a hot spare is a must.
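
If you end up building the array yourself with Linux software RAID rather than buying a turn-key unit, a RAID6 set with one hot spare looks roughly like this - the device names and drive count are hypothetical, so substitute your own:

    # 9 drives total: 8 active in RAID6 plus 1 hot spare (hypothetical /dev/sdb..sdj)
    mdadm --create /dev/md0 --level=6 --raid-devices=8 --spare-devices=1 /dev/sd[b-j]

    # Filesystem and mount point (XFS handles large files well)
    mkfs.xfs /dev/md0
    mkdir -p /export/genomics
    mount /dev/md0 /export/genomics

    # Watch the initial build and any later rebuild onto the spare
    cat /proc/mdstat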

Think of a plan of action for when hardware fails on the dedicated NFS server. I would keep a twin of it running as a compute node, as a potential replacement for the original NFS server.

Finally, I have to mention that our file server is running OpenSolaris (sounds unusual, I know). OpenSolaris (as it turned out for us) has excellent server hardware support (Fibre Channel, InfiniBand, ...). Setting up an NFS server from the ground up takes 1 hour - all the steps are completely straightforward: install the OS, update through a NAT, set up the network, create a ZFS pool, create ZFS filesystems, share over NFS (a minimal sketch follows the list below). Sun were the ones who developed NFS in 1984, so not surprisingly OpenSolaris is very good at serving NFS. The main reason to use OpenSolaris was ZFS - a good filesystem for bioinformatics. Some features that I like:

  • Integrity (all writes are checksummed)
  • Pooled storage, snapshots
  • NFS exports are configured on the served filesystem
  • Online compression
  • Reservations (space guarantees)
  • Block-level deduplication
  • Efficient backups (see zfs send).
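
As a rough illustration of how short the pool-to-NFS-share path is, the ZFS side looks something like the following. The pool name, device names and sizes are placeholders, not a tuned recommendation.

    # Create a double-parity pool (raidz2 is roughly comparable to RAID6); devices are hypothetical
    zpool create tank raidz2 c7t0d0 c7t1d0 c7t2d0 c7t3d0 c7t4d0 c7t5d0

    # A filesystem with compression and a space reservation
    zfs create tank/genomics
    zfs set compression=on tank/genomics
    zfs set reservation=500G tank/genomics

    # NFS sharing is a property of the filesystem itself
    zfs set sharenfs=on tank/genomics

    # Snapshot before a big run; replicate elsewhere for backup with zfs send
    zfs snapshot tank/genomics@before-run
    zfs send tank/genomics@before-run | ssh backuphost zfs receive backup/genomics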

Using Linux for your NFS server would be fine - in that case stick to XFS or Ext4.

Aleksandr Levchuk
  • 2,415
  • 3
  • 21
  • 41
  • Thanks for the helpful answer - can you elaborate on why a dedicated NFS server is preferable over a distributed cluster file system? – Stefan Seemayer Feb 20 '14 at 13:28
2

Your budget isn't going to get you very far with SAN-class hardware, but you should be able to get much better performance by beefing up the hardware you have. Get a decent RAID controller, buy more disks, get a much better switch, and maybe a good multi-port NIC (go for decent server-grade ones, like the Intel PRO/1000 GT or ET).

If your description of the I/O pattern is correct, you have roughly a 15:85 read/write ratio, so you will need to go for RAID 10 in order to improve your throughput numbers with SATA disks. Given your write bias, if you were to simply reconfigure your current drives for RAID-5 (or RAID-6, which would be more advisable at this scale), performance would plummet. RAID-10 will halve the usable capacity of the disks, though.
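
For reference, rebuilding the existing drives as Linux software RAID-10 would look something like this (device names are hypothetical; RAID-10 needs an even number of drives):

    # 10 hypothetical drives striped across mirrored pairs
    mdadm --create /dev/md0 --level=10 --raid-devices=10 /dev/sd[b-k]
    mkfs.xfs /dev/md0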

Getting all of the above, and enough disks to deliver 10TB in RAID10 for $5k is doable, but it's not a risk free exercise. There are some very interesting options described in this question and its answers that are worth considering if you are happy with the risks and comfortable building your own solution.

However, my main piece of advice would be to start asking yourself (or whoever signs the checks) how much a storage failure would actually cost your business, and whether you are comfortable with that risk. Your budget of $5k may just about allow you to improve performance, but you're talking about having 10TB of what I assume is business-critical data and processing capacity all riding on an infrastructure with many single points of failure. Now might be a good time to take a long, hard look at just how important this infrastructure is, and to figure out whether you can get enough of a budget together to buy a proper entry-level SAN or NAS solution.

Helvick
  • 19,579
  • 4
  • 37
  • 55
2

Are your processing tasks self-developed? Are they distributed by assigning each node some chunk of data to process?

If so, it might be more effective to bring the process closer to the data rather than serving the data to the processes. It's not too hard to do, but it requires a different way of thinking than just building servers.

First, put some drives in every node. Maybe not RAID, just a filesystem on each. Split the data across all the disks on all the nodes, and start each task on the node that holds the data it needs. Try to minimize inter-node transfers.

Of course, none of this would work if your tasks need unpredictable parts of the data.
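
If the access pattern is predictable, a very rough sketch of this with plain shell tooling (no scheduler) would be: spread the input files across the nodes once, record where each file went, and launch each job on the node that holds its input. The node names, paths and the run_analysis.pl stand-in for your own Perl scripts are all placeholders.

    #!/bin/bash
    # Hypothetical node list and staging directory
    NODES=(node1 node2 node3 node4)
    DATA_DIR=/data/incoming

    # Round-robin the input files onto each node's local scratch disk
    i=0
    for f in "$DATA_DIR"/*.fastq; do
        node=${NODES[$(( i % ${#NODES[@]} ))]}
        rsync -a "$f" "$node:/scratch/inputs/"
        echo "$(basename "$f") $node" >> placement.txt
        i=$(( i + 1 ))
    done

    # Later: run an analysis on the node that already holds the file
    node=$(awk '$1 == "sample42.fastq" {print $2}' placement.txt)
    ssh "$node" "run_analysis.pl /scratch/inputs/sample42.fastq /scratch/outputs/"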

Javier
  • 9,078
  • 2
  • 23
  • 24
1

Usually this kind of processing is about extracting information from data - but here your output is several times larger than the input?

The first thing to look at is how the data is being used. Most genetic analysis, and to a certain extent protein folding using finite element analysis, relies on sequential access to large data files rather than random access. So latency is not as much of an issue as bandwidth off the disk.

So in terms of organising your disks, you probably want as many stripes across as many platters as possible - so RAID 5 or 6.

How you go about connecting this to the processing nodes depends a lot on your budget. If you've got lots of money, then setting up multiple virtual disks in a switched-fabric SAN with the processing nodes directly attached is the way to go.

For a cheap solution (i.e. within your budget), local storage in each processing node is the way to go. The important thing is to keep your processing I/O off the network (but if necessary, use the network for copying data between nodes when no SAN is available). And if you can keep the data local, then having lots of memory on the processing nodes will help with caching.

Certainly if you're on a very strict budget, you want to put those local disks into a RAID 5 setup. Also, if possible, buffer the output to local disk while processing rather than writing directly back to the servers.
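
The "buffer locally, copy back when done" pattern is easy to wrap around an existing pipeline. A rough sketch - the script name and all paths are placeholders for your own tools:

    #!/bin/bash
    # Usage: run_job.sh <sample-name>   (hypothetical wrapper)
    SAMPLE=$1
    SCRATCH=/scratch/$USER/$SAMPLE
    RESULTS=fileserver:/export/results/$SAMPLE

    mkdir -p "$SCRATCH"
    # Placeholder for your actual Perl/C++ pipeline; it writes only to local disk
    run_analysis.pl --input "/scratch/inputs/$SAMPLE" --output "$SCRATCH"

    # One large sequential transfer back to the file server at the end
    rsync -a "$SCRATCH"/ "$RESULTS"/
    rm -rf "$SCRATCH"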

HTH

symcbean
  • 19,931
  • 1
  • 29
  • 49
1

I don't think you want to go with ATAoE, iSCSI, or FC if you can avoid it. Those are all block storage technologies, better at providing disk space to individual servers from a common pool of disks. They are not designed to share that data easily among client machines, unless you run special software for dealing with shared filesystems, with metadata managers and such.

NFS is file-based, designed to share filesystems among multiple machines, and it's free. Aleksandr is sending you in the right direction if what you need is to move the data to the processes doing the compute; if you want any job to be able to go to any node, NFS is the way to go. Throughput will probably be better, though, if you can pre-populate data on the nodes and send the jobs that need specific data to the nodes that already have it, as Javier suggests - that's the Hadoop/MapReduce way of doing it. For example, if you pre-loaded the mouse genome onto one of the nodes, then when someone runs a BLAST job against that genome you send the job to the node that already has the data, and no real data gets moved. However, that can create a bottleneck at a node if the dataset it holds is popular, and jobs can back up there while other nodes sit idle.
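
A toy version of that routing logic, assuming a small lookup table of which node holds which pre-loaded reference set (the file format, node names and the blastn invocation are illustrative, not prescriptive):

    #!/bin/bash
    # datasets.txt maps a dataset to the node that has it pre-loaded, e.g.:
    #   mouse_genome node3
    #   human_genome node1
    dataset=$1
    query=$2
    node=$(awk -v d="$dataset" '$1 == d {print $2}' datasets.txt)

    # Run the job where the data already lives; only the small query file moves
    scp "$query" "$node:/scratch/queries/"
    ssh "$node" "blastn -db /scratch/db/$dataset -query /scratch/queries/$(basename "$query") -out /scratch/results/$(basename "$query").out"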

Some of the researchers I've been working with lately have gone with "fat" nodes, or cluster-in-a-box. One bought a single 48-core (4 x 12-core CPUs) AMD-based system with 128GB of RAM for about $15k. His algorithms are highly parallel, so higher core counts make sense for him. With that much memory, there is a ton of room for Linux to use as file cache, so subsequent reads of multi-gigabyte data files on that machine are super fast. Also, with the RAID card he has, he gets about 300MB per second to his local storage. I'm not saying this machine would work for everyone, but it works for him. Before we handed it over to him, for fun I benchmarked a parallel bzip job on that machine, which compressed a 3GB text file down to 165MB in about 4 seconds (the file was cached in RAM). Quite zippy.

FYI, you're going to see what we used to call crazy load averages on high-core-count machines. Load averages of 20+ are quite common on this machine, and its interactive performance is still pretty peppy.

Rob Taylor
  • 21
  • 1