
I am a user on a cluster using NFS for our data storage needs. Recently, I have been running a pipeline which has very high I/O during some operations.

The program we think is causing the problem is Bowtie, an aligner used in bioinformatics pipelines. In short, we have alphabetic sequences in fragmented files of 1 million lines per file, which are compared against another file containing the entire dictionary. (This is an oversimplification of the algorithm.)

This dictionary is memory-mapped during the procedure. I have queue submission rights to three nodes on the cluster.

Nodes: Node1 Node2 Node3 Node4 Node5 Node6 Node7

My rights: Node1 Node2 Node3

Number of processors available to me: 128 (i.e. 128 running queue slots).

For running on the cluster, the main file is divided into chunks of 1 million lines each, and then all jobs are started using SGE.

The dictionary at this point is loaded into memory on each node, i.e. Node1, Node2 and Node3.

For each job active in a queue slot, I have the following file handles open:

1 job file containing the code to run
1 code file containing the exit code of the process
1 SGE-generated STDOUT file
1 SGE-generated STDERR file
1 file chunk
1 output file

Meaning that during this process I have 768+3 file handles open on the remote data storage (128 slots × 6 files per job, plus the memory-mapped dictionary on each of the three nodes), although the first four files are constant for every single script in the pipeline.

Whenever this happens, the NFS server on the data storage crashes and our entire cluster goes down because the storage becomes non-responsive.

Our IT personnel have suggested that this may be due to the high I/O during this process, and that NFS was possibly only ever meant for small clusters, not large-scale ones.

Therefore, we have worked out a workaround in which we plan to run this process on one of the nodes itself. But that negates the point of having a cluster at our disposal, because we would be writing to the node's local disk and not to the data storage shared across the whole cluster.

I find it hard to believe that NFS was built only for small-scale clusters and has never been successfully deployed in large, enterprise-scale environments. Could there be another reason why NFS suddenly drops the network connection?

I am certain the process in question is the cause of the cluster freeze, but I am not convinced that the read/write speed it demands is the cause of the failure. Have any of you experienced such an issue before? Is a complete protocol migration the only solution we have?

  • Can you run any non-vendor-specific tests (for I/O performance)? Something like: `dd if=/dev/null of=/{path}/file.im bs=10M count=100`. – neutrinus May 19 '15 at 11:09
  • It's common to copy some files to/from a node when a job starts/ends. If other nodes don't care about a given output file until a job is done, there isn't much point in having it on shared storage until the job is done. Same for any small static files or log files - copy them locally / remotely when needed to conserve network handles/I/O/etc. – Brian May 19 '15 at 11:27
  • Sorry about the delay, neutrinus, but we experienced another cluster freeze. The output of dd was: `0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.000442196 s, 0.0 kB/s` – FoldedChromatin May 19 '15 at 11:47
  • I believe that neutrinus meant `dd if=/dev/zero of=...`. Copying from the black hole makes no sense here. – Michal Sokolowski May 19 '15 at 11:50
  • Hi Brian, true, it doesn't make sense to have them on the shared storage from the start, but our nodes do not have any storage to call their own, maybe a few hundred gigabytes for the distro. Also, how does SGE submit qsub requests at this point if it isn't using a shared storage area? How do I make sure a job goes to a certain node, given the hard-coded paths, when submitting qsub jobs? I believe there are certain protocols which would allow everything I'm saying, but our current one offers me and others greater flexibility, such as a shared local software dir to install local software. – FoldedChromatin May 19 '15 at 11:54
  • Hi Michal, as suggested I ran `dd if=/dev/zero of=...`, the output being: `100+0 records in 100+0 records out 1048576000 bytes (1.0 GB) copied, 18.0762 s, 58.0 MB/s` – FoldedChromatin May 19 '15 at 11:59
  • If your NFS server crashes, then it has a bug; report it to your vendor or switch NFS server implementations. – psusi May 28 '15 at 22:44

1 Answer


A few suggestions learned over the years; rough command sketches for most of them follow after the list.

  1. Minimise the load on the NFS server:

Set NFS export options: `async,insecure,no_subtree_check`.

Set NFS mount options: `soft,noatime,nodiratime,nolock,vers=3`.

Also set `noatime,nodiratime` on data/tmp/scratch mounts. Make sure NFS encryption is off to reduce load. Stop the NFS lock process.

  2. Try enabling jumbo frames on the network on all hosts (if supported by the network equipment) - set the MTU to 9000 or so.

  3. Make sure RAID 10 storage is used (avoid RAID 5/6 at ALL costs) for random-write I/O. Any SSDs?

  4. Maximise the number of open FS handles allowed (the default is 2K on some systems); set it to 1M or so.

  5. Any chance of copying the mapping database together with the input data to the node-local scratch storage, and then combining/sorting the resulting SAM files as a separate step?

  6. Increase the size of each processed chunk (so that it is being processed for at least 30 minutes or more).

  7. If possible, split jobs at the highest level possible (try mapping/sorting 10 separate genomes/samples on 10 different nodes in parallel, instead of trying to map each genome in sequence using 10 hosts). Attempt checkpointing once all processes have finished.

  8. Modify the program source so that it reads data in larger chunks - e.g. 1M instead of 4K.

  9. Be aware of CPU/RAM interconnect contention (especially on AMD 4-8-socket systems); sometimes running 12-24 threads on a 48-core box is much faster than 48 threads. Try different utilisation levels. Make sure NUMA is on and configured on multi-CPU systems. Recompile with NUMA enabled.
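
For suggestion 1, a minimal sketch of where those export and mount options go. The export path, client subnet, server name and mount point are made-up placeholders, and the append-to-file approach assumes you edit these files by hand rather than through configuration management.

```
# On the NFS server: export with async + no_subtree_check
# (placeholder path and client subnet - adjust to your site)
cat >> /etc/exports <<'EOF'
/export/data  10.0.0.0/24(rw,async,insecure,no_subtree_check)
EOF
exportfs -ra

# On each compute node: mount soft, without locking or atime updates
# (placeholder server name and mount point)
cat >> /etc/fstab <<'EOF'
nfsserver:/export/data  /data  nfs  soft,noatime,nodiratime,nolock,vers=3  0 0
EOF
umount /data && mount /data
```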
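
For suggestion 2, a small sketch of bumping the MTU with iproute2. `eth0` and `nfsserver` are assumed names, and every NIC and switch in the path must support jumbo frames.

```
# Check the current MTU on the NFS-facing interface (interface name is an assumption)
ip link show eth0

# Set a 9000-byte MTU - do this on the server and on every node
ip link set dev eth0 mtu 9000

# Verify that large frames pass end-to-end without fragmentation:
# 8972 = 9000 - 20 (IP header) - 8 (ICMP header)
ping -M do -s 8972 nfsserver
```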
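
For suggestion 3, a quick way to see what the storage is actually built on; these are generic Linux commands and assume you can log in to the storage server and that software RAID is in use.

```
# Rotational flag: 0 = SSD, 1 = spinning disk
lsblk -d -o NAME,ROTA,SIZE,MODEL

# For Linux software RAID, show the RAID level in use (raid10 vs raid5/6)
cat /proc/mdstat
```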
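
For suggestion 4, a sketch of raising the system-wide and per-process open-file limits on the NFS server and the nodes; the 1M value is just the example figure from the list.

```
# System-wide limit on open file handles, now and after reboot
sysctl -w fs.file-max=1048576
echo 'fs.file-max = 1048576' >> /etc/sysctl.conf

# Per-process limit for the users running the jobs
cat >> /etc/security/limits.conf <<'EOF'
*  soft  nofile  1048576
*  hard  nofile  1048576
EOF

# Check how many handles are currently allocated vs. the maximum
cat /proc/sys/fs/file-nr
```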
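
For suggestion 5, a rough SGE job-script sketch that stages the Bowtie index and one input chunk onto the per-job local scratch directory SGE normally exposes as $TMPDIR, runs the alignment against local disk, and copies only the finished output back to NFS. All paths, the index name and the exact bowtie flags are illustrative assumptions, not taken from your pipeline, and the script assumes it was submitted as an array job (see the suggestion 7 sketch).

```
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

# Illustrative paths - adjust to your layout
INDEX=/data/reference/genome                         # bowtie index basename on NFS
CHUNK=/data/chunks/chunk_${SGE_TASK_ID}.fa           # one input chunk per array task
OUT=/data/results/chunk_${SGE_TASK_ID}.sam

# Stage the index files and the chunk onto node-local scratch
cp "${INDEX}".* "${CHUNK}" "${TMPDIR}/"

# Run entirely against local disk (flags are illustrative, not your exact command)
cd "${TMPDIR}"
bowtie -S -f genome "$(basename "${CHUNK}")" local_out.sam

# Copy only the finished result back to shared storage in one streaming write
cp local_out.sam "${OUT}"
```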
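
For suggestion 6, a sketch of cutting the input into bigger pieces with split so each job runs longer and opens fewer files overall; the 4-million-line figure is only an example, and if your files are FASTQ keep the line count a multiple of 4.

```
# Before: 1M-line chunks -> many short jobs, many open files
# After:  4M-line chunks -> fewer, longer jobs (example figure only)
split -l 4000000 -d -a 4 all_reads.txt chunk_
```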
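
For suggestion 7, one way to express "one sample per node" scheduling as a single SGE array job instead of hundreds of tiny per-chunk submissions; the file and script names are placeholders.

```
# samples.txt lists one genome/sample per line (placeholder file name)
N=$(wc -l < samples.txt)

# One array task per sample; SGE sets SGE_TASK_ID inside each task, and the
# job script can pick its sample with: sed -n "${SGE_TASK_ID}p" samples.txt
qsub -t 1-"${N}" run_sample.sh
```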
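
Suggestion 8 means changing the application, but you can see the effect of read size over NFS from the shell before touching any source. This reuses the dd test from the comments; the file path is a placeholder on the NFS mount, and the cache drop needs root.

```
# Create a 1 GiB test file on the NFS mount (placeholder path)
dd if=/dev/zero of=/data/testfile bs=1M count=1024

# Drop the client page cache so the reads really go over the wire
echo 3 > /proc/sys/vm/drop_caches

# Same 1 GiB read back with 4K requests vs 1M requests:
# the small-request run generates ~256x more NFS round trips
dd if=/data/testfile of=/dev/null bs=4k count=262144
echo 3 > /proc/sys/vm/drop_caches
dd if=/data/testfile of=/dev/null bs=1M count=1024
```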
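
For suggestion 9, a sketch of inspecting the NUMA layout and pinning a job's CPUs and memory to one socket with numactl; the bowtie command line and thread count are again illustrative.

```
# Show how many NUMA nodes exist and how memory is split between them
numactl --hardware

# Pin the aligner and its memory allocations to NUMA node 0 so threads
# don't pull data across the inter-socket interconnect
numactl --cpunodebind=0 --membind=0 bowtie -S -f -p 12 genome chunk.fa out.sam
```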

PS: Managing an efficient cluster is similar to planning/managing a building site with 1k+ workers...

Mark