I am a user on a cluster that uses NFS for data storage. Recently I have been running a pipeline with very high I/O during some of its steps.
The program we think is causing the problem is Bowtie, an aligner used in bioinformatics pipelines. In short, we have alphabetic sequences split across files of 1 million lines each, which are compared against a single file containing the entire dictionary. (This is an oversimplification of the algorithm.)
This dictionary is memory-mapped during the procedure. I have queue submission rights to three nodes on the cluster.
Nodes: Node1 Node2 Node3 Node4 Node5 Node6 Node7
My rights: Node1 Node2 Node3
Processors available to me: 128 (i.e. 128 running queue slots)
For running on the cluster, the main file is divided into chunks of 1 million lines each, and all jobs are then started using SGE.
At this point the dictionary is loaded into memory on each node, i.e. Node1, Node2 and Node3.
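Concretely, the chunking step looks roughly like this (file names are illustrative, and the qsub line is shown as a comment since the exact submission script is site-specific):

```shell
# Synthetic input so this sketch is self-contained; the real input
# is the full sequence file, which is much larger.
seq 1 3000000 > reads.txt

# Split into 1-million-line chunks: chunk_00, chunk_01, chunk_02.
split -l 1000000 -d reads.txt chunk_

# Each chunk then becomes one SGE job, roughly:
#   for c in chunk_*; do qsub -cwd align.sh "$c"; done
```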
For each job active in a queue slot, I have the following file handles open:

- 1 job file containing the code to run
- 1 file containing the exit code of the process
- 1 SGE-generated STDOUT file
- 1 SGE-generated STDERR file
- 1 input file chunk
- 1 output file
This means that during this process I have 768 + 3 file handles open on the remote data storage (128 slots × 6 files = 768), although the first four files are the same for every single script in the pipeline.
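For what it's worth, the arithmetic behind that count (assuming the extra 3 are the one-per-node dictionary mappings, which is my reading of our setup, not something I have verified with lsof) is just:

```shell
# 128 queue slots, 6 open files per job, plus 3 dictionary
# mappings (one per node) -- an assumption, not a measurement.
slots=128; files_per_job=6; dict_maps=3
echo $(( slots * files_per_job + dict_maps ))   # prints 771
```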
Whenever this happens, the NFS server on the data storage crashes and our entire cluster goes down, because the storage becomes unresponsive.
Our IT personnel have suggested that this may be due to the high I/O during this process, and that NFS was perhaps only ever meant for small clusters, not large-scale ones.
As a workaround, we are planning to run this process on one of the nodes itself. But that negates the point of having a cluster at our disposal, because we would be writing to the node's local disk rather than to the data storage shared across the whole cluster.
I find it hard to believe that NFS was built only for small clusters and has never been successfully deployed in large enterprise-scale environments. Could there be another reason why NFS suddenly drops the network connection?
I am certain the process in question is the cause of the cluster freeze, but I am not convinced that the read/write speed it demands is the cause of the failure. Have any of you experienced such an issue before? Is a complete protocol migration our only option?