We have an 8-node Isilon IQ 12000x cluster that exports storage via several NFS shares to a handful of Linux and Solaris clients.
One Linux system has one of these NFS filesystems mounted and generates moderately heavy I/O against it. Every 3-4 weeks (there is no discernible schedule, and it's sometimes more or less frequent than that), all activity on the NFS mount ceases: processes accessing it hang in uninterruptible sleep, as if the network had stopped working. About 30 minutes later the share recovers and everything continues to work normally. The kernel log from the affected machine shows:
Dec 3 10:07:29 redacted kernel: [8710020.871993] nfs: server nfs-redacted not responding, still trying
Dec 3 10:37:17 redacted kernel: [8711805.966130] nfs: server nfs-redacted OK
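For what it's worth, next time it hangs I plan to confirm which processes are stuck in uninterruptible sleep with something along these lines (just a sketch, nothing Isilon-specific):

# list processes in D state and the kernel symbol they're blocked in
ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'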
The relevant /etc/fstab line:
nfs-redacted:/ifs/nfs/export_data/shared/...redacted... /data nfs defaults 0 0
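For reference, a more explicit variant of that line I've been considering (assuming NFSv3 over TCP; the timeo/retrans values are illustrative starting points, not something we run today):

nfs-redacted:/ifs/nfs/export_data/shared/...redacted... /data nfs rw,hard,tcp,vers=3,timeo=600,retrans=2 0 0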
I've checked for scheduled processes (e.g. cron jobs) and Isilon-related functions (e.g. snapshots) that might be causing these hangups, but I can't find anything. I'm also not aware of any network issues or maintenance that would explain it. Per the kernel logs, all of the lockups last almost exactly 30 minutes.
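Next time it happens, my plan is to capture what's on the wire between the client and the filer during the 30-minute window, roughly like this (the interface name is a placeholder for whatever the client actually uses):

# capture NFS traffic between this client and the filer while the mount is hung
tcpdump -i eth0 -s 0 -w /tmp/nfs-hang.pcap host nfs-redacted and port 2049
# client-side RPC counters (retransmissions, timeouts) before and after the hang
nfsstat -rc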
Does anyone have suggestions I could try? (I considered a soft mount to avoid processes hanging on the filesystem, but I'm wary of the data corruption that could result, and it wouldn't really solve the underlying issue anyway.)
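For completeness, the soft-mount variant I considered (and rejected) would look something like this, with the timeout values purely illustrative:

nfs-redacted:/ifs/nfs/export_data/shared/...redacted... /data nfs rw,soft,tcp,timeo=100,retrans=3 0 0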