We have an 8-node Isilon IQ 12000x cluster that exports storage via several NFS shares to a handful of Linux and Solaris clients.
One Linux system has one of these NFS filesystems mounted and generates moderately heavy I/O against it. Every 3-4 weeks (there is no discernible schedule, and it's sometimes more or less frequent than that), all activity on the NFS mount ceases: processes accessing it hang in uninterruptible sleep, as if the network had stopped working. About 30 minutes later the share recovers and everything continues to work normally. The kernel log from the affected machine shows:
Dec 3 10:07:29 redacted kernel: [8710020.871993] nfs: server nfs-redacted not responding, still trying
Dec 3 10:37:17 redacted kernel: [8711805.966130] nfs: server nfs-redacted OK
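For what it's worth, next time it hangs I plan to confirm which processes are stuck in uninterruptible sleep with something along these lines (just a sketch, nothing Isilon-specific):

# list processes in D state and the kernel symbol they're blocked in
ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'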
The relevant /etc/fstab line:
nfs-redacted:/ifs/nfs/export_data/shared/...redacted... /data nfs defaults 0 0
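For reference, a more explicit variant of that line I've been considering (assuming NFSv3 over TCP; the timeo/retrans values are illustrative starting points, not something we run today):

nfs-redacted:/ifs/nfs/export_data/shared/...redacted... /data nfs rw,hard,tcp,vers=3,timeo=600,retrans=2 0 0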
I've checked for scheduled processes (e.g. cron jobs) and Isilon-related functions (e.g. snapshots) that might be causing these hangups, but I can't find anything. I'm also not aware of any network issues or maintenance that would explain it. Per the kernel logs, all of the lockups last almost exactly 30 minutes.
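Next time it happens, my plan is to capture what's on the wire between the client and the filer during the 30-minute window, roughly like this (the interface name is a placeholder for whatever the client actually uses):

# capture NFS traffic between this client and the filer while the mount is hung
tcpdump -i eth0 -s 0 -w /tmp/nfs-hang.pcap host nfs-redacted and port 2049
# client-side RPC counters (retransmissions, timeouts) before and after the hang
nfsstat -rc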
Does anyone have suggestions I could try? (I considered a soft mount to avoid processes hanging on the filesystem, but I'm wary of the data corruption that could result, and it wouldn't really solve the underlying issue anyway.)
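For completeness, the soft-mount variant I considered (and rejected) would look something like this, with the timeout values purely illustrative:

nfs-redacted:/ifs/nfs/export_data/shared/...redacted... /data nfs rw,soft,tcp,timeo=100,retrans=3 0 0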