One thing to note: when the NFS server goes down, your mounts will freeze. A soft mount times out and returns an error instead of blocking indefinitely, so the "freeze" itself can be avoided. That will not fix the problem for home directories, however, since without a home directory the user is screwed anyway.
Even when the NFS server recovers, the freeze problem will remain unless you do something about it: you'll have to kill the process on the mounting machine and remount. The reason for this is that when the NFS server comes back up, it can assign a different fsid. You can at least fix this problem by hard-coding the fsids on the NFS server, for example...
# Home Directories
/usr/users \
    192.168.16.0/22(rw,sync,no_root_squash,fsid=1) \
    192.168.80.0/22(rw,sync,no_root_squash,fsid=1)
# Scratch Space
/var/ftp/scratch \
    192.168.16.0/22(rw,async,no_root_squash,fsid=3) \
    192.168.80.0/22(rw,async,no_root_squash,fsid=3) \
    172.28.24.151(rw,async,root_squash,fsid=3)
The exports(5) man page states...
fsid=num
This option forces the filesystem identification portion of the file handle
and file attributes used on the wire to be num instead of a number derived
from the major and minor number of the block device on which the filesystem
is mounted. Any 32 bit number can be used, but it must be unique amongst
all the exported filesystems.
This can be useful for NFS failover, to ensure that both servers of the
failover pair use the same NFS file handles for the shared filesystem thus
avoiding stale file handles after failover.
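As a quick sanity check on the uniqueness rule quoted above, here is a minimal Python sketch that scans exports-style text and reports two kinds of mistakes: one export path carrying different fsid values for different clients, and one fsid value reused across different paths. The names `logical_lines`, `check_fsids` and `SAMPLE` are my own, and the parser is an assumption that only covers the simple syntax used in this post (backslash continuations, `#` comments, no quoted paths), so treat it as a sketch rather than a full exports(5) parser.

```python
import re
from collections import defaultdict

# Matches an explicit fsid option inside a client option list.
FSID_RE = re.compile(r"fsid=([^,)\s]+)")


def logical_lines(text):
    """Join backslash-continued lines, then drop comments and blank lines."""
    joined = re.sub(r"\\\n", " ", text)
    for line in joined.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            yield line


def check_fsids(text):
    """Return (conflicts, duplicates) for exports-style text.

    conflicts:  export path -> sorted fsids, when a path has more than one.
    duplicates: fsid -> sorted export paths, when paths share an fsid.
    """
    by_path = defaultdict(set)
    for line in logical_lines(text):
        path, *clients = line.split()
        for client in clients:
            by_path[path].update(FSID_RE.findall(client))

    conflicts = {p: sorted(ids) for p, ids in by_path.items() if len(ids) > 1}

    by_fsid = defaultdict(set)
    for path, ids in by_path.items():
        for fsid in ids:
            by_fsid[fsid].add(path)
    duplicates = {f: sorted(ps) for f, ps in by_fsid.items() if len(ps) > 1}

    return conflicts, duplicates


# A shortened copy of the exports entries shown earlier in this post.
SAMPLE = """\
# Home Directories
/usr/users \\
    192.168.16.0/22(rw,sync,no_root_squash,fsid=1) \\
    192.168.80.0/22(rw,sync,no_root_squash,fsid=1)
# Scratch Space
/var/ftp/scratch \\
    192.168.16.0/22(rw,async,no_root_squash,fsid=3) \\
    172.28.24.151(rw,async,root_squash,fsid=3)
"""

if __name__ == "__main__":
    print(check_fsids(SAMPLE))  # → ({}, {})
```

Two empty dictionaries mean every path uses a single fsid and no fsid is shared between paths. To check a real file, you could pass the contents of /etc/exports on the server to `check_fsids` instead of `SAMPLE`.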
While the man page implies that an explicit fsid should not matter as long as the major/minor numbers do not change (which they usually don't, except when you're exporting SAN/multipath volumes, where they may change), I've found that setting it has completely removed the problem: when the NFS server comes back, the connection is restored quickly. I still really don't know why this has made a difference for devices such as /dev/sdaX, for example.
I should now point out that my argument is largely anecdotal. It doesn't actually make sense why this has fixed the problem, but it "seems" to have fixed it somehow. There are probably other variables at play here that I've not yet discovered. =)