
Since this afternoon something is wrong with the server. On the server side I see messages in dmesg as follows:

statd: server rpc.statd not responding, timed out
lockd: cannot unmonitor <client>
statd: server rpc.statd not responding, timed out
lockd: cannot monitor <client>

On the client side I see in dmesg:

lockd: server <server> not responding, still trying
lockd: server <server> OK
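
To pull just these messages out of a noisy kernel log, a small filter can help. This is only a sketch: the function name `count_lock_msgs` is made up for illustration, and it simply counts lines on standard input that mention lockd or statd.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that mention lockd or statd.
# Reads log text on stdin, so it works with dmesg output or a saved file.
count_lock_msgs() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status zero when there are no matches (grep would return 1).
    grep -c -E '(lockd|statd):' || true
}

# Usage: dmesg | count_lock_msgs
```

Running it on both the server and a client at intervals gives a rough idea of whether the timeouts are still accumulating.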

This is paralysing the entire network! I have tried this solution suggested by Xian, but it makes no difference.

Server, Debian Linux, Squeeze 64-bit:

>> uname -a
Linux <server> 2.6.32-5-amd64 #1 SMP Fri May 10 08:43:19 UTC 2013 x86_64 GNU/Linux

Clients, Linux Mint 13-64bit:

>> uname -a
Linux <client> 3.2.0-49-generic #75-Ubuntu SMP Tue Jun 18 17:39:32 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I have not run an update on the server, so I don't know what could have changed. I did upgrade one of our client machines, but can't see why that would mess with the server, since all machines seem affected. Any ideas on how to fix this?

UPDATE 1

The server stalls for a while at

Starting portmap deamon
Starting NFS common utilities: statd idmapd

This takes about 2 minutes until boot continues...

UPDATE 2

It was indeed the upgraded client machine that caused this. It seems it somehow stalled statd on the server, causing all other machines to have issues. I rebooted the entire network, leaving that one machine off, and I did not encounter any problems. Not really a fix, but I have since downgraded that machine again, and everything seems to be stable.

ewwhite
Markus
  • Are you sure the physical network is okay? Switches? Cables? – David Schwartz Jul 31 '13 at 00:03
  • What does the server use for storage, is it mounting an external nfs mount, or is it serving nfs from local storage? check your available devices on the server with `df`, if it hangs your storage is probably offline / dead / unplugged. – user16081-JoeT Jul 31 '13 at 00:16
  • @user16081 The storage is internal raid drive. `df` comes back normal. – Markus Jul 31 '13 at 00:23
  • @DavidSchwartz I believe the network is ok. I can still log into the server remotely, without disconnects. Not sure how else to test the network, as it is all locked away somewhere. – Markus Jul 31 '13 at 00:24
  • noticing you upgraded 1 client, check this, last comment, to see if it might apply to your case? http://forums.gentoo.org/viewtopic-t-850532-start-0.html – user16081-JoeT Jul 31 '13 at 00:26
  • another possibility, broken rdns for a client machine's ip? https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/1089941 – user16081-JoeT Jul 31 '13 at 00:27
  • @user16081 Both solutions seem promising! I am giving the second one a try, but I am unsure how to apply the first one? – Markus Jul 31 '13 at 00:31
  • one more possibility before i run out the door http://serverfault.com/questions/188918/problem-with-nfs-server-lockd-timing-out-on-debian-linux – user16081-JoeT Jul 31 '13 at 00:31
  • @user16081 Fix 1: removing the `mdns4` from `nsswitch.conf` did not fix it, sadly. – Markus Jul 31 '13 at 00:40
  • is `statd` running? `ps -ef | grep -v grep | grep statd`, if so, which interfaces/ports does it have? `netstat -lnp | grep rpc.statd`, if not, start it, and paste back the output `/sbin/rpc.statd -F -d` – Tom Jul 31 '13 at 01:16
  • also `grep` for `statd` in whichever files your OS keeps its system log; `grep statd /var/log/messages*` – Tom Jul 31 '13 at 01:21
  • Having this issue too. From what I gather, I cannot rule network issues out. @DavidSchwartz is right -- "Not responding" simply means "I mailed the server a letter and it hasn't replied back." It could be any number of network issues. – djhaskin987 Jan 13 '15 at 17:40
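
In addition to the `ps` and `netstat` checks suggested in the comments above, it may be worth confirming that statd is registered with the portmapper. rpc.statd registers as RPC program 100024, shown as `status` in `rpcinfo -p` output. The sketch below assumes that output format; the function name `statd_registered` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if `rpcinfo -p` output (read from stdin)
# lists RPC program 100024, the "status" service that rpc.statd registers.
statd_registered() {
    # The first column of each row is the RPC program number.
    awk '$1 == 100024 { found = 1 } END { exit !found }'
}

# Usage on the server:
#   rpcinfo -p | statd_registered && echo "statd is registered"
```

If the process is running but not registered, lockd has no way to reach it, which would be consistent with the "not responding" messages.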

4 Answers


Here are a couple of suggestions:

I once managed to break the loopback interface (`lo`), and as a result several services, such as NFS, stopped working properly. Check with `ifconfig` whether your beloved `lo` interface is still up and running. If it is not, look at `/etc/network/interfaces` to see what is going on.
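
The loopback check can also be scripted. The sketch below uses `ip` rather than `ifconfig`, on the assumption that the iproute tools are installed, and looks for the `UP` flag in the interface flags. The function name `lo_is_up` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the interface line on stdin has the UP
# flag set, e.g. "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 ...".
lo_is_up() {
    # Match UP only as a whole flag inside <...>, so LOWER_UP alone
    # does not count.
    grep -q -E '<([^<>,]+,)*UP(,[^<>,]+)*>'
}

# Usage: ip -o link show lo | lo_is_up || echo "loopback is down"
```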

Also, as some people already mentioned, run `pgrep -l statd` and `netstat -tlnpu` to see whether statd is running and which ports it is listening on.

Or perhaps someone has changed something under `/etc` on the server side? If you do not have `/etc` under version control, see if any files have been modified recently: `find /etc -mtime -14` would show files changed during the last 14 days, for example.
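
A slightly more readable variant of that `find` call sorts the results by modification time, so the most recent changes end up at the bottom. This is a sketch that assumes GNU find (for `-printf`); the function name `recent_changes` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory that were
# modified within the last N days, oldest first. Requires GNU find.
recent_changes() {
    dir=$1
    days=${2:-14}   # default to 14 days, as in the example above
    # %T+ prints the modification time as date+time, %p the path.
    find "$dir" -type f -mtime "-$days" -printf '%T+ %p\n' | sort
}

# Usage: recent_changes /etc 14
```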

Janne Pikkarainen

Have a look here : http://sophiedogg.com/lockd-and-statd-nfs-errors/

Try :

# /etc/init.d/nfs-common stop
# /etc/init.d/nfs-kernel-server stop
# rm -rf /var/lib/nfs/statd/sm/*
# rm -rf /var/lib/nfs/statd/sm.bak/*
# /etc/init.d/nfs-common start
# /etc/init.d/nfs-kernel-server start

I had the same problem, and this solved it... but only for one month. I don't yet know why. I had to delete the files again today.
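
Since statd keeps one file per monitored peer in its `sm` directory, it may be worth recording which hosts are listed before deleting anything, especially if the problem comes back. The sketch below takes the directory as an argument because the path used above may differ between distributions; the function name `list_monitored` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the hosts statd is currently monitoring,
# one per line, by listing the files in its monitor directory.
list_monitored() {
    dir=${1:-/var/lib/nfs/statd/sm}   # path taken from the commands above
    ls -1 "$dir" 2>/dev/null
}

# Usage: list_monitored /var/lib/nfs/statd/sm > /root/statd-hosts.txt
```

Comparing the saved list between incidents might show whether the same client is involved each time.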

Marco

I had the same issue on an NFS server running Debian squeeze, and it also seemed to be triggered by some new clients (Fedora 20). Downgrading the clients was not an option for me. After some long, painful and unsuccessful debugging, I ended up discovering a different and probably unrelated bug: a readdir loop on an NFS-exported ext4 filesystem containing lots of files, similar to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1240143

(I could be wrong, but from the little I understood, this was fixed in recent kernels, so Debian squeeze may be affected.)

Long story short, to get rid of at least THAT bug I upgraded my NFS server to Debian wheezy (forcing the NFS version to 3). Now, with the same filesystem and the same clients, it has been a week without the "cannot monitor"/"not responding" issue; before the upgrade it happened daily.

Brancomat

This worked for my case :

https://lists.debian.org/debian-user/2004/10/msg00932.html

Just edit the /etc/init.d/halt script, at the end there should be the line

halt -d -f -i $poweroff $hddown

The "-i" option shuts down all network interfaces, but this seems to be too early for diskless clients. Just try removing this option, so

halt -d -f $poweroff $hddown

Note that my problem was with NFS on client with disk.

Marco