Why does my system hang when I run ps, w and possibly other commands?

I cannot run the ps or w commands on my Ubuntu 10.04 LTS machine: both hang, and I don't know why. I ran strace on each, and both stopped while reading part of the same file.

Here's the result of running ps

And here's the result for w

The file also stopped when reading ... itself? http://pastebin.com/9qRB5eHh

What is happening?

user69239

Posted 2011-02-26T01:56:50.683

Reputation: 263

Answers

I had that happen once when an NFS server went down.

The fact that it's hung trying to read information about pid 17398, and pid 17398 is in D (disk wait) state, suggests that could be the cause for you too.

read(6, "Name:\tconvert\nState:\tD (disk sle"..., 1023) = 664
open("/proc/17398/cmdline", O_RDONLY)   = 6

If you do have NFS mounts, I think the best option is to try to bring the NFS server back up.

Otherwise, umount -f <mount> might help.
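
One way to spot processes stuck in D state without running ps is to read only /proc/<pid>/status, which the trace above shows was still readable, and to stay away from /proc/<pid>/cmdline. The following Python is a minimal sketch under that assumption; parse_status and find_uninterruptible are names made up for illustration.

```python
import os


def parse_status(status_text):
    """Extract (name, state_letter) from the text of /proc/<pid>/status."""
    name = None
    state = None
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "Name":
            name = value
        elif key == "State" and value:
            # "D (disk sleep)" -> "D"
            state = value.split()[0]
    return name, state


def find_uninterruptible(proc_root="/proc"):
    """Return sorted [(pid, name)] for processes whose state letter is D."""
    blocked = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return blocked  # no procfs at this path
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-pid entries such as "meminfo"
        status_path = os.path.join(proc_root, entry, "status")
        try:
            # Assumption: only "status" is read, never "cmdline".
            with open(status_path, errors="replace") as handle:
                name, state = parse_status(handle.read())
        except OSError:
            continue  # the process exited in the meantime
        if state == "D":
            blocked.append((int(entry), name))
    return sorted(blocked)
```

Calling find_uninterruptible() on the affected machine would be expected to include the blocked pid, provided its status file stays readable.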

Mikel

Reputation: 7 890

I guess rebooting should repair this, but I don't want to do it, since I want to be sure what is the cause of this :P – None – 2011-02-26T02:12:04.860

What does mount say? Note there's a chance that might hang too (I don't think it should, but I'm not 100% sure). – Mikel – 2011-02-26T02:13:22.743

Sigh. Closed-question handling is pretty poor; this is the third time I've tried typing this, so please forgive the terseness.

First, use intr NFS mounts. The default hard NFS mounts hang forever. soft NFS mounts error out after a timeout (which might be stupid for transient errors). intr lets you decide to interrupt a hung NFS operation. Just right.
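
To check which of these modes a mount is actually using, the option list in /proc/mounts can be parsed. This is a minimal Python sketch; nfs_mount_modes is a hypothetical helper, and it assumes the kernel reports hard, soft and intr as plain keywords in the options column.

```python
def nfs_mount_modes(mounts_text):
    """Return [(mount_point, 'hard' or 'soft', has_intr)] for NFS entries.

    mounts_text is the content of /proc/mounts: one mount per line, with
    whitespace-separated fields (device, mount point, type, options, ...).
    """
    results = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed or empty line
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype not in ("nfs", "nfs4"):
            continue
        opts = set(options.split(","))
        # hard is the default, so anything not marked soft counts as hard
        mode = "soft" if "soft" in opts else "hard"
        results.append((mount_point, mode, "intr" in opts))
    return results
```

On a live system the input would come from open("/proc/mounts").read(), which avoids calling mount itself.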

Second, to fix this stupid problem, I've used a stupid trick before; it probably still works. Bring up an interface alias on lo with the NFS server's IP address (edit: ifconfig eth0:0 <ipaddress>). Create an /etc/exports file that contains a line to export the filesystem that you're hung on (edit: export a filesystem with the same name as the 'hung' filesystem; you'll have to create the same pathname as what you've mounted). Start your NFS server on your local machine, and hopefully your hung program can error out with "file not found" or "directory not found" or something like that, letting you get on with your work without rebooting.

Don't forget to turn off your NFS server again and remove the interface alias when you're done.

sarnold

Reputation: 2 988

I use intr NFS mounts ... where did you see otherwise? – user69239 – 2011-02-26T02:40:20.953

About the "second paragraph": I don't understand it very well ... sorry! :( – user69239 – 2011-02-26T02:42:02.340

@user69239 you didn't give any details, so I assumed you were using the default hard mount type. :) – sarnold – 2011-02-26T02:43:23.393

@user69239, I've slightly expanded the paragraph with the silly trick -- I hope it is easier to understand now. :) – sarnold – 2011-02-26T02:49:03.127

I'm not sure why the focus on NFS? Is the asker running NFS? Didn't see anything about that.

Anyway, this is a very strange problem since it's /proc. Try the following things to give yourself more info about the problem:

Go into /proc and find other pid directories and try reading the cmdline files from those directories.
Try reading /proc/pid/stat as well. If that doesn't work, I'd say your system is having kernel issues.
Are you able to run netstat -n? This reads from different parts of /proc so it might work and would indicate less of a problem with the proc interface.
Try remounting /proc with mount -o remount /proc, although I have no idea what this would do in this situation.
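
The read tests above can be run from a child process with a timeout, so that a hung read does not take your shell with it. A minimal Python sketch; probe_read is a hypothetical helper, and a child that ends up in D state may linger after the kill until the underlying I/O completes.

```python
import subprocess
import sys


def probe_read(path, timeout=2.0):
    """Try to read `path` in a child process.

    Returns "ok" if the read finished, "error" if the child failed
    (for example, the file does not exist), or "hung" if it did not
    finish within `timeout` seconds.
    """
    reader = "import sys; open(sys.argv[1], 'rb').read()"
    child = subprocess.Popen(
        [sys.executable, "-c", reader, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Do not wait again: a child blocked in the kernel may never exit.
        child.kill()
        return "hung"
    return "ok" if returncode == 0 else "error"
```

For example, probe_read("/proc/1/cmdline") and probe_read("/proc/1/stat") can be compared against the same calls for the pid that ps stalls on.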

I would just suggest rebooting. If you can't read stuff from proc, I'm not sure what you're going to find through other methods. If it happens again, then start worrying.

deltaray

Reputation: 1 665

Yes, I'm using NFS ... the problem is copying a file through an S3 FUSE filesystem – user69239 – 2011-02-26T07:43:48.577

fuse and nfs are different things. I think you need to provide all the details you can. You provided a lot with an strace, but knowing more about your setup and how this all started is a necessary first step. – deltaray – 2011-02-26T14:40:11.690

The problem appeared while running a massive "cp" command from my host (a small Amazon instance) to S3, connected with s3fs 1.40 with just the allow_other option. My Ubuntu is a normal base 10.10 installation with the typical upgrades. Nothing else. Really :P – user69239 – 2011-02-26T16:02:04.933