IO-intensive processes hang with iowait, but no activity going on

Question

I have a bunch of IO-intensive jobs, and to boost performance I just installed two SSDs in a compute server, one as a scratch file system and one as swap. After running for some time, all my processes hang in "D" state and consume no CPU, while the system reports 67% idle and 33% wait. iostat shows no disk activity, and the system is otherwise responsive, including the relevant file systems. Attaching strace to the processes produces no output.

Looking in /proc/(pid)/fd, I discovered that all the processes are reading one common file. I can't see any reason why this should cause a problem, but I replaced the file, killed the processes, and let everything continue (i.e. new processes will be launched). We'll see if things get stuck on the new file, on a different file, or - ideally - not at all :-)
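The manual check above (find the processes stuck in "D" state, then compare what they have open under /proc/(pid)/fd) could be scripted roughly as follows. This is only a sketch, not what was actually run on the server: the function names are made up for illustration, it assumes the Linux /proc layout, and reading another user's fd directory normally requires root.

```python
import os
from collections import Counter


def proc_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so take the first field after the LAST closing parenthesis.
    return stat_text[stat_text.rfind(")") + 1:].split()[0]


def blocked_pids(proc_root="/proc"):
    """Return the sorted PIDs currently in uninterruptible sleep ("D")."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                if proc_state(f.read()) == "D":
                    pids.append(int(name))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


def open_files(pid, proc_root="/proc"):
    """Return the set of targets of the symlinks in /proc/<pid>/fd."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = set()
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        return targets  # no permission, or the process is gone
    for fd in entries:
        try:
            targets.add(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            continue  # descriptor was closed in the meantime
    return targets


def shared_files(pids, proc_root="/proc"):
    """Return the sorted paths that every process in `pids` has open."""
    if not pids:
        return []
    counts = Counter()
    for pid in pids:
        counts.update(open_files(pid, proc_root))
    return sorted(path for path, n in counts.items() if n == len(pids))
```

Run as root, `shared_files(blocked_pids())` would list the files that all hung processes have in common. The `proc_root` parameter is there only so the functions can be pointed at a copy or a test directory instead of the live /proc.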

I also found a couple of these in kern.log:

BUG: unable to handle kernel paging request at ffffeb8800096e5c

There is a lot of other information, but I don't know how to decipher it, except that it refers to the PID and name of one of my processes.

Any idea what is going on here, or how to fix it? This is on Ubuntu 12.04 LTS, Dell-something box with a RocketRaid disk controller and btrfs file system.

Let's eliminate filesystem bugs from the equation. Can you try it with ext3 or ext4 instead of btrfs? — ewwhite, May 20 '12 at 13:43
Yes. Since I need the results, and organizing the data is a largish operation, I'll try to get through this run, but then I can reformat the scratch disk, and see how ext4 compares. — Ketil, May 20 '12 at 15:31
This really looks like a kernel bug - it could be a filesystem, block device driver, or controller firmware bug. — DukeLion, May 20 '12 at 16:51

score 0 · Answer 1 · answered May 20 '12 at 18:53


This seems like it could be a memory problem. Boot memtest and check your RAM.


user9517

