I have a bunch of IO-intensive jobs, and to boost performance, I just installed two SSDs in a compute server, one as a scratch file system and one as swap. After running for some time, all my processes hang in "D" state and consume no CPU, and the system reports 67% idle and 33% wait. iostat shows no disk activity, and the system is otherwise responsive, including the relevant file systems. Attaching 'strace' to the processes produces no output.

Looking in /proc/(pid)/fd, I discovered that all the processes are using (reading) one common file. I can't see any reason why this should cause a problem, but I replaced the file, killed the processes, and let everything continue (i.e. new processes will be launched). We'll see if things get stuck on the new file, on a different file, or - ideally - not at all :-)
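
For reference, the /proc check described above can be sketched roughly as follows. This is only a minimal sketch, assuming a Linux /proc layout; the function names (`parse_state`, `blocked_pids`, `open_files`, `common_files`) are made up for illustration, and reading another user's /proc/(pid)/fd typically requires root.

```python
import os
from collections import Counter


def parse_state(stat_text):
    """Return the state letter from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    return stat_text.rsplit(")", 1)[1].split()[0]


def blocked_pids(proc="/proc"):
    """List PIDs currently in uninterruptible sleep ("D" state)."""
    pids = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                if parse_state(f.read()) == "D":
                    pids.append(int(name))
        except (OSError, IndexError):
            # Process exited or is unreadable; skip it.
            continue
    return pids


def open_files(pid, proc="/proc"):
    """Return the set of paths that the process's file descriptors point at."""
    fd_dir = os.path.join(proc, str(pid), "fd")
    targets = set()
    try:
        for fd in os.listdir(fd_dir):
            try:
                targets.add(os.readlink(os.path.join(fd_dir, fd)))
            except OSError:
                continue
    except OSError:
        pass  # no permission, or the process is gone
    return targets


def common_files(pids, proc="/proc"):
    """Return (path, count) pairs for files open in more than one of the PIDs."""
    counts = Counter()
    for pid in pids:
        counts.update(open_files(pid, proc))
    return [(path, n) for path, n in counts.most_common() if n > 1]


if __name__ == "__main__":
    stuck = blocked_pids()
    print("PIDs in D state:", stuck)
    for path, n in common_files(stuck):
        print(n, path)
```

A file that shows up with a count equal to the number of stuck processes is the kind of "common file" mentioned above, though that alone does not show it is the cause.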

I also found a couple of these in kern.log:

BUG: unable to handle kernel paging request at ffffeb8800096e5c

Lots of other information, but I don't know how to decipher it - except that it refers to the PID and name of one of my processes.
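
To pull those reports out of the log together with the lines that follow them, something like the sketch below might help. It assumes the log lives at /var/log/kern.log (the usual Ubuntu location, but an assumption here), and the helper name `find_paging_bugs` and the 20-line context window are arbitrary choices for illustration.

```python
import re

# Matches the message quoted above and captures the faulting address.
PATTERN = re.compile(
    r"BUG: unable to handle kernel paging request at ([0-9a-f]+)"
)


def find_paging_bugs(lines, context=20):
    """Return (address, following_lines) for each matching report.

    `context` is how many lines after the BUG line to keep, so the
    register dump and call trace that follow it stay together.
    """
    lines = list(lines)
    reports = []
    for i, line in enumerate(lines):
        match = PATTERN.search(line)
        if match:
            reports.append((match.group(1), lines[i:i + 1 + context]))
    return reports


if __name__ == "__main__":
    # Assumed path; adjust for your system.
    with open("/var/log/kern.log", errors="replace") as log:
        for address, block in find_paging_bugs(log):
            print("=== paging request at", address, "===")
            print("".join(block))
```

Posting the full block for one report (including the call trace) usually gives others more to go on than the first line alone.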

Any idea what is going on here, or how to fix it? This is on Ubuntu 12.04 LTS, Dell-something box with a RocketRaid disk controller and btrfs file system.

Ketil
  • Let's eliminate filesystem bugs from the equation. Can you try it with ext3 or ext4 instead of btrfs? – ewwhite May 20 '12 at 13:43
  • Yes. Since I need the results, and organizing the data is a largish operation, I'll try to get through this run, but then I can reformat the scratch disk, and see how ext4 compares. – Ketil May 20 '12 at 15:31
  • This really looks like a kernel bug - it could be a filesystem, block device driver, or controller firmware bug. – DukeLion May 20 '12 at 16:51

1 Answer

This seems like it could be a memory problem. Boot memtest and check your RAM.

user9517