
I'm having some issues with Java processes and NRPE checks. We have processes that sometimes use 1000% CPU on a 32-core system. The system is pretty responsive until you run a

ps aux 

or try to do anything under /proc/<pid>, like

[root@flume07.domain.com /proc/18679]# ls
it hangs.

An strace of ps aux shows:

stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/dev/pts1", 0x7fffb8526f00)       = -1 ENOENT (No such file or directory)
stat("/dev/pts", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
readlink("/proc/15693/fd/2", "/dev/pts/1", 127) = 10
stat("/dev/pts/1", {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
write(1, "root     15693 15692  0 06:25 pt"..., 55root     15693 15692  0 06:25 pts/1    00:00:00 ps -Af
) = 55
stat("/proc/18679", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/18679/stat", O_RDONLY)      = 5
read(5, "18679 (java) S 1 18662 3738 3481"..., 1023) = 264
close(5)                                = 0
open("/proc/18679/status", O_RDONLY)    = 5
read(5, "Name:\tjava\nState:\tS (sleeping)\nT"..., 1023) = 889
close(5)                                = 0
open("/proc/18679/cmdline", O_RDONLY)   = 5
read(5,

The Java process is working and completes just fine, but the issue is that it makes our monitoring go nuts: it thinks processes are down because it times out waiting for ps aux to complete.

I've tried doing something like

 nice -19 ionice -c1 /usr/lib64/nagios/plugins/check_procs -w 1:1 -c 1:1 -a 'diamond' -u root -t 30

with no luck
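One workaround idea (a minimal sketch, not a drop-in replacement for check_procs): the strace above shows /proc/<pid>/stat and status still respond while /proc/<pid>/cmdline is what stalls, so a check that matches only on the comm field from stat can count processes without ever touching cmdline. The function name here is made up for illustration:

```shell
# Hypothetical sketch: count processes by command name using only
# /proc/PID/stat (which still responds in the strace above) instead of
# /proc/PID/cmdline (the read that stalls).
count_procs_by_comm() {
    name="$1"
    count=0
    for stat in /proc/[0-9]*/stat; do
        # Field 2 of /proc/PID/stat is the comm in parentheses, e.g. "(java)".
        # Note: a comm containing spaces would break this naive split.
        comm=$(awk '{print $2}' "$stat" 2>/dev/null)
        [ "$comm" = "($name)" ] && count=$((count + 1))
    done
    echo "$count"
}
```

A Nagios wrapper would then compare the count against the warn/crit thresholds instead of calling check_procs with -a, which forces a cmdline read.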

EDIT

System specs

  • 32-core Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  • 128 GB of RAM
  • 12 × 4 TB 7200 RPM drives
  • CentOS 6.5
  • I'm not sure of the model, but the vendor is SuperMicro

The 1-minute load when this happens is around 90-160.

The odd part is that I can go into any other /proc/<pid> and it works just fine. And the system is responsive when I ssh in: when we get alerted about high load, I can ssh right in with no problem.

Another edit

I've been using the deadline scheduler:

[root@dn07.domain.com ~]# for i in {a..m}; do cat /sys/block/sd${i}/queue/scheduler; done
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq

The mount output looks like:

[root@dn07.manage.com ~]# mount
/dev/sda3 on / type ext4 (rw,noatime,barrier=0)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext2 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/dev/sdb1 on /disk1 type xfs (rw,nobarrier)
/dev/sdc1 on /disk2 type xfs (rw,nobarrier)
/dev/sdd1 on /disk3 type xfs (rw,nobarrier)
/dev/sde1 on /disk4 type xfs (rw,nobarrier)
/dev/sdf1 on /disk5 type xfs (rw,nobarrier)
/dev/sdg1 on /disk6 type xfs (rw,nobarrier)
/dev/sdh1 on /disk7 type xfs (rw,nobarrier)
/dev/sdi1 on /disk8 type xfs (rw,nobarrier)
/dev/sdj1 on /disk9 type xfs (rw,nobarrier)
/dev/sdk1 on /disk10 type xfs (rw,nobarrier)
/dev/sdl1 on /disk11 type xfs (rw,nobarrier)
/dev/sdm1 on /disk12 type xfs (rw,nobarrier)

OK, I installed tuned and set it to the throughput-performance profile.

[root@dn07.domain.com ~]# tuned-adm profile throughput-performance
Switching to profile 'throughput-performance'
Applying deadline elevator: sda sdb sdc sdd sde sdf sdg sdh[  OK  ] sdk sdl sdm
Applying ktune sysctl settings:
/etc/ktune.d/tunedadm.conf:                                [  OK  ]
Calling '/etc/ktune.d/tunedadm.sh start':                  [  OK  ]
Applying sysctl settings from /etc/sysctl.d/99-chef-attributes.conf
Applying sysctl settings from /etc/sysctl.conf
Starting tuned:                                            [  OK  ]
ewwhite
Mike
  • Can you provide information on the server environment? The OS distribution and version, hardware platform would be relevant. – ewwhite Oct 30 '14 at 15:27
  • Your system load at the point when this happens is also important. – ewwhite Oct 30 '14 at 15:28
  • I made some edits with specs and what the load is – Mike Oct 30 '14 at 15:35
  • What does the output of `mount` look like? – ewwhite Oct 30 '14 at 15:48
  • Very good. Consider using the `tuned-adm profile enterprise-storage` command to handle the nobarrier and deadline switch. What does `dmesg|tail` output show? Are you seeing I/O timeouts? – ewwhite Oct 30 '14 at 16:10
  • That's the strange part: I don't see timeouts anywhere and the system is still pretty responsive. It's just that anything accessing the process tree will hang. – Mike Oct 30 '14 at 16:12
  • Is `auditd` running? – ewwhite Oct 30 '14 at 16:13
  • auditd is not running. It was running everywhere when I started at the company, though, and was a giant issue they were running into at the time. – Mike Oct 30 '14 at 16:14
  • This question is being [discussed on meta](http://meta.serverfault.com/q/6635/126632). You may wish to participate. – Michael Hampton Nov 04 '14 at 14:26

4 Answers

9

In general, I've seen this happen because of a stalled read, and your strace output confirms it: the attempt to read the /proc/xxxx/cmdline file hangs while ps aux is running.

The momentary spikes in I/O are starving the system's resources. A load of 90-160 is extremely bad news if it's related to the storage subsystem.

For the storage array, can you tell us if there's a hardware RAID controller in place? Is the primary application on the server write-biased? The disks you mention (12 × 4TB) are lower-speed nearline SAS or SATA disks. If there's no form of write caching in front of the drive array, writes can push the system load way up. If these are pure SATA drives on a Supermicro backplane, don't discount the possibility of other disk problems (timeouts, a failing drive, the backplane, etc.). Does this happen on all of the Hadoop nodes?

An easy test is to run iotop while this is happening. Also, since this is EL6.5, do you have any of the tuned-adm profiles enabled? Are write barriers enabled?

If you haven't changed the server's I/O elevator, ionice may have an impact. If you've changed it to anything other than CFQ (this server should probably be on deadline), ionice won't make any difference.

Edit:

One other weird thing I've seen in production environments. These are Java processes, and I'll assume they're heavily multithreaded. How are you doing on PIDs? What's the sysctl value for kernel.pid_max? I've had situations where I've exhausted PIDs before and had a resulting high load.
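A quick way to check the PID headroom (a sketch; thread counting here is done by globbing the per-task /proc directories, which is an approximation since tasks come and go while the glob runs):

```shell
# Compare the kernel's PID ceiling with the number of task entries
# (processes + threads) currently alive. A heavily threaded JVM burns
# one PID per thread, so these two numbers can converge.
pid_max=$(cat /proc/sys/kernel/pid_max)
tasks=$(ls -d /proc/[0-9]*/task/[0-9]* 2>/dev/null | wc -l)
echo "tasks in use: $tasks of pid_max: $pid_max"
```

If the two are anywhere near each other, raising kernel.pid_max via sysctl is the usual remedy.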

Also, you mention kernel version 2.6.32-358.23.2.el6.x86_64. That's over a year old and part of the CentOS 6.4 release, while the rest of your server is 6.5. Did you blacklist kernel updates in yum.conf? You should probably be on kernel 2.6.32-431.x.x or newer for that system. There could be a hugepages issue with the older kernel you have. If you can't change the kernel, try disabling them with:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
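Note that the echo above does not survive a reboot. One common way to persist it on EL6 (an example fragment, assuming the stock RHEL/CentOS 6 sysfs path) is to add the setting to /etc/rc.local:

```shell
# /etc/rc.local fragment (EL6): re-apply the THP settings at boot.
# redhat_transparent_hugepage is the RHEL/CentOS 6 path; upstream
# kernels expose /sys/kernel/mm/transparent_hugepage instead.
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
```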

ewwhite
  • there's a RAID card, but it's just used for handling the 12 drives in the server. It's part of a Hadoop cluster, so it does a lot of writing, but these lockups also happen when YARN is pulling a lot of data for a MapReduce job. – Mike Oct 30 '14 at 15:44
  • I'm getting the datacenter to call me to see if they know what the raid controller is set to for write cache. As for card its a `3a0613065fa Adaptec \ 71605 \ SATA/SAS RAID ` I verified they are SATA drives also `Western Digital WD RE WD4000FYYZ` – Mike Oct 30 '14 at 16:02
  • Make sure the disks are actually okay. If this only occurs on this particular server, see if others have the problem. What is the kernel version? – ewwhite Oct 30 '14 at 16:17
  • we had some servers with dead disks but it happens on servers that are fine. 2.6.32-358.23.2.el6.x86_64 – Mike Oct 30 '14 at 16:20
  • @Mike I'd suggest a kernel update and repeat. The one you're running belongs to EL6.4, not EL6.5 – ewwhite Oct 30 '14 at 16:28
  • Last comment: Can you `cat /sys/kernel/mm/redhat_transparent_hugepage/enabled` ? – ewwhite Oct 30 '14 at 16:38
  • here's what I got `[always] madvise never` – Mike Oct 30 '14 at 16:44
  • @Mike If you can't make the kernel change, try: `echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled` on an affected machine. I'm assuming this is reproducible enough that you can observe a before/after with this setting. – ewwhite Oct 30 '14 at 16:46
  • looks like the tuned profile and disabling hugepages helped fix the problem! – Mike Oct 30 '14 at 16:51
  • @Mike Excellent. A kernel update may also provide some relief. But if you're stuck with the running kernel, I'm glad this fix works. – ewwhite Oct 30 '14 at 16:54
3

The problem is clearly not a disk-related one, and this is evident from the hung strace:

open("/proc/18679/cmdline", O_RDONLY)   = 5
read(5,

/proc is an interface between the kernel and userspace; it does not touch the disk at all. If something hangs while reading a command's arguments, it is usually a kernel-related problem, and unlikely to be a storage one. See @kasperd's comment.

The load is just a side effect of the problem, and the high number does not tell the full story. You can have a server with very high load on which the application behaves without a glitch.

You can get more information about what is happening with cat /proc/$PID/stack, where $PID is the ID of the process whose read stalls.
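For instance, a small helper (the name and shape are mine, not a standard tool) to sample a /proc file every second and watch where the blocked read is parked:

```shell
# sample_proc PID FILE N: print /proc/PID/FILE once a second, N times.
# To debug the stall above you would sample "stack" (readable by root
# only on most systems) and look for the kernel call chain the blocked
# read is sitting in.
sample_proc() {
    pid="$1"; file="$2"; n="$3"
    i=0
    while [ "$i" -lt "$n" ]; do
        cat "/proc/$pid/$file" 2>/dev/null
        i=$((i + 1))
        sleep 1
    done
}

# e.g., as root, against the stalled reader: sample_proc 15693 stack 5
```

If the same function appears at the top of the stack on every sample, that is the point in the kernel where the read is stuck.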

In your case I would start with a kernel upgrade.

Mircea Vutcovici
  • You are mistaken. What is returned by reading `/proc/%d/cmdline` is the part of the process's address space in which the kernel stored the command line during the `execve` call. Like any other part of user space, it may be swapped out. So accessing it may indeed have to wait for the page to be swapped in again. – kasperd Nov 05 '14 at 19:21
  • That is a very good argument, thank you for raising it. However, I think the chances of strace starting while your swap is not answering are low, but not impossible. I will update my answer. – Mircea Vutcovici Nov 05 '14 at 21:47
2

So even with all the tweaks and an upgrade to the latest 2.6 kernel that CentOS provides, we were still seeing the hangs. Not as many as before, but still seeing them.

The fix was to upgrade to the 3.10.x kernel series that CentOS provides in its centosplus repo, here:

http://mirror.centos.org/centos/6/xen4/x86_64/Packages/

This has done away with all of the process-tree hangs. Like I said, the system was never under load so crazy that starting new processes wasn't snappy, so it must be a 2.6 kernel issue somewhere.

Mike
0

This is another fix.

It looks like we are running the following RAID controller:

Adaptec 71605

I have been updating the firmware on all affected machines to the latest version, and it seems to be clearing up the problem.

We had to back out of the 3.10 kernel experiment due to other random issues with running 3.10 on CentOS 6, but the firmware upgrade seems to fix the issue.

Mike