
My Apache hangs frequently with multiple threads. Each process gets stuck for hours. The backtrace looks like this:

(gdb) backtrace
#0  0x00002af60c22b2d7 in semop () from /lib64/libc.so.6
#1  0x00002af60bbf612c in ?? () from /usr/lib64/libapr-1.so.0
#2  0x000055555559e614 in ?? () from /usr/sbin/httpd2-prefork
#3  0x000055555559e9ea in ?? () from /usr/sbin/httpd2-prefork
#4  0x000055555559f25d in ap_mpm_run () from /usr/sbin/httpd2-prefork
#5  0x000055555557a080 in main () from /usr/sbin/httpd2-prefork

With strace I can see that they are waiting on a pipe that connects all Apache processes.

strace -p 3069
....
read(7, 0x7fff16a04df7, 1)              = -1 EAGAIN (Resource temporarily unavailable)
semop(286162952, 0x2af60bd07dc0, 1 <unfinished ...>

What is Apache doing here?

How can I figure out what is causing this?

Update

Data as requested in comments

# ipcs -a

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x06347849 32768      root      666        65544      2
0x0c6629c9 21004289   root      640        1166952    2
0x3107040d 98306      root      666        131176     3
0x00000000 436994051  root      600        33554432   11         dest
0x01070756 191135748  root      664        4192       1
0x01070730 190349317  root      664        4192       1
0x01070736 190382086  root      664        4192       1
0x01070742 190414855  root      664        4192       1
0x01070746 190447624  root      664        4192       1
0x01070753 190545929  root      664        4192       1
0x0107075e 190611466  root      664        4192       1
0x01070750 191037451  root      664        4192       1
0x010706c8 21069838   root      664        4192       1
0x0107074d 191070223  root      664        4192       1

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x0107000d 0          root      666        1
0x0107000e 32769      root      666        1
0x3107040d 98306      root      666        5
0x72070097 243433475  root      666        2
0x00000000 977469444  wwwrun    600        1
0x4d028007 262149     root      600        8
0x00000000 450166790  wwwrun    600        1
0x0107073f 1209401351 root      664        1
0x00000000 977502216  wwwrun    600        1
0x00000000 1208451083 root      600        1
0x01070751 1208582156 root      664        1
0x01070758 1208647693 root      664        1
0x00000000 1208680462 root      600        1
0x01070749 1209237519 root      664        1
0x0107074e 1209270289 root      664        1
0x00000000 1209303058 root      600        1
0x00000000 1209335827 root      600        1
0x00000000 1209434132 root      600        1

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

and

# ps auxwww | grep "apache"
wwwrun    2708  0.0  0.5 201576 11972 ?        S    Nov11   0:05 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf -DSSL
wwwrun    3607  0.0  0.6 202472 13388 ?        S    Nov11   0:06 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf -DSSL
root      5798  0.0  0.7 200828 14800 ?        Ss   Nov08   0:00 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf -DSSL
wwwrun   12926  0.0  0.5 201712 11768 ?        S    08:19   0:00 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf -DSSL
wwwrun   13009  0.0  0.6 202196 13340 ?        S    02:19   0:05 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf -DSSL

There are a few more processes, but you get the picture.

Also, it is a SUSE server:

# cat /proc/version
Linux version 2.6.16.60-0.74.7-default (geeko@buildhost) (gcc version 4.1.2 20070115 (SUSE Linux)) #1 Fri Nov 26 09:16:10 UTC 2010

httpd.conf

# grep ^[^#] /etc/apache2/httpd.conf
Include /etc/apache2/uid.conf
Include /etc/apache2/server-tuning.conf
ErrorLog /var/log/apache2/error_log
Include /etc/apache2/sysconfig.d/loadmodule.conf
Include /etc/apache2/listen.conf
Include /etc/apache2/mod_log_config.conf
Include /etc/apache2/sysconfig.d/global.conf
Include /etc/apache2/mod_status.conf
Include /etc/apache2/mod_info.conf
Include /etc/apache2/mod_usertrack.conf
Include /etc/apache2/mod_autoindex-defaults.conf
TypesConfig /etc/apache2/mime.types
DefaultType text/plain
Include /etc/apache2/mod_mime-defaults.conf
Include /etc/apache2/errors.conf
Include /etc/apache2/ssl-global.conf
<Directory />
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
</Directory>
AccessFileName .htaccess
<Files ~ "^\.ht">
    Order allow,deny
    Deny from all
</Files>
DirectoryIndex index.html index.html.var
Include /etc/apache2/default-server.conf
Include /etc/apache2/sysconfig.d/include.conf
Include /etc/apache2/vhosts.d/*.conf

read(7, ...) points to a pipe:

# ls -la /proc/3069/fd/7
lr-x------ 1 root   root 64 Nov  7 17:24 7 -> pipe:[157329520]

It connects all Apache processes:

# lsof | grep 157329520
httpd2-pr  2430       root    7r     FIFO                0,5             157329520 pipe
httpd2-pr  2430       root    8w     FIFO                0,5             157329520 pipe
httpd2-pr  3061     wwwrun    7r     FIFO                0,5             157329520 pipe
httpd2-pr  3061     wwwrun    8w     FIFO                0,5             157329520 pipe
...

About the semaphore

# ipcs -s -i 39452680

Semaphore Array semid=39452680
uid=30   gid=8   cuid=0  cgid=0
mode=0600, access_perms=0600
nsems = 1
otime = Mon Nov 19 09:47:05 2012
ctime = Sun Nov 18 11:15:04 2012
semnum     value      ncount     zcount     pid
0          0          5          0          14678

The ncount always matches the number of idle workers from apache2ctl status, so I believe the whole semop is just normal idle-worker behavior and has nothing to do with my problem...
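A minimal sketch of how the ncount value could be pulled out of the ipcs output for that comparison. The helper name `sem_ncount` is hypothetical, and it assumes the column layout shown above (semnum, value, ncount, zcount, pid):

```shell
# Hypothetical helper: read `ipcs -s -i <semid>` output on stdin and print
# the ncount column (number of waiters) of the first semaphore in the array.
# Assumes the table header starts with "semnum", as in the output above.
sem_ncount() {
    awk '$1 == "semnum" { in_table = 1; next }
         in_table && NF >= 5 { print $3; exit }'
}

# Example usage (the semid is the one from this question):
#   ipcs -s -i 39452680 | sem_ncount
```

The printed number can then be compared by hand with the idle worker count reported by apache2ctl status.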

PiTheNumber
  • Since you use strace, I suppose your OS is Linux, and the names of the binaries suggest it's Red Hat or CentOS. Can you give us the relevant lines of ps auxwww and the output of ipcs -a, please? Also consider having a look at the server status. –  Nov 10 '12 at 10:00
  • Prefork MPM doesn't use threads. Maybe there's some kind of problem with the file system? – FINESEC Nov 10 '12 at 10:55
  • @FINESEC I am sorry, I meant process not thread. – PiTheNumber Nov 10 '12 at 12:13
  • @EricDANNIELOU I posted the requested data as an update. Thank you for your time! – PiTheNumber Nov 12 '12 at 07:46
  • Anything relevant in the Apache error log? Could we also have /etc/apache2/httpd.conf? –  Nov 12 '12 at 10:54
  • @EricDANNIELOU I added httpd.conf, error log shows a few `File does not exist` but nothing critical. – PiTheNumber Nov 12 '12 at 13:37
  • Can you also find out what file handle 7 is? e.g. in `read(7, 0x7fff16a04df7, 1)`. Either look at the strace output for an open() syscall or run `ls -l /proc/$PID/fd`. – Xiol Nov 12 '12 at 14:28
  • @Xiol It is a pipe. All processes are connected to it. I don't know what it's for. – PiTheNumber Nov 12 '12 at 14:38

2 Answers


I believe you're tripping over a little-known issue. It seems to be a bug in Linux, where the semaphore count is already 0, but processes wait as if it were not. I do not understand the mechanics of this bug, but it apparently happens only on loaded machines.

Run ipcs -s -i $SEM_ID where $SEM_ID is the first argument given to semop(). It should show the count to be 0, which would confirm the problem is in Linux, and not Apache. If the value is anything but 0, the problem would be in Apache's code.
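A minimal sketch of that check, assuming the strace output has the same shape as the lines in the question. The helper name `semid_from_strace` is hypothetical; it only extracts the first argument of the first semop() call it sees on stdin:

```shell
# Hypothetical helper: read strace output on stdin and print the semaphore ID,
# i.e. the first argument of the first semop() call found.
semid_from_strace() {
    sed -n 's/^.*semop(\([0-9][0-9]*\),.*$/\1/p' | head -n 1
}

# Example usage with the line from the question:
#   SEM_ID=$(echo 'semop(286162952, 0x2af60bd07dc0, 1 <unfinished ...>' | semid_from_strace)
#   ipcs -s -i "$SEM_ID"
```

The value column in the resulting ipcs output is the count to look at.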

It appears you haven't updated the kernel in about 2 years; there may have been a fix since then. Others have reported that the epoll path limit of 1000 prevents Apache from using a "max clients" setting higher than 1000.

Chris S
  • I made an update. The `ncount` of the semaphore matches the number of idle workers from `apache2ctl status`, so I believe the whole semop is just normal idle-worker behavior and has nothing to do with my problem... – PiTheNumber Nov 19 '12 at 08:53
  • @Chris: Could you give a link to the relevant bug report? – Alex Jun 19 '14 at 09:49
  • @Alex Sorry, I really don't remember where I looked it up. I assume since I didn't link to a bug report that I couldn't find one at the time. – Chris S Jun 19 '14 at 13:55

In case anyone else stumbles upon this thread:

We encountered an issue in production with OCSP stapling, in which all child processes hung in semop after the TCP connection was established but before the TLS handshake finished. Apparently the main server was waiting for an OCSP staple from a non-responding OCSP server. Clients may also keep hanging in the TLS handshake while waiting for their own verification.
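A rough sketch of how one might check from the outside whether a server returns a stapled OCSP response. It assumes the `openssl` command-line tool is installed; the helper name `check_stapling` and the host argument are placeholders, not part of our setup:

```shell
# Hypothetical helper: ask a server for its stapled OCSP response.
# Assumes the openssl CLI is available; host and port are placeholders.
check_stapling() {
    host=$1
    port=${2:-443}
    # -status requests the stapled OCSP response during the handshake;
    # /dev/null on stdin lets s_client exit once the handshake is over.
    openssl s_client -connect "$host:$port" -servername "$host" -status \
        </dev/null 2>/dev/null | grep -i 'OCSP response'
}

# Example usage:
#   check_stapling www.example.com
```

If the handshake itself stalls, as it did in our case, the command will hang as well, which is already a hint.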

Gerrit