I'm running Apache 2.2, preform MPM, on Debian Squeeze with mod_php (PHP 5.3), listening on several ports (with memcached running on the same machine). Over the past 6 weeks or so, I've been experiencing situations where all Apache connections are timing out/deadlocking in large groups. Restarting Apache will fix the problem, for somewhere between hours and days.
The PHP apps running on the server include a user-facing Web UI, in addition to a Backend REST API running on a separate port on the same machine. The Web UI app makes between 1 and 3 sequential REST requests to the Backend API via http://127.0.0.1:8888/
on most pageloads. These timeout errors are first noticed in PHP's error log, as timeout errors of the REST Client after 30 seconds. Matching timeouts can also be seen in Apache's error log. They come in bursts, where all requests will timeout, then a batch of requests will process (seemingly) quickly and successfully, followed by another batch of timeouts. Over a time window of a few minutes, 8-25 seconds will be nothing but timeouts, followed by maybe a single second of requests being processed, followed by anoher 8-25 seconds of timeouts. If the "timeout window" before the "request burst" is shorter, the one following it will be longer. While this problem is presenting, a curl http://localhost/server-status
or a curl
request to a PHP script in either of the apps will either timeout and fail, or sometimes return between 25 and 60 seconds. During this period, CPU is available, RAM is available, IO appears normal (although this is one thing I'm not the best at monitoring), and the amount of sockets in SYN_RECV
state is much lower than both the system's somaxconn
and Apache's ListenBacklog
(but I do see sockets waiting in the backlog).
During my debugging and research, I've noticed many things that just aren't making sense. For one thing, in my above curl http://localhost/server-status
example, curl
connects and sends data successfully, but never gets a response. Even after a solid 45+ minutes, Apache doesn't seem to reset the connection or anything, and curl
just hangs and waits. For example:
~# curl -vv http://localhost/server-status
* About to connect() to localhost port 80 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /server-status HTTP/1.1
> User-Agent: curl/7.21.0 (x86_64-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.15 libssh2/1.2.6
> Host: localhost
> Accept: */*
>
If I restart Apache, things will return to normal.
Machine has 4GB total; 768M used by MemCache), 4 CPU cores, and the resident memory size for each Apache process varies between 20MB and 30MB. MaxClients
has been set to 100
, which, by my calculations and observations, runs within resource constraints. My first difficulty was reproducing the problem, as I've only seen it actually manifest around 9 times in 6 weeks. On a hunch, I lowered MaxClients
to 10
(eleminating resource constraints) to see if I could trigger the problem, and it successfully reproduced--the "bursty" pattern of timeouts immediately started occurring. The TCP backlog fills to around 10% of capacity, and attempting to make a request as above never returns (even though bursts of requests get through and complete; but my curl
requests never seem to make it through a burst). Timeout
is set to 500
, KeepAliveTimeout
is set to 5
. sys.fs.file-max
/sys.fs.file-nr
shows <1% utilization. nofile
ulimit is well above FDs used per process. Feeling stuck and lacking visibility, I started strace
ing my Apache processes. After careful analysis, nothing really stood out, no errors, no leaking FDs from what I could tell. However, during the "timeout periods", I do see quite a bit of:
[17284] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
In cadeit may be helpful, here's a couple "cycles" of refined strace
output across all running processes:
ps auxw | grep sbin/apache | awk '{print"-p " $2}' | xargs strace -e poll,select,connect,recvfrom,sendto,flock
generates:
[17293] flock(46, LOCK_EX <unfinished ...>
[17292] flock(46, LOCK_EX <unfinished ...>
[17291] flock(46, LOCK_EX <unfinished ...>
[17290] flock(46, LOCK_EX <unfinished ...>
[17288] flock(46, LOCK_EX <unfinished ...>
[17285] flock(46, LOCK_EX <unfinished ...>
[17283] flock(46, LOCK_EX <unfinished ...>
[17276] select(0, NULL, NULL, NULL, {0, 618967} <unfinished ...>
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 964 <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] --- SIGCHLD (Child exited) @ 0 (0) ---
[17289] sendto(47, "\24\0...", 20, ...) = 20
[17289] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] poll([{fd=47, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17289] select(48, [47], [47], NULL, {1, 0}) = 1 (out [47], left {0, 999997})
[17289] sendto(47, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17289] select(48, [47], [], NULL, {1, 0}) = 1 (in [47], left {0, 999997})
[17289] recvfrom(47, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17289] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17289] poll([{fd=47, events=POLLOUT}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17289] sendto(47, "GET /<URL>"..., 390, MSG_NOSIGNAL, NULL, 0) = 390
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 963 <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17287] <... poll resumed> ) = 0 (Timeout)
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17287] --- SIGCHLD (Child exited) @ 0 (0) ---
[17292] <... flock resumed> ) = 0
[17292] sendto(47, "\24\0...", 20, ...) = 20
[17292] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=47, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17292] select(48, [47], [47], NULL, {1, 0}) = 1 (out [47], left {0, 999997})
[17292] sendto(47, "<MC>"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17292] select(48, [47], [], NULL, {1, 0}) = 1 (in [47], left {0, 999997})
[17292] recvfrom(47, "<MC>"..., 32768, 0, NULL, NULL) = 22479
[17285] <... flock resumed> ) = 0
[17285] sendto(47, "\24\0...", 20, ...) = 20
[17285] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17285] poll([{fd=47, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17285] select(48, [47], [47], NULL, {1, 0}) = 1 (out [47], left {0, 999997})
[17285] sendto(47, "<MC>"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17285] select(48, [47], [], NULL, {1, 0}) = 1 (in [47], left {0, 999997})
[17285] recvfrom(47, "<MC>"..., 32768, 0, NULL, NULL) = 22479
[17285] sendto(47, "\24\0...", 20, ...) = 20
[17285] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17285] poll([{fd=47, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17285] select(48, [47], [47], NULL, {1, 0}) = 1 (out [47], left {0, 999997})
[17285] sendto(47, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17285] select(48, [47], [], NULL, {1, 0}) = 1 (in [47], left {0, 999997})
[17285] recvfrom(47, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17287] poll([{fd=44, events=POLLIN}], 1, 5000 <unfinished ...>
[17292] poll([{fd=44, events=POLLIN}], 1, 5000 <unfinished ...>
[17287] <... poll resumed> ) = 1 ([{fd=44, revents=POLLIN}])
[17285] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17285] poll([{fd=47, events=POLLOUT}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17285] sendto(47, "GET /<URL>"..., 380, MSG_NOSIGNAL, NULL, 0) = 380
[17285] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17285] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17285] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17292] <... poll resumed> ) = 1 ([{fd=44, revents=POLLIN}])
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999996})
[17292] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17292] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17292] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17292] sendto(46, "<MC>"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17292] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17292] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 22479
[17287] sendto(46, "\24\0...", 20, ...) = 20
[17287] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17287] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17287] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17287] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17287] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "\24\0...", 20, ... <unfinished ...>
[17292] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[17287] <... sendto resumed> ) = 20
[17292] <... sendto resumed> ) = 35
[17292] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17292] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17287] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17287] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17287] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "<MC>"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17287] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999996})
[17287] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 22479
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17292] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17292] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17292] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17292] flock(46, LOCK_EX <unfinished ...>
[17287] sendto(46, "\24\0...", 20, ...) = 20
[17287] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17287] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17287] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17287] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17287] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17287] sendto(46, "\24\0...", 20, ...) = 20
[17287] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17287] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17287] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17287] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17287] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17287] flock(46, LOCK_EX <unfinished ...>
[17289] <... poll resumed> ) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17276] <... select resumed> ) = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> ) = 0 (Timeout)
[17284] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
...
Not sure if the poll()
timeouts have anything to do with it.
To further generate the traffic pattern, and point out a strange coorelation with SIGCHLD
notices, here's strace
with just connect()
, recvfrom()
. I hit enter a few times after seeing no output for 5s (though most came in a consistent burst (though there were sometimes a couple of secs inbetween the GET /...
and the rest)) for clarity.
[17289] sendto(47, "GET /[BACKEND CALL]"..., 390, MSG_NOSIGNAL, NULL, 0) = 390
[17283] --- SIGCHLD (Child exited) @ 0 (0) ---
[17292] sendto(47, "\24\0...", 20, ...) = 20
[17292] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(47, "[MC]"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17292] recvfrom(47, "[MC]"..., 32768, 0, NULL, NULL) = 22479
[17288] sendto(47, "\24\0...", 20, ...) = 20
[17288] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(47, "[MC]"..., 37, ...) = 37
[17288] recvfrom(47, "[MC]"..., ...) = 22479
[17288] sendto(47, "\24\0...", 20, ...) = 20
[17288] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(47, "[MC]"..., 35, ...) = 35
[17288] recvfrom(47, "[MC]"..., ...) = 7280
[17288] sendto(47, "\24\0...", 20, ...) = 20
[17288] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17288] sendto(47, "GET /[BACKEND CALL]"..., 380, ...) = 380
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "[MC]"..., 35, ...) = 35
[17292] recvfrom(46, "[MC]"..., ...) = 7280
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "[MC]"..., 37, ...) = 37
[17292] recvfrom(46, "[MC]"..., ...) = 22479
[17283] sendto(46, "\24\0...", 20, ...) = 20
[17283] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17283] sendto(46, "[MC]"..., 35, ...) = 35
[17283] recvfrom(46, "[MC]"..., ...) = 7280
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "[MC]"..., 35, ...) = 35
[17292] recvfrom(46, "[MC]"..., ...) = 7280
[17283] sendto(46, "\24\0...", 20, ...) = 20
[17283] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17283] sendto(46, "[MC]"..., 37, ...) = 37
[17283] recvfrom(46, "[MC]"..., ...) = 22479
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "[MC]"..., 35, ...) = 35
[17292] recvfrom(46, "[MC]"..., ...) = 7280
[17283] sendto(46, "\24\0...", 20, ...) = 20
[17283] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17283] sendto(46, "[MC]"..., 35, ...) = 35
[17283] recvfrom(46, "[MC]"..., ...) = 7280
[17283] sendto(46, "\24\0...", 20, ...) = 20
[17283] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17283] sendto(46, "[MC]"..., 35, ...) = 35
[17283] recvfrom(46, "[MC]"..., ...) = 7280
<Long Wait>
[17284] --- SIGCHLD (Child exited) @ 0 (0) ---
[17284] sendto(46, "\24\0...", 20, ...) = 20
[17284] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(46, "[MC]"..., 35, ...) = 35
[17284] recvfrom(46, "[MC]"..., ...) = 7280
[17284] sendto(46, "\24\0...", 20, ...) = 20
[17284] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(46, "[MC]"..., 37, ...) = 37
[17284] recvfrom(46, "[MC]"..., ...) = 22479
[17284] sendto(46, "\24\0...", 20, ...) = 20
[17284] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(46, "[MC]"..., 35, ...) = 35
[17284] recvfrom(46, "[MC]"..., ...) = 7280
[17284] sendto(46, "\24\0...", 20, ...) = 20
[17284] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(46, "[MC]"..., 35, ...) = 35
[17284] recvfrom(46, "[MC]"..., ...) = 7280
[17284] sendto(47, "\24\0...", 20, ...) = 20
[17284] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(47, "[MC]"..., 37, ...) = 37
[17284] recvfrom(47, "[MC]"..., ...) = 22479
[17284] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17284] sendto(47, "POST /[BACKEND CALL]"..., 412, ...) = 412
<Long Wait>
[17289] --- SIGCHLD (Child exited) @ 0 (0) ---
[17293] sendto(47, "\24\0...", 20, ...) = 20
[17293] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(47, "[MC]"..., 37, ...) = 37
[17293] recvfrom(47, "[MC]"..., ...) = 22479
[17290] sendto(47, "\24\0...", 20, ...) = 20
[17290] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17290] sendto(47, "[MC]"..., 37, ...) = 37
[17290] recvfrom(47, "[MC]"..., ...) = 22479
[17290] sendto(47, "\24\0...", 20, ...) = 20
[17290] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17290] sendto(47, "[MC]"..., 35, ...) = 35
[17290] recvfrom(47, "[MC]"..., ...) = 7280
[17290] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17290] sendto(47, "POST /[BACKEND CALL]"..., 427, ...) = 427
[17289] sendto(46, "\24\0...", 20, ...) = 20
[17289] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] sendto(46, "[MC]"..., 35, ...) = 35
[17289] recvfrom(46, "[MC]"..., ...) = 7280
[17289] sendto(46, "\24\0...", 20, ...) = 20
[17289] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] sendto(46, "[MC]"..., 37, ...) = 37
[17289] recvfrom(46, "[MC]"..., ...) = 22479
[17289] sendto(46, "\24\0...", 20, ...) = 20
[17289] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] sendto(46, "[MC]"..., 35, ...) = 35
[17289] recvfrom(46, "[MC]"..., ...) = 7280
[17289] sendto(46, "\24\0...", 20, ...) = 20
[17289] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] sendto(46, "[MC]"..., 35, ...) = 35
[17289] recvfrom(46, "[MC]"..., ...) = 7280
[17293] sendto(46, "\24\0...", 20, ...) = 20
[17293] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(46, "[MC]"..., 35, ...) = 35
[17293] recvfrom(46, "[MC]"..., ...) = 7280
[17293] sendto(46, "\24\0...", 20, ...) = 20
[17293] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(46, "[MC]"..., 37, ...) = 37
[17293] recvfrom(46, "[MC]"..., ...) = 22479
[17293] sendto(46, "\24\0...", 20, ...) = 20
[17293] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(46, "[MC]"..., 35, ...) = 35
[17293] recvfrom(46, "[MC]"..., ...) = 7280
[17293] sendto(46, "\24\0...", 20, ...) = 20
[17293] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(46, "[MC]"..., 35, ...) = 35
[17293] recvfrom(46, "[MC]"..., ...) = 7280
<Long Wait>
[17288] --- SIGCHLD (Child exited) @ 0 (0) ---
[17291] sendto(47, "\24\0...", 20, ...) = 20
[17291] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17291] sendto(47, "[MC]"..., 37, ...) = 37
[17291] recvfrom(47, "[MC]"..., ...) = 22479
[17291] sendto(47, "\24\0...", 20, ...) = 20
[17291] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17291] sendto(47, "[MC]"..., 35, ...) = 35
[17291] recvfrom(47, "[MC]"..., ...) = 7280
[17291] sendto(47, "\24\0...", 20, ...) = 20
[17291] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17291] sendto(47, "GET /[BACKEND CALL]"..., 380, ...) = 380
[17288] sendto(46, "\24\0...", 20, ...) = 20
[17288] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(46, "[MC]"..., 35, ...) = 35
[17288] recvfrom(46, "[MC]"..., ...) = 7280
[17288] sendto(46, "\24\0...", 20, ...) = 20
[17288] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(46, "[MC]"..., 37, ...) = 37
[17288] recvfrom(46, "[MC]"..., ...) = 22479
[17288] sendto(46, "\24\0...", 20, ...) = 20
[17288] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(46, "[MC]"..., 35, ...) = 35
[17288] recvfrom(46, "[MC]"..., ...) = 7280
[17288] sendto(46, "\24\0...", 20, ...) = 20
[17288] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(46, "[MC]"..., 35, ...) = 35
[17288] recvfrom(46, "[MC]"..., ...) = 7280
<Long Wait>
[17284] --- SIGCHLD (Child exited) @ 0 (0) ---
[17284] sendto(47, "\24\0...", 20, ...) = 20
[17284] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(47, "[MC]"..., 35, ...) = 35
[17284] recvfrom(47, "[MC]"..., ...) = 7280
[17284] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17284] sendto(47, "GET /[BACKEND CALL]"..., 347, ...) = 347
<Long Wait>
...
Note that not a single one of these requests are "Web UI requests". Some things hit the backend directly, so this doesn't necessarily mean a frontend request would never get through. I found this post which sounds very similar to the issue I'm experiencing, but the solution (not making HTTP calls from within the requests) doesn't really apply. When watching the above output for a long period of time, I'll only ever see one "Web UI" request for every 100+ "Backend" requests, so I don't get the impression that it's some deadlock waiting on a backend call to return, but with only 10 MaxClients
, it's possible that there is at least 1 waiting (unsure; can't see /server-status
when this problem is presenting).
I'm pretty stuck at this point; never seen anything like this before. There's probably not enough information above to answer this properly, so if anyone can think of any other output or tests that would aid in debugging this, let me know and I'll edit. Any guidance would be very appreciated.