Random CONNECTION_RESET on Apache 2.4, Debian 9

Question

My server has some strange behaviour and I just can't find the cause. I've been looking everywhere.

I will pay $200 worth of bitcoin to anyone who can figure this out.

The problem:

When requesting any resource from apache (page, image, css, js), it sometimes takes a very long time to respond. About half of the time, the connection gets reset. (on Chrome: net::ERR_CONNECTION_RESET) This happens rarely, randomly and is absolutely unpredictable. More confusingly, while the one request seems to hang, I can make additional requests that work perfectly.

About the server:

I'm running Apache 2.4 (mpm-prefork) with PHP 7.0 on Debian 9. Apache uses mod_rewrite and an SSL certificate from certbot. On some occasions, PHP invokes Inkscape to render SVGs to PNG.

The server load is very low (0.02) and nothing but apache runs on it.

Things checked:

checked all server logs. (syslog, apache log) - nothing
increased the apache mpm-prefork limits - nope
checked for possible DNS problems - nothing
I even moved to a completely new root server (on a different provider) - still the same

I went on and analyzed the TCP traffic with Wireshark, and there is some suspicious behavior. When the connection freezes, there are some TCP Out-of-Order, Retransmission and ACKed unseen segment packets, but I don't have the necessary low-level knowledge to tell what's going on.

Any hints would be greatly appreciated!

EDIT:

This is the mpm_prefork config:

<IfModule mpm_prefork_module>
    StartServers            10
    MinSpareServers         10
    MaxSpareServers         50
    MaxRequestWorkers       300
    MaxConnectionsPerChild  0
</IfModule>

EDIT EDIT:

I had luck and got a tcp sniffer running on both server and client when it happened once again. Here are the pcap files, cropped to the last ~30 seconds.

serverside.pcap

clientside.pcap

If anyone with the knowledge could take a quick look at it and tell me what's going on, I'd be thrilled.

EDIT EDIT EDIT:

I managed to make the error reproducible, at least with KeepAlive on. When a request is finished and the content is served, the TCP connection closes with a FIN-ACK after 5 seconds. When I make another request within 5-12 seconds after the FIN-ACK, the connection freezes.

With KeepAlive off, however, this no longer happens, though the error occurs even more often when loading multiple resources at the same time. But then it's not reproducible anymore.

Is your web server behind a reverse proxy? That could be causing the trouble. — Andrew Schulman, Jan 24 '18 at 16:14
looks like a networking issue. Even with a mpm_prefork a httpd server should never make you wait forever to serve, that would timeout, while a connection reset almost clearly indicates there is some networking related issue. So I am with your latest statement about suspicious behaviour, worth going that way. Check firewalling, iptables or the likes. — ezra-s, Jan 25 '18 at 13:52
Also if you really suspect there is an issue with httpd you should at least share your mpm settings configuration, although the description of the problem does not quite match a slow http server of any kind. — ezra-s, Jan 25 '18 at 13:53
@ezra-s Ok, I've added the mpm_prefork config just in case. The reason why I suspect it has something to do with httpd is because this problem never occurs using SSH or SFTP. The iptable is literally empty and all network configurations are as is, out of the box from debian. — Manuel Otto, Jan 25 '18 at 15:02
You should run `tcpdump` simultaneously on client and server and dump the packets to a file. Something like `tcpdump -pni eth0 'port 443' -s0 -Uw tcp-resets.pcap`. And once you have reproduced the problem start comparing the two packet captures in Wireshark. You need to find out if any of the packets are being dropped or corrupted, or if an intermediate device is sending ICMP errors or is injecting packets into your stream. — kasperd, Jan 28 '18 at 00:07
@kasperd Does tcpdump have to be running while the connection freezes or can I dump the log afterwards? The main problem is that it's rare, and about 90% of the time it works, especially when the requests are consecutive. I can't simply reload the page until it happens. It appears to happen mostly on new connections... But I will try it, thanks for the tip. — Manuel Otto, Jan 28 '18 at 18:57
@ManuelOtto If you want to be sure to see what happened you need to have `tcpdump` running while you reproduce the problem. That means `tcpdump` needs to be started before the connection freeze and remain running until at least a few seconds after the connection froze. You can leave `tcpdump` running for days and filter the captured data afterwards if you have enough disk space to store the captured packets. But of course filtering the data will be tricky if you don't know what the exact problem you are looking for is. — kasperd, Jan 28 '18 at 19:21
Maybe check ulimit settings, however you mention that there are no errors in Apache logs. — Schrute, Jan 29 '18 at 17:27
Activate dropwatch while you are at it: https://github.com/pavel-odintsov/drop_watch — HackSlash, Jan 29 '18 at 21:20
Is there a hardware firewall in the path between your client and server? — c3st7n, Jan 31 '18 at 14:14
@c3st7n Not that I know of. The error occurred on my old server as well as the new one, which is hosted by a completely different provider, so I doubt it has something to do with hardware. — Manuel Otto, Jan 31 '18 at 14:36
@ManuelOtto Ah I see, I assumed both old and new servers were probably in the same network location. — c3st7n, Jan 31 '18 at 14:38
@kasperd I ran tcpdump and pretty much got the exact mirror of what I'm getting from running Wireshark on clientside. I noticed there are many 'TCP Dup Ack' packets, even when everything is working. Could that be a lead? Also some TLS 'Encrypted Alert'... I'm investigating on this one atm. — Manuel Otto, Jan 31 '18 at 14:40
@ManuelOtto Duplicate ACK packets can be triggered by packet loss in the data stream. Each data packet after a lost packet will trigger another ACK of the last packet before the loss. If you have packet captures from both ends of the connection you can see if that's what happened. — kasperd, Jan 31 '18 at 20:07
Try to test using worker instead of prefork. Also - check DNS. — Schrute, Feb 01 '18 at 23:28
Does the application use any external resources over an NFS mount or anything like that? — Schrute, Feb 05 '18 at 17:07
@ManuelOtto Do you have a capture that isn't limited to just the TCP traffic? If it is MTU related you may see ICMP packet too big messages. — c3st7n, Mar 21 '18 at 12:13

score 1 · Answer 1 · answered Feb 02 '18 at 20:13

I would check the size of the TCP packets going between the server and client. If they are close to 1500 bytes, they may be getting dropped, for one of several reasons:

If the DF (Don't Fragment) bit is set on a packet and some hop on the path would need to fragment it, the packet gets dropped instead.
If the MTU is set to 1500 and packets go through tunnels, encryption, etc. that add extra headers, the packets can exceed the path MTU and be dropped as well. Try setting the MTU on the interfaces you are using, on both ends, to something lower than 1500, such as 1420 or even lower.
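To make the "nearing 1500" figure concrete, here is a small Python sketch of the header arithmetic. It assumes plain IPv4 and TCP headers without options (20 bytes each) and an 8-byte ICMP echo header; the function names are made up for this illustration. The ping payload size is what you would pass to a Linux `ping -M do -s <size> <host>` test, which sets the DF bit.

```python
# Header sizes in bytes, assuming IPv4 and TCP without options.
IPV4_HEADER = 20
TCP_HEADER = 20
ICMP_ECHO_HEADER = 8


def max_tcp_payload(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_ECHO_HEADER


if __name__ == "__main__":
    for mtu in (1500, 1420):
        print(mtu, max_tcp_payload(mtu), max_ping_payload(mtu))
```

If a DF-flagged ping with the payload computed for MTU 1500 fails while a smaller one succeeds, that would point at a smaller path MTU somewhere between client and server.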

I'm a bit hazy on what you suggested here. I will try to figure it out though. In the meantime, I posted the tcpdumps for both client and server side, could you be so kind as to take a quick look at them? You really seem to know stuff. — Manuel Otto, Feb 02 '18 at 20:37

tanius · Answer 2 · 2019-01-27T01:22:58.537

Pretty sure I found the issue :-) as I just had the same thing happen to me.

1. Cause

I think you have TWO or more processes serving port 80 (or 443 if it's about SSL connections). You can check that as follows, here with the command for port 80 and the output from my system that had the problem:

# netstat -tupan | grep ":80.*LISTEN"

Proto Recv-Q Send-Q Local Address  Foreign Address  State   PID/Program name
tcp6       0      0 :::80          :::*             LISTEN  22718/apache2
tcp6       0      0 :::80          :::*             LISTEN  1794/apache2
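If you want to run this check from a script, for example from cron, the following Python sketch groups the LISTEN rows of `netstat -tupan` output by local address and reports every address that shows up more than once. The function name and the way the output is fed in are assumptions of this example; the sample rows are the ones from my system above.

```python
from collections import defaultdict
from typing import Dict, List

# The two data rows shown above, as printed by `netstat -tupan`.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address  Foreign Address  State   PID/Program name
tcp6       0      0 :::80          :::*             LISTEN  22718/apache2
tcp6       0      0 :::80          :::*             LISTEN  1794/apache2
"""


def duplicate_listeners(netstat_output: str) -> Dict[str, List[str]]:
    """Map each local address with more than one LISTEN row to its owners."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows: proto, recv-q, send-q, local, foreign, state, pid/program
        if len(fields) >= 7 and fields[5] == "LISTEN":
            owners[fields[3]].append(" ".join(fields[6:]))
    return {addr: procs for addr, procs in owners.items() if len(procs) > 1}


if __name__ == "__main__":
    print(duplicate_listeners(SAMPLE))
```

On a live system you would pass in the real output instead of `SAMPLE`, e.g. captured with `subprocess.run(["netstat", "-tupan"], capture_output=True, text=True).stdout`.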

Two processes serving the same IP address on the same port is indeed possible with the socket options SO_REUSEADDR and SO_REUSEPORT, see here and here (the section about "Linux >= 3.9").

What the kernel does with SO_REUSEPORT is to distribute incoming TCP connections to the processes serving that port, in a non-deterministic manner. One process is your Apache that serves the request properly, and one is "something else" that does not answer anything, ever. In my case, it was another Apache2 process.
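The port sharing itself can be seen in isolation with a few lines of Python. This is only a sketch under assumptions: Linux >= 3.9, loopback only, and both listeners living in one process for convenience (in the real problem they are separate processes). All function names are invented for the example. It opens two listening sockets on the same port and counts how many of a batch of connections each one received.

```python
import socket
from typing import Tuple


def reuseport_listener(port: int = 0) -> socket.socket:
    """Open a listening TCP socket on 127.0.0.1 with SO_REUSEPORT set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket that shares the port must set the option before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(64)
    return sock


def drain(listener: socket.socket) -> int:
    """Accept everything queued on a listener and return how many there were."""
    listener.settimeout(0.2)
    count = 0
    while True:
        try:
            conn, _ = listener.accept()
        except socket.timeout:
            return count
        conn.close()
        count += 1


def share_port(connections: int = 20) -> Tuple[int, Tuple[int, int]]:
    """Return the shared port and the number of connections per listener."""
    first = reuseport_listener()        # port 0: the kernel picks a free port
    port = first.getsockname()[1]
    second = reuseport_listener(port)   # same port, separate socket
    clients = [socket.create_connection(("127.0.0.1", port))
               for _ in range(connections)]
    counts = (drain(first), drain(second))
    for sock in clients + [first, second]:
        sock.close()
    return port, counts


if __name__ == "__main__":
    print(share_port())
```

The split between the two counts varies from run to run, which matches the "sometimes it works, sometimes it hangs" symptom when one of the listeners never answers.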

2. Solution

If you have two Apache processes, first find out which of them is the "zombie". For that, stop your regular Apache server (service apache2 stop) and check which one remains (netstat -tupan | grep ":80.*LISTEN"). That's the "zombie". Note its PID.
To find out more about who or what started this "zombie" process:
- Execute cat /proc/<pid>/loginuid with the PID of that "zombie" process. If it shows 4294967295 it means that the system started it and not a user (reason). Otherwise, it's the UID of a user that you can look up.
- Execute ps auxf and determine the process uptime of your "zombie" process. If it matches the system uptime, it means that the process was started somehow at boot time.
To (perhaps) find out more about what is happening inside this "zombie" process, you can attach to it with strace. This will create a lot of hard to read logs, but since reproducing the problem of having this "zombie" process might not be easy it seems good to at least collect some of these logs (esp. of HTTP requests going to that process) before we kill the process. You would execute, with the PID of your process instead of $PID:
```
strace -o strace.log -f -p $PID
```
To solve the problem for the moment, kill the "zombie" process, supplying its PID for $PID: kill $PID or if needed kill -9 $PID.
Check if that "zombie" process is up and running again after a reboot, and if yes, you'll have to investigate and fix the cause of that.
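The loginuid check from the steps above can be wrapped in a small Python helper. This is a sketch only: the function names are made up, and it assumes a kernel that exposes `/proc/<pid>/loginuid`. The value 4294967295 is the unsigned 32-bit form of -1, i.e. "not set".

```python
UNSET_LOGINUID = 4294967295  # 2**32 - 1, meaning no login UID was recorded


def describe_loginuid(raw: str) -> str:
    """Interpret the content of /proc/<pid>/loginuid."""
    uid = int(raw.strip())
    if uid == UNSET_LOGINUID:
        return "started by the system, not from a user login session"
    return "started from a login session of UID {}".format(uid)


def read_loginuid(pid: int) -> str:
    """Read and interpret the loginuid of a running process."""
    with open("/proc/{}/loginuid".format(pid)) as handle:
        return describe_loginuid(handle.read())


if __name__ == "__main__":
    print(describe_loginuid("4294967295"))
```

Call `read_loginuid(pid)` with the PID you noted for the "zombie" process; a regular UID can then be looked up with `getent passwd <uid>`.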

3. Reproducing the cause

It is possible (but not trivial) to manually create an Apache2 "zombie" process that will run in parallel to the regular Apache server and just "answer nothing". Here are almost-but-not-quite complete instructions:

Create copies of relevant config files:

cp /etc/apache2/envvars /etc/apache2/envvars-zombie
cp /etc/apache2/apache2.conf /etc/apache2/apache2-zombie.conf

Edit /etc/apache2/envvars-zombie and at the beginning of the script statically set SUFFIX="-zombie", overriding the conditional assignment therein.
Edit /etc/apache2/apache2-zombie.conf and prevent the inclusion of any VirtualHost configuration files. In my case, I'd modify the corresponding line to be:
```
# IncludeOptional sites-enabled/
```
Make sure that default listen ports are included in your apache2-zombie.conf file. In my case this already happened via Include ports.conf.

Create the lockfile and log directories needed for the new instance of Apache2, and make them accessible to the user that your new Apache2 will run as:

mkdir /var/log/apache2-zombie
chown www-data /var/log/apache2-zombie/

mkdir /var/lock/apache2-zombie
chown www-data /var/lock/apache2-zombie/

Now you should be able to start your "zombie" Apache process as follows:

cd /etc/apache2/
source envvars-zombie
/usr/sbin/apache2 -f apache2-zombie.conf -k start

Confirm that there is now indeed a second process running on the Apache2 standard ports: netstat -tupan | grep ":80.*LISTEN".
That second Apache2 server is not yet a "zombie" because it will still answer "404 Not Found" or (since we did not set up SSL) result in an SSL error when making a request on port 443. But you can already observe the effect that a few requests go to this new server and result in these errors, in a non-deterministic manner. (I got up to this point in practice …)
To create a "proper" zombie Apache, set up a simple script that will accept an HTTP request and then do nothing (sleep()) for several minutes, to let the browser give up or the TCP connection time out. Install it for the Apache default host. This way, it will be used for all HTTP requests to the port, since we disabled all VirtualHost configs so Apache cannot find a more suitable host for any request and will choose the default one.
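I did not get as far as writing that script. As a stand-in that shows what a client sees from a server that "answers nothing", here is a Python sketch that replaces the Apache-plus-sleep() idea with a bare socket listener: it accepts connections and then simply holds them. It listens on a loopback port picked by the kernel and does not share Apache's port; all names are invented for the example.

```python
import socket
import threading
from typing import List, Tuple


def start_silent_listener() -> Tuple[socket.socket, List[socket.socket]]:
    """Accept TCP connections on a free loopback port and never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: the kernel picks a free port
    server.listen(16)
    held: List[socket.socket] = []

    def accept_forever() -> None:
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listener was closed
            held.append(conn)  # keep a reference so the connection stays open

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, held


def request_times_out(port: int, timeout: float = 1.0) -> bool:
    """Send a GET request and report whether waiting for a reply timed out."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        try:
            client.recv(1)
        except socket.timeout:
            return True
    return False


if __name__ == "__main__":
    server, _held = start_silent_listener()
    print(request_times_out(server.getsockname()[1]))
```

The TCP handshake succeeds and the request is sent, but no byte ever comes back, which is the same hang a browser would show for requests that land on the silent process.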

Hey tanius, thanks a bunch for this comprehensive answer. I think this might actually be the answer. Thing is, it's been over a year and I can't confirm this is still happening on my server. I checked for the zombie process, and there was none. But that might be the exact reason why I can't reproduce it anymore. But I'd really love to know for sure... is there a way to spawn such a zombie process on purpose? — Manuel Otto, Jan 09 '19 at 01:42
@ManuelOtto I added instructions to create such an Apache2 process that will "answer nothing ever". Hope it helps you. — tanius, Jan 13 '19 at 22:15
tanius, apologies for the delayed response. I followed your steps but could not get the zombie-process to start. the start command on step 6 seems to fail silently and no process is spawned (running `... -k stop` returns httpd not running). Do I have to start the zombie as www-data? Currently am root. — Manuel Otto, Jan 18 '19 at 16:15
Starting as root is fine and even needed, as Apache has to bind to the port. After that, it will switch to run as the user told by its config file. I'm afraid I don't have a bright idea how you could still reproduce the issue. Maybe it's due to differences of our systems so that your case needs a slightly different approach. Ubuntu 16.10 on my server. — tanius, Jan 18 '19 at 16:27