My server is behaving strangely and I just can't find the cause. I've been looking everywhere.
I will pay $200 worth of bitcoin to anyone who can figure this out.
The problem:
When requesting any resource from Apache (page, image, CSS, JS), the response sometimes takes a very long time. In about half of those cases the connection ends up being reset (Chrome shows net::ERR_CONNECTION_RESET). This happens rarely, at random, and is completely unpredictable. More confusingly, while one request hangs, I can make additional requests that work perfectly.
About the server:
I'm running Apache 2.4 (mpm_prefork) with PHP 7.0 on Debian 9. Apache uses mod_rewrite and an SSL certificate from certbot. On some occasions, PHP invokes Inkscape to render SVGs to PNG.
The server load is very low (0.02) and nothing but apache runs on it.
Things checked:
- checked all server logs. (syslog, apache log) - nothing
- increased the apache mpm-prefork limits - nope
- checked for possible DNS problems - nothing
- I even moved to a completely new root server (on a different provider) - still the same
I went on and analyzed the TCP traffic with Wireshark, and there is some suspicious behaviour. When the connection freezes, Wireshark flags some packets as TCP Out-Of-Order, TCP Retransmission and TCP ACKed unseen segment, but I don't have the low-level knowledge to tell what's going on.
Any hints would be greatly appreciated!
EDIT:
This is the mpm_prefork config:
<IfModule mpm_prefork_module>
StartServers 10
MinSpareServers 10
MaxSpareServers 50
MaxRequestWorkers 300
MaxConnectionsPerChild 0
</IfModule>
EDIT EDIT:
I got lucky and had a TCP sniffer running on both server and client when it happened again. Here are the pcap files, trimmed to the last ~30 seconds.
If anyone with the knowledge could take a quick look at it and tell me what's going on, I'd be thrilled.
EDIT EDIT EDIT:
I managed to make the error reproducible, at least with KeepAlive on. After a request has finished and the content has been served, the TCP connection is closed with a FIN-ACK after 5 seconds. If I make another request within 5 to 12 seconds after the FIN-ACK, the connection freezes.
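For anyone who wants to try the timing outside a browser, here is a rough Python sketch of the sequence described above: one request, an idle pause on the same connection, then a second request. The function name, host, port and path are placeholders I made up, not anything from my real setup. Note that Python's http.client reports a connection the peer has already closed as an exception instead of retrying the way a browser might, so the outcomes are only an approximation of what Chrome sees.

```python
import http.client
import socket
import time


def probe_after_idle(host, port, idle_seconds, path="/", use_tls=True, timeout=30.0):
    """Make one request, stay idle on the same connection, then make a second one.

    Returns (outcome, seconds) for the second request, where outcome is
    "ok", "closed" (the peer closed or reset the connection) or "timeout".
    """
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=timeout)
    try:
        # First request: opens the connection and reads the full response.
        conn.request("GET", path)
        conn.getresponse().read()

        # Idle period during which the server may send its FIN-ACK.
        time.sleep(idle_seconds)

        # Second request on the same connection object, timed separately.
        start = time.monotonic()
        try:
            conn.request("GET", path)
            conn.getresponse().read()
            outcome = "ok"
        except socket.timeout:
            outcome = "timeout"
        except (http.client.RemoteDisconnected, ConnectionError):
            outcome = "closed"
        return outcome, time.monotonic() - start
    finally:
        conn.close()


# Example sweep (hypothetical host). If the FIN-ACK arrives 5 s after the
# response, the 5-12 s window after it corresponds to roughly 10-17 s idle:
#
# for idle in range(8, 20):
#     print(idle, probe_after_idle("www.example.com", 443, idle))
```

A "timeout" result for idle times inside the window and "ok" or "closed" outside it would match what I see in the browser. Running a packet capture alongside the sweep should show which side stops answering.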
With KeepAlive off, however, this no longer happens, though the error then occurs even more often when loading multiple resources at the same time. In that case I can no longer reproduce it on demand.
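To put a number on the KeepAlive-off case, something like the following Python sketch could fire many requests in parallel, each on a fresh connection, and count how they end. The function names, the worker count and the host are placeholders of mine, and the "Connection: close" header is only my way of mimicking KeepAlive off from the client side.

```python
import http.client
import socket
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_once(host, port, path="/", use_tls=True, timeout=30.0):
    """Fetch one path over a fresh connection; return (outcome, seconds)."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=timeout)
    start = time.monotonic()
    try:
        # Ask the server to close after this response (no keep-alive).
        conn.request("GET", path, headers={"Connection": "close"})
        conn.getresponse().read()
        outcome = "ok"
    except socket.timeout:
        outcome = "timeout"
    except (http.client.HTTPException, ConnectionError):
        # Reset, refused, or a response cut off part-way through.
        outcome = "broken"
    finally:
        conn.close()
    return outcome, time.monotonic() - start


def fetch_parallel(host, port, paths, workers=8, **kwargs):
    """Fetch all paths concurrently; return a Counter of outcomes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: fetch_once(host, port, p, **kwargs), paths)
        return Counter(outcome for outcome, _ in results)


# Example (hypothetical host and paths):
#
# print(fetch_parallel("www.example.com", 443, ["/", "/style.css", "/app.js"] * 10))
```

Repeating the run a few times and comparing the counts of "timeout" and "broken" against "ok" would at least show how often the problem hits under parallel load, even if a single failure cannot be triggered deliberately.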