6

We have the following setup: An AWS Application ELB has a target group of 4 EC2 instances. On each of the EC2 instances, there is an Apache2 which forwards to a Tomcat.

It works brilliantly. Except sometimes, when the ELB seemingly refuses to speak to a healthy instance and responds 502 instead. This is what it looks like in the ELB access log:

https 2018-03-25T09:36:13.797244Z app/XXXXXX 178.162.220.236:50254 172.31.24.81:1443 0.001 0.031 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d8d-2f54c360354a57532ccc55ef" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:14.169090Z app/XXXXXX 178.162.220.236:50270 172.31.22.59:1443 0.001 0.096 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d8e-748fc1e34bee608baa1efa81" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:18.785982Z app/XXXXXX 178.162.220.236:50330 172.31.19.151:1443 0.001 0.034 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d92-34ac7ac52b69c34fdfdbce33" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:28.457341Z app/XXXXXX 178.162.220.236:50808 172.31.28.152:1443 0.001 0.075 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d9c-66b8f0a5c6be9ff6c098dfe4" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:30.615879Z app/XXXXXX 178.162.220.236:50900 172.31.24.81:1443 0.001 0.032 0.000 204 204 314 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d9e-d1d310c5d55ad09295bb08b9" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:30.955285Z app/XXXXXX 178.162.220.236:50914 172.31.19.151:1443 0.001 0.091 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d9e-079416df92b2cd27865c8b93" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:36.843389Z app/XXXXXX 178.162.220.236:51090 172.31.28.152:1443 0.001 0.081 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da4-917b032b1fb8c1c0dc914354" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:40.432293Z app/XXXXXX 178.162.220.236:51184 172.31.24.81:1443 0.002 0.076 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da8-1ee52e1fa688cc81071fc8d1" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:41.398389Z app/XXXXXX 178.162.220.236:51208 172.31.19.151:1443 0.001 0.033 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da9-2ee889d47ead389675818ea6" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:41.775607Z app/XXXXXX 178.162.220.236:51216 172.31.28.152:1443 0.000 0.000 -1 502 - 314 293 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da9-fad6ea5fa8cc956c24d65206" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:46.954552Z app/XXXXXX 178.162.220.236:50912 172.31.22.59:1443 0.001 16.118 0.000 204 204 293 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d9e-9c0827e71b334dd908000419" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:50.750077Z app/XXXXXX 178.162.220.236:51188 172.31.22.59:1443 0.001 10.096 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da8-f99e3281d1e0e2a73c7c4ba9" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:52.336104Z app/XXXXXX 178.162.220.236:51510 172.31.24.81:1443 0.001 0.115 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76db4-2640adae88a8639f196613e9" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:55.563397Z app/XXXXXX 178.162.220.236:51604 172.31.22.59:1443 0.000 0.104 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76db7-ecff4da4c9414383ce128eae" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:58.824749Z app/XXXXXX 178.162.220.236:51682 172.31.19.151:1443 0.001 0.034 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76dba-425a808b5efc4c3cb4cb7f53" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:37:09.004924Z app/XXXXXX 178.162.220.236:51994 172.31.28.152:1443 0.001 0.135 0.000 204 204 655 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76dc4-5d8dce64e6fcb2aa069e9018" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:11.146813Z app/XXXXXX 178.162.220.236:50212 172.31.24.81:1443 0.001 0.031 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d8b-ce30a6a810a89110465a9944" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0

Note the 502 as the elb_status_code in the 10th line. Also note the -1 as the target_status_code.

There is absolutely no trace of that request neither in the apache2 logs, nor the tomcat logs. All four instances remain healthy throughout. The same IP (172.31.28.152 in this case) correctly processes requests before and after.

The sending system is under our control. All the requests are uniform, so it is not some malformed request not being processed correctly. They all look the same.

Any ideas why this could happen? I assume AWS do not sell faulty load balancers...

Jan Dörrenhaus
  • 161
  • 1
  • 7
  • 2
    The balancer holds idle connections open to the back-end, for reuse. If the back-end closes one of those connections during a narrow time window -- just as the balancer tries to reuse it for a subsequent request -- that would be one possible explanation. Verify that your keep-alive timeout on the web server is longer than the [connection idle time-out](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout) on the balancer? – Michael - sqlbot Apr 24 '18 at 18:54
  • That's an excellent idea! I have now set the KeepAlive of the ELB to 60 seconds, and to 120 seconds on the Apache. Let's see if this does the trick, I will update this question in a couple of days. – Jan Dörrenhaus Apr 25 '18 at 08:37
  • @Michael-sqlbot This seems to have been the solution. Now that we aligned the keep-alive timeouts, the problems are gone. Would you care to answer this question, so I can mark it as the solution and link it for posterity? – Jan Dörrenhaus Apr 27 '18 at 21:23

1 Answers1

3

The solution was in the comments. The ELB has a default connection KeepAlive of 60 seconds. Apache2 has a default connection KeepAlive of 5 seconds. If the 5 seconds are over, the Apache2 closes its connection and resets the connection with the ELB. However, if a request comes in at precisely the right time, the ELB will accept it, decide which host to forward it to, and in that moment, the Apache closes the connection. This will result in said 502 error code.

The solution is: When you have cascading proxies/LBs, either align their KeepAlive timeouts, or - preferrably - even make them a little longer the further down the line you get.

We set the ELB timeout to 60 seconds and the Apache2 timeout to 120 seconds. Problem gone.

Jan Dörrenhaus
  • 161
  • 1
  • 7
  • How would I configure this for Apache/2.4.33 (Amazon) on amazon linux? My connection idle timeout is 180. What would I make this for my apache instances? How do I tell if keep alive is already on? What are pros & cons? – Chris Muench Jun 20 '19 at 21:11
  • I managed to figure it out. You need to set https://httpd.apache.org/docs/2.4/mod/core.html#keepalive `KeepAlive` On and `KeepAliveTimeout` XXX in apache httpd.conf. This seems to solve problem – Chris Muench Jun 21 '19 at 14:17
  • It seemed to solve problem for a few hours then it came back...anything else that can cause this? – Chris Muench Jun 21 '19 at 17:20