3

I have already worked through AWS Elastic Beanstalk - Apache is restarting constantly

Our elastic beanstalk instances are reporting the following message in the error_log

[Mon Jun 26 22:01:01.878892 2017] [mpm_prefork:notice] [pid 8595] AH00173: SIGHUP received.  Attempting to restart
*** Error in (wsgi:wsgi)    ': double free or corruption (out): 0x00007f564cced560 ***

sometimes the error sequence will look more like this:

[Tue Jun 27 00:01:01.215260 2017] [:error] [pid 6429] [remote XX.XXX.XX.195:29773] mod_wsgi (pid=6429): Exception occurred processing WSGI script '/opt/python/current/app/site/settings/wsgi/__init__.py'.
[Tue Jun 27 00:01:01.215320 2017] [:error] [pid 6429] [remote XX.XXX.XX.195:29773] OSError: failed to write data
[Tue Jun 27 00:01:01.222407 2017] [:error] [pid 6430] [remote XX.XXX.XX.60:53313] mod_wsgi (pid=6430): Exception occurred processing WSGI script '/opt/python/current/app/site/settings/wsgi/__init__.py'.
[Tue Jun 27 00:01:01.222460 2017] [:error] [pid 6430] [remote XX.XXX.XX.60:53313] OSError: failed to write data
[Tue Jun 27 00:01:04.554810 2017] [core:warn] [pid 8595] AH00045: child process 7614 still did not exit, sending a SIGTERM
[Tue Jun 27 00:01:04.554850 2017] [core:warn] [pid 8595] AH00045: child process 7615 still did not exit, sending a SIGTERM
[Tue Jun 27 00:01:05.555958 2017] [mpm_prefork:notice] [pid 8595] AH00173: SIGHUP received.  Attempting to restart
*** Error in (wsgi:wsgi)    ': double free or corruption (out): 0x00007f5640cae900 ***
*** Error in (wsgi:wsgi)    ': double free or corruption (out): 0x00007f78649b7970 ***

This will go on, almost every hour. The common message being:

[Mon Jun 26 22:01:01.878892 2017] [mpm_prefork:notice] [pid 8595] AH00173: SIGHUP received.  Attempting to restart

I looked for the mpm_prefork module conf block... and there is not one, so all the defaults are being used.

I looked for the logrotation command being executed by elastic beanstalk

/var/log/httpd/* {
size 10M
missingok
notifempty
rotate 5
sharedscripts
compress
dateext
dateformat -%s
create
postrotate
    /sbin/service httpd reload > /dev/null 2>/dev/null || true
endscript
olddir /var/log/httpd/rotated
}

Pretty standard stuff. My understanding of reload is it attempts a graceful restart...

I am able to manually trigger the error message by executing sudo apachectl -k restart although I can not find where this would be run during the log rotation.

We have downstream services which appear to be throwing exceptions at the point this server hangs up all it's connections.

So my question is, what else could be causing the SIGHUP within the mpm_prefork during logrotate? As far as I can tell, this should not be happening outside of an error condition.

Apache/2.4.18 (Amazon) mod_wsgi/3.5 Python/3.4.3

samazi
  • 131
  • 1
  • 5
  • I would "guess" it is an external monitor checking the logs for the kind of exception errors you see in the error.log. Do all the restarts happen within 5 minutes of the exceptions you see? – ezra-s Jun 27 '17 at 08:57
  • Yes, exactly that, around 4/5 minutes – samazi Jun 27 '17 at 08:59
  • then check your crontab or similar for a log monitor – ezra-s Jun 27 '17 at 09:05
  • @ezra-s do you mean something other than the existing hourly logrotate? There is nothing else unless I am missing something? – samazi Jun 27 '17 at 10:08
  • yes, I meant something other, but your description matches the case where something else monitors logs and restarts if it find certain messages. It was just a hint, AFAIK apache won't restart itself ever, so.. it must be something else, external to it. – ezra-s Jun 27 '17 at 15:40
  • Yeah thanks, you are right, it should not restart. I had some advice from @GrahamDumpleton which suggests running mod_wsgi in daemon mode does not support graceful restarts. I just find it hard to believe we are the only ones running this configuration, it is the default config for a Django/mod_wsgi app – samazi Jun 28 '17 at 00:34

1 Answers1

3

As a brief, it looks like current Elastic Beanstalk logrotation configuration is broken, which causes service downtime, 504 Gateway Timeout. Let's take a look.

Reproduction

We create simplest Python WSGI application.

application.py

import time


def application(environ, start_response):
    # somewhat realistic response duration
    time.sleep(0.5)

    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [b'Hello world!\n']

Zip it to application.zip. Then create Elastic Beanstalk Python application and environment, upload the archive. Make sure you use a key pair that you possess. Leave other settings default. Wait until it's done (several minutes).

ssh into underlying EC2 instance (see instance identifier in EB's log). Type (httpd's logrotate post-action, see below):

sudo /sbin/service httpd reload

Then on your machine run:

siege -v -b -c 10 -t 10S http://your-test-eb.you-aws-region.elasticbeanstalk.com/

And while it runs, repeat reload command a couple of times.

Then you're expected to see something like the following:

** SIEGE 3.0.8
** Preparing 10 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200   0.63 secs:      13 bytes ==> GET  /
HTTP/1.1 200   0.65 secs:      13 bytes ==> GET  /
HTTP/1.1 200   0.64 secs:      13 bytes ==> GET  /
HTTP/1.1 200   0.60 secs:      13 bytes ==> GET  /
...

Here's what happens when you reload.

HTTP/1.1 504   0.06 secs:       0 bytes ==> GET  /
HTTP/1.1 504   0.07 secs:       0 bytes ==> GET  /
HTTP/1.1 504   0.08 secs:       0 bytes ==> GET  /
HTTP/1.1 504   0.10 secs:       0 bytes ==> GET  /
HTTP/1.1 504   0.11 secs:       0 bytes ==> GET  /
HTTP/1.1 504   0.66 secs:       0 bytes ==> GET  /
HTTP/1.1 504   0.19 secs:       0 bytes ==> GET  /
HTTP/1.1 504   0.20 secs:       0 bytes ==> GET  /
HTTP/1.1 504   0.09 secs:       0 bytes ==> GET  /

Then it recovers.

HTTP/1.1 200   1.25 secs:      13 bytes ==> GET  /
HTTP/1.1 200   1.24 secs:      13 bytes ==> GET  /
HTTP/1.1 200   1.26 secs:      13 bytes ==> GET  /
...

Lifting the server siege..      done.

Transactions:                 75 hits
Availability:              81.52 %
Elapsed time:               9.40 secs
Data transferred:           0.00 MB
Response time:              1.21 secs
Transaction rate:           7.98 trans/sec
Throughput:             0.00 MB/sec
Concurrency:                9.68
Successful transactions:      75
Failed transactions:          17
Longest transaction:        4.27
Shortest transaction:       0.06

Note that ELB doesn't seem to have any effect on the problem, and the same can be reproduced with two SSH sessions to underlying EC2 and (Amazon AMI doesn't have siege):

ab -v 4 -c 10 -t 10 http://your-test-eb.you-aws-region.elasticbeanstalk.com/

Cause

/etc/cron.hourly/cron.logrotate.elasticbeanstalk.httpd.conf

#!/bin/sh
test -x /usr/sbin/logrotate || exit 0
/usr/sbin/logrotate /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf

/etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf

/var/log/httpd/* {
size 10M
missingok
notifempty
rotate 5
sharedscripts
compress
dateext
dateformat -%s
create
postrotate
    /sbin/service httpd reload > /dev/null 2>/dev/null || true
endscript
olddir /var/log/httpd/rotated
}

Notice postrotate. /sbin/service is just a System V wrapper for scripts in /etc/init.d/. Its man page says:

service runs a System V init script in as predictable environment as possible, removing most environment variables and with current working directory set to /.

Note that reload is not standard Apache maintenance command. It's the distro's downstream addition. Let's look in the init script, /etc/init.d/httpd. Relevant part follows:

reload() {
        echo -n $"Reloading $prog: "
        check13 || exit 1
        killproc -p ${pidfile} $httpd -HUP
        RETVAL=$?
        echo
}

As you can see it sends HUP signal to Apache, which is interpreted as Restart Now:

Sending the HUP or restart signal to the parent causes it to kill off its children like in TERM, but the parent doesn't exit. It re-reads its configuration files, and re-opens any log files. Then it spawns a new set of children and continues serving hits.

TERM explains 504s pretty well. But how it should have probably been done is Graceful Restart, as it also re-opens logs but doesn't terminate requests being served:

The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.

...

The code was written to both minimize the time in which the server is unable to serve new requests (they will be queued up by the operating system, so they're not lost in any event) and to respect your tuning parameters.

Workaround

It's possible to use .ebextensions to replace /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf. In the root directory create .ebextensions/10_logs.config with the following contents (basically replace "reload" with "graceful"):

files:
    "/etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf":
        mode: "000644"
        owner: root
        group: root
        content: |
            /var/log/httpd/* {
                size 10M
                missingok
                notifempty
                rotate 5
                sharedscripts
                compress
                dateext
                dateformat -%s
                create
                postrotate
                    /sbin/service httpd graceful > /dev/null 2>/dev/null || true
                endscript
                olddir /var/log/httpd/rotated
            }

And re-deploy your Elastic Beanstalk environment. Note, however with subsequent sub-second graceful restarts I was able to (sporadically) produce 503 Service Unavailable, which, though, is not the case with log rotations as with evenly spaced graceful restarts there was no error.

saaj
  • 330
  • 3
  • 11
  • I had the same problem, but as my application takes some time to restart, I have about a minute of downtime. Is there a good practice to avoir this downtime due to logrotate? – Jundiaius Jun 18 '18 at 14:03
  • 2
    @Jundiaius You can try to rewrite your `logrotate` configuration to use `copytruncate` instead of `create`. This way you don't need to restart your application in any way, but with a small chance for loosing several lines on rotation. – saaj Jun 18 '18 at 14:09