monit: how do I restart many Tomcats without overloading the server?

Question

My server has several separate Apache Tomcat instances that each take a lot of time and CPU to start. It is not possible to start all of them at the same time: that would generate too much I/O, each service would take much longer to start, and some services might even fail to start because of internal timeouts.

Here is some pseudo code that describes what I want to do. How would I accomplish this with a monitrc file?

check process service01 with pidfile /var/run/service01.pid
    start program = "/usr/sbin/service service01 start" with timeout 60 seconds
    stop program  = "/usr/sbin/service service01 stop"
    if does not exist then
        wait a random number of seconds (between 2 and 5 minutes)
        if the cpu load is < 100% then
            start program
        else 
            do nothing (check again in the next cycle)

check process service02 with pidfile /var/run/service02.pid
....

This code block would be repeated for each of the 10 services.

The critical step is the random wait. Otherwise, if the server is idle and no service is running (for example after a 'killall -9 java'), monit would check all services, find that the CPU load is low at that moment, and start all services at once.

It's not clear in the question whether you have one instance of Tomcat with multiple applications, or multiple instances? What is the OS? — sam_pan_mariusz, Aug 01 '15 at 19:46
I am using several Tomcat instances. I have clarified the question a bit. — nn4l, Aug 02 '15 at 07:18

sam_pan_mariusz · Answer 1 · 2015-08-03T19:59:03.017

You didn't say much about your OS, so I can only assume it's Linux (from the kill -9 ... part). I also don't know much about monit, but I assume it's flexible enough to retry starting a service if the start fails.

I assume that the Tomcat instances are started by shell startup script(s). Add this near the beginning of each script:

# edit the 3 lines to set your limits
LOAD_THRESHOLD=0.75
LOCK_TIME=30
TIME_LIMIT=120

LOCK_FILE='/var/lock/tomcat-delay.lock'

if [ -z "${TOMCAT_NOLOCK}" ]; then
    # simple locking mechanism to avoid simultaneous start of instances
    if [ -f "${LOCK_FILE}" ] && [ $(cat "${LOCK_FILE}") -gt $(date '+%s') ]; then
        exit 1
    else
        expr $(date '+%s') + ${LOCK_TIME} 1>"${LOCK_FILE}"
    fi
fi

T_TIME=0
while true; do
    # check for non-empty TOMCAT_NOWAIT
    if [ -n "${TOMCAT_NOWAIT}" ]; then
        break 1
    fi
    read T_LOAD60 T_REST </proc/loadavg
    # check current 60 sec. average value for system load
    if expr ${T_LOAD60} '<' ${LOAD_THRESHOLD} 1>/dev/null; then
        break 1
    fi
    # check for timeout
    if [ ${T_TIME} -ge ${TIME_LIMIT} ]; then
        # change to 'exit 1' to fail on timeout instead of proceeding
        break 1
    fi
    sleep 1s
    echo -n '.'
    T_TIME=$((${T_TIME} + 1))
done

The code above doesn't check the CPU load alone but the system load average, which by design includes all the factors that may slow the system down. TIME_LIMIT is in seconds. If the load does not fall below the threshold within that time, the script goes ahead and starts the service anyway; change the final break 1 to exit 1 to abort the startup instead and let the monit daemon retry.
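
One caveat with the load check above: expr compares operands that are not integers as strings, so with a threshold such as 2.5 a load of 10.00 would be judged "smaller". With the 0.75 default this happens to work. A minimal sketch of a numeric comparison using awk instead; the function name load_below and its optional file argument are made up for illustration and are not part of the script above.

```shell
#!/bin/sh
# load_below THRESHOLD [FILE]   (hypothetical helper, not part of the original script)
# Succeeds (exit 0) if the 1-minute load average is numerically below THRESHOLD.
# FILE defaults to /proc/loadavg; the argument only exists to make testing easy.
load_below() {
    threshold=$1
    file=${2:-/proc/loadavg}
    # the first field of /proc/loadavg is the 1-minute load average
    read -r load rest <"$file" || [ -n "$load" ] || return 2
    # adding 0 forces awk to compare the values as numbers, not as strings
    awk -v l="$load" -v t="$threshold" 'BEGIN { exit !((l + 0) < (t + 0)) }'
}
```

In the wait loop this could replace the expr line, for example: if load_below "${LOAD_THRESHOLD}"; then break 1; fi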

If you start a service manually (not from monit), it will also wait, which I consider an advantage. Export the environment variable TOMCAT_NOWAIT with a non-empty value to skip the wait.

Edit #1: added a simple locking mechanism as a workaround for the simultaneous-startup problem. A non-empty TOMCAT_NOLOCK environment variable disables locking. Set LOCK_TIME to the warm-up time of an instance, so that the high load is detected properly.

This doesn't work. If you have several Tomcat instances, each of them started with this script, then at boot time all of these Tomcats would be started at roughly the same time. Each of them sees the same loadavg value which is still low right after booting, so each of them will continue starting Tomcat immediately - which will crash the system (in my case). — nn4l, Aug 03 '15 at 15:07
I have assumed that *monit* starts only one service at a time (I'm no expert with this program). Can it be configured that way? If no, I could add a simple locking mechanism to the code, which would be better anyway - more universal. — sam_pan_mariusz, Aug 03 '15 at 18:51
that was my question - how to start one service at a time with monit, depending on the load. I am right now using a script similar to yours (but using flock) as a workaround. — nn4l, Aug 04 '15 at 02:50

score 0 · Accepted Answer · answered Sep 17 '15 at 08:56

I have now figured out a setup that does the job. After a restart, or after several processes have failed, the system load is checked, and each service is started only once the 1-minute load average is below 1 or after a long delay. The scripts below work just fine in my environment:

Edit /etc/monit/monitrc:

...
## Start Monit in the background (run as a daemon):
#
set daemon 120              # check services at 2-minute intervals
    with start delay 240    # optional: delay the first check by 4-minutes (by
#                           # default Monit check immediately after Monit start)

For each service, add this to /etc/monit/conf.d:

check process myname with pidfile /var/run/app0000.pid
    start program = "/usr/sbin/service app0000 start" with timeout 60 seconds
    stop program  = "/usr/sbin/service app0000 stop"
    if does not exist then exec "/root/bin/service_with_delay app0000 start"
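
The question mentions 10 services, so this stanza is repeated with only the name changed. A minimal sketch of generating the files with a loop. The helper names monit_stanza and write_stanzas are made up, and the sketch assumes that the check name, the pidfile and the init script all share the service name, which may not match your setup.

```shell
#!/bin/sh
# monit_stanza NAME   (hypothetical helper)
# Prints a monit check for one service, following the stanza above.
# Assumes pidfile and init script are both named after the service.
monit_stanza() {
    name=$1
    cat <<EOF
check process ${name} with pidfile /var/run/${name}.pid
    start program = "/usr/sbin/service ${name} start" with timeout 60 seconds
    stop program  = "/usr/sbin/service ${name} stop"
    if does not exist then exec "/root/bin/service_with_delay ${name} start"
EOF
}

# write_stanzas DIR NAME...   (hypothetical helper)
# Writes one file per service into DIR, e.g. /etc/monit/conf.d.
write_stanzas() {
    dir=$1
    shift
    for svc in "$@"; do
        monit_stanza "$svc" >"${dir}/${svc}"
    done
}
```

For example, write_stanzas /etc/monit/conf.d service01 service02 would create two files; running monit -t afterwards checks the control file syntax before a monit reload.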

Create script /root/bin/service_with_delay:

#!/bin/bash
(
  # Try to take the lock on /var/lock/service_with_delay.lock (fd 9) without
  # blocking; exit at once if another instance already holds it
  flock -n 9 || exit 1

  for i in `seq 1 10`; do

    # start the service if the 1-minute load average is < 1.0, or on the
    # 10th try (after about 270 seconds of waiting)

    read load ignore </proc/loadavg
    flag=`expr ${load} '<' 1`
    if [ ${flag} -eq 1 ] || [ ${i} -eq 10 ]; then

        echo `date` service_with_delay $1: pid $$ load ${load} i ${i} - starting >> /var/log/service_with_delay.log
        /usr/sbin/service $1 start

        # make sure next script getting the lock sees some load
        sleep 60
        break
    fi

    # wait
    echo `date` service_with_delay $1: pid $$ load ${load} i ${i} >> /var/log/service_with_delay.log
    sleep 30
  done
) 9> /var/lock/service_with_delay.lock
