
I'm trying to start a program (Resque), but it takes a while before its pidfile is written. As a result, I think Monit assumes the program hasn't started and launches one or two more instances before the first one's pidfile is written.

How do I delay the time Monit checks again, just for this process? Or should I solve this in another way?

Ramon Tayag
  • I added a new answer below. Although waiting longer between checks will prevent collisions for slow services, it can be a really bad experience for customers. – Eddie Mar 21 '15 at 21:14

5 Answers


You can check a specific service on a different interval than the default...

See SERVICE POLL TIME in the Monit documentation.

An example for your Resque program would be to check on a different number of cycles:

check process resque with pidfile /var/run/resque.pid
   every 5 cycles

or from the examples section:

Some servers are slow starters, like for example Java based Application Servers. 
So if we want to keep the poll-cycle low (i.e. < 60 seconds) but allow some services to take its time to start, 
the every statement is handy:

 check process dynamo with pidfile /etc/dynamo.pid every 2 cycles
       start program = "/etc/init.d/dynamo start"
       stop program  = "/etc/init.d/dynamo stop"
       if failed port 8840 then alert

or you can leverage the cron-style checks.

check process resque with pidfile /var/run/resque.pid
   every 10 * * * *

or if you're experiencing a slow startup, you can extend the timeout in the service start command:

check process apache with pidfile /var/run/httpd.pid
       start program = "/etc/init.d/httpd start" with timeout 90 seconds
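
For the Resque case in the question, the two ideas can be combined. This is only a minimal sketch, not a tested configuration: the init script path `/etc/init.d/resque` is an assumption, and the 90-second timeout and 2-cycle interval are placeholder values you would tune to how long the pidfile actually takes to appear.

```
# Sketch only: /etc/init.d/resque is a hypothetical init script path.
check process resque with pidfile /var/run/resque.pid
    # Give the start action up to 90 seconds instead of the default 30.
    start program = "/etc/init.d/resque start" with timeout 90 seconds
    stop program  = "/etc/init.d/resque stop"
    # Poll this service on every second cycle only.
    every 2 cycles
```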
ewwhite

How do I delay the time Monit checks again, just for this process?


What you are trying to achieve can be done with Monit's "SERVICE POLL TIME" feature.

The Monit documentation says:

Services are checked in regular intervals given by the

set daemon n

statement. Checks are performed in the same order as they are written in the .monitrc file, except if dependencies are setup between services, in which case the services hierarchy may alternate the order of the checks.

One way to customize the poll time for a single service is:

  1. A custom interval expressed as a multiple of the poll cycle length

EVERY [number] CYCLES

Example:

check process resque with pidfile /your/app/root/tmp/pid/resque.pid
   every 2 cycles
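
To make the arithmetic concrete, here is a small sketch. The 30-second cycle length is an assumed value; the pidfile path is the one from the example above. The effective interval for the service is the cycle length multiplied by the number of cycles.

```
# Assumed global poll cycle: one cycle = 30 seconds.
set daemon 30

check process resque with pidfile /your/app/root/tmp/pid/resque.pid
    # 2 cycles x 30 seconds = resque is checked every 60 seconds,
    # while all other services are still checked every 30 seconds.
    every 2 cycles
```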

Or should I solve this in another way?


I also first tried to monitor Resque jobs with Monit, because Monit is a very lightweight daemon, but eventually settled on God. I know God is more resource-hungry than Monit, but for Resque we found it to be a good match.

kaji

You can also check if something has failed for X times straight:

 if failed 
    port 80 
    for 10 cycles 
 then alert

Or for X times within Y polls:

 if failed 
    port 80
    for 3 times within 5 cycles 
 then alert

Or both:

 check filesystem rootfs with path /dev/hda1
  if space usage > 80% for 5 times within 15 cycles then alert
  if space usage > 90% for 5 cycles then exec '/try/to/free/the/space'

(from here)
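
Applied to the Resque process from the question, the same tolerance idea might look like the sketch below. It assumes your Monit version accepts a cycle tolerance on the existence test, and the init script path `/etc/init.d/resque` is hypothetical. Monit keeps polling on the normal interval but only restarts after the pidfile has been missing for three consecutive checks, which gives a slow starter time to write it.

```
# Sketch only: verify the existence-test tolerance against your Monit version.
check process resque with pidfile /var/run/resque.pid
    start program = "/etc/init.d/resque start"   # hypothetical path
    stop program  = "/etc/init.d/resque stop"    # hypothetical path
    # Restart only after 3 consecutive failed existence checks.
    if does not exist for 3 cycles then restart
```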

Vaiden
  • This is another very good answer, as it shows how you can check on the default interval, but only take action on a more forgiving basis. – RCross Apr 20 '17 at 16:09

A member of my team came up with a rather clever solution that allows monit to check frequently (every minute), but once it has attempted to restart the service (which takes ~10 minutes) it will wait a specified grace period before attempting to start again.

This avoids waiting too long between checks, which, combined with a slow start, has a much larger impact on customers. It works by using an intermediate script that acts as a flag to indicate that Monit is already acting on the last failure.

check host bamboo with address bamboo.mysite.com
   if failed
           port 443 type tcpSSL protocol http
           and status = 200
           and request /about.action
            for 3 cycles
   then exec "/bin/bash -c 'ps -ef | grep -v "$$" | grep -v "grep" | grep restartBamboo.sh >/dev/null 2>&1; if [ $? -ne 0 ]; then /opt/monit/scripts/restartBamboo.sh; fi'"

If bamboo (slow starting web app) is down for 3 minutes in a row, restart, BUT only if a restart script is not already running.

The script that is called has a sleep that lasts LONGER than the slowest start time for the service (in our case we expect startup to finish in ~10 minutes, so we sleep for 15).

#!/bin/bash
echo "Restarting bamboo by calling init.d"
/etc/init.d/bamboo stop
echo "Stop completed, calling start"
/etc/init.d/bamboo start
echo "Done restarting bamboo, but it will run in the background for some time before it is available, so we are sleeping for 15 minutes"
sleep 900
echo "done sleeping"
Eddie

The current version of Monit (5.16) supports a timeout for the start scripts with the syntax:

 <START | STOP | RESTART> [PROGRAM] = "program"
    [[AS] UID <number | string>]
    [[AS] GID <number | string>]
    [[WITH] TIMEOUT <number> SECOND(S)]

The docs state:

In the case of a process check, Monit will wait up to 30 seconds for the start/stop action to finish before giving up and report an error. You can override this timeout using the TIMEOUT option.

Which is what the "timeout" value will do.

jeteon
  • Extending the timeout works if the actual start takes a long time but in the original question it sounds like the program may have started quickly (i.e. returned) but did not write out the PID immediately. Is there a way to tell monit to not check the service for a specified time after the restart? – PeterVermont Feb 09 '17 at 18:21
  • The `timeout` should apply to both starts and restarts. As far as I understand, it puts in a delay before Monit checks that: a) it's running, b) the expected PID file has been created and c) a process with the expected PID is currently running. I had some issues getting it to work where the specified application was just a script that forked the real process and then returned without knowing what was happening with that process. Getting it to work in this case was a pain. – jeteon Feb 10 '17 at 22:16
  • What about when the system is rebooted and the services are still starting? Is there any way to specify an initial delay, in seconds, for each check? The same question applies to passive checks without start/stop statements. – Massimo Aug 10 '18 at 18:45
  • I believe in that case you might be looking for `START DELAY`. – jeteon Aug 10 '18 at 20:48
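
As a rough illustration of the `START DELAY` suggestion in the last comment: as far as I know this is an option on the global `set daemon` statement rather than a per-check setting, so it postpones the first poll cycle for every service after Monit itself starts. The numbers below are placeholders, not recommendations.

```
# Poll every 60 seconds, but wait 120 seconds after Monit starts
# (for example after a reboot) before running the first check.
set daemon 60 with start delay 120
```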