8

I have this runit service with run and log/run scripts properly working.

As it happens, the service itself can crash for external reasons and might not be able to start for many minutes. The default way that runit handles this situation is by restarting the service every couple of seconds. How do I change this behaviour?

My latest idea was to add a check script and do some magic there, but that seems much more complicated than it should be. Is there a better, simpler way?

jpbochi

3 Answers

9

You should be rate-limiting your restarts in the ./finish script for that service, which runsv runs every time ./run exits (not only after a crash). The ./finish script receives the exit code of ./run as its first argument (-1 if ./run did not exit normally), and from there you can determine what to do. For that matter, you should have your ./finish script screaming loudly about the failures, sending notifications, and jumping all around on fire...
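A minimal sketch of such a ./finish script, assuming a simple policy of "no extra delay after a clean exit, a fixed pause after anything else". The `FAIL_DELAY` variable, its 60-second default, and the `restart_delay` helper are illustrative names, not part of runit. It also assumes that runsv waits for ./finish to exit before it restarts ./run.

```shell
#!/bin/sh
# Hypothetical ./finish for a runit service: pause before runsv restarts ./run.
# runsv passes the exit code of ./run as $1 (-1 if ./run did not exit normally)
# and the low byte of the waitpid status as $2.

# Assumed policy knob: seconds to wait after a failed run (illustrative default).
FAIL_DELAY="${FAIL_DELAY:-60}"

# Print the number of seconds to wait for a given ./run exit code.
restart_delay() {
    if [ "$1" -eq 0 ]; then
        echo 0
    else
        echo "$FAIL_DELAY"
    fi
}

# Only act when called with arguments, as runsv does.
if [ "$#" -gt 0 ]; then
    delay=$(restart_delay "$1")
    echo "run exited with code $1 (status byte ${2:-unknown}); waiting ${delay}s" >&2
    sleep "$delay"
fi
```

The `echo` line is where notifications would hook in; the delay could just as well grow with the number of recent failures if a fixed pause is too crude.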

Avery Payne
  • Thanks, this is the right answer, but unfortunately modern programmers using Python, Ruby, etc. seem to always write apps that don't pay any attention to Unix signals and don't provide proper exit codes at all. – figtrap Mar 01 '18 at 19:30
  • 1
    Returned error codes apparently are "uncool" I guess? – Avery Payne Mar 01 '18 at 22:55
  • Seems like it. I think it's a great step backwards, myself. – figtrap Mar 09 '18 at 20:45
3

I'm not familiar with this facility, however, if it was my task to solve this problem, and a very short man page reading did not offer a simple knob to tune this behaviour, I'd do the following:

Either extend the existing service start script, or if that is cumbersome, insert a new start script into the chain (which in turn starts the original start script). Instead of starting the service right away, the new start script should check if the last start happened recently enough. This can be done by checking a signaling file created by the previous start. If the file does not exist, the script can go on and touch the file and start the service. If the file exists, the script should check if the file is old enough. If it is not old enough, it should wait (sleep) in a loop until the file gets old enough.

Something like this might work (waits at least 1 minute between restarts):

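#!/bin/bash
# Delay the start until at least one minute has passed since the previous start.

SIGNALDIR=/tmp
SIGNALFILE=service.started

while true; do
        # Count marker files modified less than one minute ago (0 or 1).
        found=$(find "${SIGNALDIR}" -maxdepth 1 -name "${SIGNALFILE}" -mmin -1 | wc -l)
        [ "${found}" -eq 0 ] && break
        echo "Waiting"
        sleep 10
done

# Record the time of this start, then hand over to the real service.
touch "${SIGNALDIR}/${SIGNALFILE}"
# original service start...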
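
A possible variation on the polling loop above: store the last start time in the file itself and sleep once, for exactly the remaining time. This is only a sketch; the `STAMP` path, the `MIN_INTERVAL` default, and both function names are assumptions made for illustration, and `date +%s` is assumed to be available (it is in GNU, BSD, and BusyBox `date`, but POSIX does not require it).

```shell
#!/bin/sh
# Sketch: remember the last start (epoch seconds) and wait out the remainder
# of a minimum interval. Path and interval are illustrative assumptions.
STAMP="${STAMP:-/tmp/service.started}"
MIN_INTERVAL="${MIN_INTERVAL:-60}"

# Print the seconds still to wait, given the last start time and the current time.
seconds_to_wait() {
    last=$1
    now=$2
    elapsed=$((now - last))
    # Treat a clock that moved backwards as "no time has passed".
    [ "$elapsed" -lt 0 ] && elapsed=0
    if [ "$elapsed" -ge "$MIN_INTERVAL" ]; then
        echo 0
    else
        echo $((MIN_INTERVAL - elapsed))
    fi
}

# Sleep if the previous start was too recent, then record this start.
wait_then_stamp() {
    now=$(date +%s)
    last=0
    [ -r "$STAMP" ] && last=$(cat "$STAMP")
    last=${last:-0}
    sleep "$(seconds_to_wait "$last" "$now")"
    date +%s > "$STAMP"
}
```

Calling `wait_then_stamp` at the top of the start script, just before the original service start, would give the same one-minute spacing without waking up every 10 seconds.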
Laszlo Valko
  • That's a good approach. As soon as I test it, I'll update the script with any necessary corrections. – jpbochi Oct 06 '14 at 15:30
2

I'm really not a fan of init-based process management (and runit is basically an init substitute). As you are discovering, simple-minded restarting of failed processes as soon as they die is not a particularly good strategy. I've used init to restart monit, but that's as far as it goes (the OOM killer could potentially kill monit).

So, I'd encourage you to look for a replacement rather than patch things up.

Monit is pretty old, but it does the job well, and I'm not aware of anything better having come along. It's got the nice feature of not needing to malloc more memory after start-up, so beats the hell out of anything written in a scripting language. The last thing you want is your process monitor dying because it can't get memory.

mc0e
  • systemd, included in EL7 and _most_ other distributions, can natively handle this situation and a variety of similar situations with [a massive number of options](http://www.freedesktop.org/software/systemd/man/systemd.service.html) and mostly makes process managers like these obsolete. – Michael Hampton Apr 23 '15 at 18:57
  • 1
    There are a small handful of situations where systemd may be "too large" for the target environment. And the old method of "process management by restarting until running" has been mostly superseded by proper dependency resolution. See https://skarnet.org/software/s6-rc/ and https://jjacky.com/anopa/ for examples. – Avery Payne Jun 05 '18 at 21:31