Can systemd detect and kill hung processes?

16

7

While working on a solution which uses file locking, I believe my code is getting into a deadlock. I'm using systemd to kick off the process on system startup. Using alarm(3) is an option, but I was wondering if there is a way for systemd to detect hung processes and restart them?

As a workaround for now, I'm planning to watch the journalctl output and, if it doesn't change for a specified amount of time, kill the process from a shell script.

Just wondering if there's a better way to monitor processes through systemd or otherwise.

freethinker

Posted 2013-12-16T10:06:42.487

Reputation: 3 160

Probably not. How do you tell if a process is hung? What if you really need something like for(;;) do_something();? – mvp – 2013-12-16T10:13:21.127

Strictly speaking, if your code hangs you should debug that problem. Killing it via systemd (supposing it can be done, which I do not believe) or in any other way is the proper thing to do as you debug it. But you just cannot leave it free to go into a deadlock. – MariusMatutiae – 2013-12-16T10:20:10.620

Answers

25

Yes; but first fix your buggy program before fiddling with systemd.

MariusMatutiae is quite correct. You have a problem with your program. It deadlocks. Fiddling with systemd isn't the answer. At best, it's a distraction. Fix your program so that it isn't broken. Direct your energies at the right thing.

That said, other people are going to come here because of the question title, rather than the question proper. For their benefit, here's the answer to the title, ignoring the question proper:

Yes, systemd can monitor dæmons and automatically restart them if they stop talking. Not just any old dæmons, though. As mvp notes, there's no way to know that a dæmon has hung (in this universe, where the halting problem is undecidable, at least). Neither systemd nor any other computer program will ever be capable of deducing from scratch that some random program thrown at them has deadlocked, or gone into an infinite loop, or whatever. The best that you'll get here is detecting that a dæmon hasn't performed a regular "heartbeat" operation within a required timespan.

Dæmons that take advantage of systemd's watchdog capabilities, therefore, have to be written to speak a systemd-specific protocol, the sd_notify protocol. This complicates the dæmon code a tad. It's complicated further because dæmons should, if written properly, check whether they've been invoked with the watchdog function enabled, as well.
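As a rough illustration of that "check whether the watchdog is enabled" step, here is a minimal Python sketch. It assumes the convention documented for sd_watchdog_enabled(3): systemd passes the timeout in the WATCHDOG_USEC environment variable and, on newer versions, the intended process ID in WATCHDOG_PID. The function name watchdog_interval and its parameters are made up for this example.

```python
import os


def watchdog_interval(environ=os.environ, pid=None):
    """Return a suggested ping interval in seconds, or None if no watchdog
    was requested for this process. (Hypothetical helper, not a systemd API.)
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None  # not started with WatchdogSec=, or not under systemd
    try:
        timeout_usec = int(usec)
    except ValueError:
        return None  # malformed value: treat as "watchdog not enabled"
    if timeout_usec <= 0:
        return None
    # If WATCHDOG_PID is present, the request is meant for that process only.
    wanted_pid = environ.get("WATCHDOG_PID")
    own_pid = os.getpid() if pid is None else pid
    if wanted_pid is not None and wanted_pid != str(own_pid):
        return None
    # Ping at half the timeout; WATCHDOG_USEC is in microseconds.
    return timeout_usec / 2 / 1_000_000
```

A dæmon would call this once at start-up and simply skip the heartbeat logic when it gets None back, so that it still runs normally outside systemd.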

A dæmon that speaks this protocol to make use of systemd's watchdog capability …

  • … must check for the WATCHDOG_USEC environment variable;
  • … must call sd_notify() continually and frequently, throughout its lifetime, with the WATCHDOG=1 option set, at an interval of about WATCHDOG_USEC/2 ("USEC" stands for microseconds);
  • … must have Type=notify set in its unit file;
  • … should have NotifyAccess=main (or =all) set in its unit file;
  • … must have WatchdogSec=seconds set in its unit file;
  • … must link with libsystemd-daemon.so (merged into libsystemd.so as of systemd 209).
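To make the list above a little more concrete, here is a hedged Python sketch of the dæmon side. Instead of linking against the C library it speaks the wire protocol directly, which amounts to sending a short datagram to the Unix socket named in the NOTIFY_SOCKET environment variable. The function names, the work callback, and the unit-file values in the comment are all assumptions made for illustration; Restart=on-watchdog is not mentioned above and is included only because restarting is the behaviour the question asks for.

```python
import os
import socket
import time

# Assumed unit-file settings for a hypothetical service:
#   [Service]
#   Type=notify
#   NotifyAccess=main
#   WatchdogSec=30
#   Restart=on-watchdog


def sd_notify(state):
    """Send one state string (e.g. "WATCHDOG=1") to systemd.

    Returns False when NOTIFY_SOCKET is unset, i.e. when the process
    was not started by systemd with notifications enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def heartbeat_loop(work, interval, max_iterations=None):
    """Run work() repeatedly, pinging the watchdog every `interval` seconds.

    `work` is a hypothetical callable that must return well within
    `interval`; if it hangs, the pings stop and systemd can react.
    """
    sd_notify("READY=1")  # Type=notify: start-up is complete
    last_ping = time.monotonic()
    done = 0
    while max_iterations is None or done < max_iterations:
        work()
        done += 1
        now = time.monotonic()
        if now - last_ping >= interval:
            sd_notify("WATCHDOG=1")
            last_ping = now
```

The interval passed in would normally be half of WATCHDOG_USEC, converted to seconds. Note that the ping comes from the same loop that does the real work; sending it from a separate timer thread would keep the heartbeat going even while the main loop is deadlocked, which defeats the purpose.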

If you want to know the details of coding this, after reading the manual, make sure that you go to the right StackExchange. This is SuperUser. StackOverflow is over there.

Further reading

  • Lennart Poettering. 2011-04-12. Watchdogs. Freedesktop.org.

JdeBP

Posted 2013-12-16T10:06:42.487

Reputation: 23 855

Of course, I have to fix the issue, my only intention was to have a temporary hack till I figure out the issue. Thanks for the detailed answer. – freethinker – 2013-12-16T22:17:12.643