
I have a bunch of services responsible for running actions consumed from a queue.

I want to be able to restart the services gently (without interrupting actions that are already running).

This can be solved by handling the SIGTERM sent by systemd and recording that the program should exit once the current action has been processed.
There's another minor problem: after the time defined as TimeoutStopSec in the service configuration file, systemd sends an additional SIGKILL to forcibly terminate my process.
I can easily avoid that by setting TimeoutStopSec=infinity. Then systemctl stop waits until the script terminates itself, which may take more than an hour, and that leads me to the main problem.
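
The SIGTERM handling described above can be sketched roughly as follows. This is a minimal illustration, not the actual sig.py: the class name GracefulWorker and the fetch_action / run_action callables are hypothetical stand-ins for the real queue consumer.

```python
import signal
import time


class GracefulWorker:
    """Runs queued actions one at a time; after SIGTERM it finishes the
    current action and then leaves the loop instead of dying mid-action."""

    def __init__(self, fetch_action, run_action):
        # Both callables are hypothetical placeholders:
        # fetch_action() returns the next action, or None if the queue is empty.
        # run_action(action) processes a single action.
        self.fetch_action = fetch_action
        self.run_action = run_action
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only record the request, never interrupt the action.
        self.stop_requested = True

    def install_handler(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self):
        processed = 0
        # The flag is checked only between actions.
        while not self.stop_requested:
            action = self.fetch_action()
            if action is None:
                time.sleep(0.1)  # queue empty, poll again shortly
                continue
            self.run_action(action)
            processed += 1
        return processed
```

A service entry point would create the worker, call install_handler() and then run(); the process exits normally once run() returns.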

I don't want systemctl command to wait until script will end

It looks like the SendSIGKILL=no setting does the job. With it, systemd retries SIGTERM after TimeoutStopSec, then starts a new worker and leaves the old one running.

journalctl log

May 06 14:14:43 jaku systemd[1]: Stopping Jaku test worker...
May 06 14:14:43 jaku python3[31597]: * 15 <frame object at 0x14d8108>
May 06 14:14:53 jaku systemd[1]: jaku-test-worker.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
May 06 14:14:53 jaku python3[31597]: * 15 <frame object at 0x14d8108>
May 06 14:15:03 jaku systemd[1]: jaku-test-worker.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
May 06 14:15:03 jaku systemd[1]: jaku-test-worker.service: Failed with result 'timeout'.
May 06 14:15:03 jaku systemd[1]: Stopped Jaku test worker.
May 06 14:15:03 jaku systemd[1]: jaku-test-worker.service: Found left-over process 31597 (python3) in control group while starting unit. Ignoring.
May 06 14:15:03 jaku systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 06 14:15:03 jaku systemd[1]: Started Jaku test worker.
jaku@jaku:/nfs/home/jaku/tmp$ ps aux | grep "sig.py"
jaku     31597 99.9  0.0  31884  9916 ?        Rs   14:00  15:10 /usr/bin/python3 /home/jaku/tmp/sig.py
jaku     32359  100  0.0  31884 10032 ?        Rs   14:15   0:43 /usr/bin/python3 /home/jaku/tmp/sig.py
jaku     32483  0.0  0.0  15968  1040 pts/7    S+   14:15   0:00 grep --color=auto sig.py

The solution looks like it's doing its job, but I'm worried about this sentence:

This usually indicates unclean termination of a previous run, or service implementation deficiencies.

Am I missing something, or is this really the best solution?

Jakub Kuszneruk

3 Answers


systemd's idea of stopping a service is that all processes associated with that unit's cgroup are terminated: it runs any ExecStop=, then sends KillSignal=, and finally, if necessary, FinalKillSignal=. Seems reasonable to me.

Your software handles SIGTERM but leaves processes alive, and the unit is configured not to send SIGKILL. systemd considers this broken, which is why the warning mentions "service implementation deficiencies": the service didn't stop.

I don't want systemctl command to wait until script will end

Then shut down within a minute or so. Users of a service don't want to wait for it to shut down; DefaultTimeoutStopSec= is probably 90s. While your service unit can increase TimeoutStopSec=, I would consider an hour an unreasonable time to wait for a thing to stop in an init script.

If you have a (synchronous) stop script, implement it as ExecStop=. If not, handle SIGTERM by starting a graceful shutdown immediately. Leave SIGKILL enabled as a last resort for stopping it.


Other ways exist to stop a service from getting work other than killing its processes. For example, removing it from a load balancer and draining load.

John Mahowald

It looks like there's no way around it; here are some related threads.
But... my assumption that "I don't want systemctl command to wait until script will end" was wrong.

I wanted this command to be short because it had to be part of a Jenkins deployment, and I didn't want the deploy process to take more than several minutes.
What I didn't know is that interrupting the systemctl command doesn't stop the process of shutting the service down, so a possible solution is:

running the systemctl command with a time limit, e.g. timeout 60 systemctl restart services-prefix-* || echo "processes will be restarted in background"
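
If the deploy step is driven from Python rather than a shell, the same idea might look like the sketch below. The helper name run_with_time_limit is hypothetical. It relies on the observation above that abandoning the systemctl client does not cancel the restart. Note that subprocess.run kills the child on timeout, whereas timeout(1) sends SIGTERM by default.

```python
import subprocess


def run_with_time_limit(command, limit_seconds):
    """Run command, waiting at most limit_seconds for it to finish.

    Returns True if the command ended within the limit (whatever its exit
    status) and False if we gave up waiting. On timeout, subprocess.run
    kills the child process it started before raising TimeoutExpired.
    """
    try:
        subprocess.run(command, timeout=limit_seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True
```

A deploy script could call run_with_time_limit(["systemctl", "restart", unit], 60) for each unit name (unit being whatever your services are called) and print a note that the restart continues in the background when it returns False.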

Now TimeoutStopSec can be set to some high value (like 10h) so that a restart cannot hang forever.

Additionally, KillMode=process must be set so that child processes are not interrupted.
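
Put together, the relevant part of the unit file might look like this. It is only a sketch: the Description and ExecStart values are taken from the log and ps output in the question, and everything else in the real unit is omitted.

```ini
[Unit]
Description=Jaku test worker

[Service]
ExecStart=/usr/bin/python3 /home/jaku/tmp/sig.py
# Give the worker a long but finite time to finish its current action
# before systemd escalates to SIGKILL.
TimeoutStopSec=10h
# Send the stop signal to the main process only, not to its children.
KillMode=process
```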

Jakub Kuszneruk

It sounds like you'd want to run systemctl with the --no-block argument:

systemctl --no-block stop service-name

--no-block

Do not synchronously wait for the requested operation to finish. If this is not specified, the job will be verified, enqueued and systemctl will wait until the unit's start-up is completed. By passing this argument, it is only verified and enqueued. This option may not be combined with --wait.

Lii