
We're using systemd to run various services in production. (Duh...)

We're building out a matching "disaster-recovery" site, which will have the same application installed -- with the same systemd-units to bring up its various components in case of a disaster.

This DR-environment is "hot", ready to take over within a short time (the shorter the better) -- thus becoming production itself. Then, when the "disaster" is resolved, the other environment is to become the DR.

My question is: how do I keep those systemd-services ready-to-start, but not actually starting until a certain condition becomes true?

In order to conclude that a particular site is currently the primary (production), a command (amIthePrimary) needs to run and exit with a 0 exit code. The check is easy and fast -- and can be performed as often as once a minute. But, because it requires running a command, there is no systemd Condition for it.

Do I put that command into every unit's ExecStartPre, or will that become a noisy error, needlessly annoying the administrators? Do I put it into a unit of its own, with all other services Require-ing it?

Separately, once the condition is true -- and the services start -- how do I continue checking it, so they will all shut down, should it become false again?

Mikhail T.
  • A lot of services which have to start only right as you start needing them? Sounds like directly addressing reasons for not keeping those services running might result in a more reliable overall setup. I sure do like my services up and running, not just theoretically but verifiably (monitored just as the primary) standing by. – anx Apr 21 '21 at 16:12
  • Upon starting, they will all attempt to talk to the database -- and fail, because it is in "replication" mode... We want to keep them down until the DB becomes "PRIMARY". – Mikhail T. Apr 21 '21 at 18:53
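For illustration, if the primary/standby decision follows the database's replication role (as the comment above suggests), an amIthePrimary check could be a thin wrapper like the sketch below -- assuming PostgreSQL here; the actual database, query, and connection details are site-specific:

#!/bin/bash
# Hypothetical amIthePrimary sketch (assumes PostgreSQL on the local host).
# Exit 0 only when the local database reports it is NOT in recovery,
# i.e. it is currently the primary/writable instance.
role=$(psql -qtAX -c "SELECT pg_is_in_recovery();" 2>/dev/null) || exit 1
[ "$role" = "f" ]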

1 Answer


As your use-case is pretty custom and your needs might change in the future, why don't you do something like the following...

Create a new systemd timer (e.g. failover-manager) on both machines that fires once per minute. The timer starts an associated one-shot systemd service at each interval.

That one-shot systemd service can just run a bash script that contains your logic:

  • run your amIthePrimary check
  • if primary, start your systemd service and check that it started without error.
  • if not primary, stop your systemd service if it is running.
  • if your script is unable to start/stop/verify the service, it should fail (monitorable); otherwise it succeeds.
  • this timer script doesn't need to output anything (noise-wise) except when it makes a change, i.e.:
    • "I am primary, but service not running. Starting... Waiting a few seconds... verified service is running without error." or
    • "I am not the primary but the service IS running. Stopping."

This way you can always monitor that your regular timer check is running without errors. If your timer service has issues (non-zero exit), you can catch this with monitoring. You can monitor failures of your main application service separately.

If your needs change you can easily adapt your timer script or the frequency it runs at.

There are probably cleaner ways to do this, but they would likely depend on an event being generated by whatever is behind your amIthePrimary check...and you didn't provide any details on that. I.e. event-driven failover rather than polling.

You could also put your amIthePrimary check into ExecStartPre=..., but when the check fails and prevents the service from starting, the service ends up in a FAILED state. That may confuse your monitoring, because it isn't a "something broke" failure but an intentional one. So you might prefer the timer approach, because then you can monitor the timer process and the main service process separately: the timer should always be running, active and not failing, and your service (if running) should never be in a failed state, or monitoring should go off. There is a separate question of how monitoring knows whether the service should be running at all, but that is beyond the scope of the question.
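For reference, the ExecStartPre= variant discussed above would look roughly like this in the application's own unit (a sketch; the unit description and paths are placeholders):

[Unit]
Description=My application (placeholder)

[Service]
# If the check exits non-zero, systemd aborts the start and marks the unit
# failed -- which is exactly the monitoring noise discussed above.
ExecStartPre=/usr/local/bin/amIthePrimary
ExecStart=/usr/local/bin/my-application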

Update - including sample implementation example

Untested, but just to make my suggestion clearer.

failover-manager.sh

Let's say this script is deployed to /opt/failover-manager/failover-manager.sh

#!/bin/bash

# expected ENV.  Provided by the service that starts this script.
#
# APP_SERVICE (your main application)
# SECONDS_TO_START (e.g. some java apps start very slowly)

if [ -z "$APP_SERVICE" ] || [ -z "$SECONDS_TO_START" ]; then
    echo "Missing required environment: APP_SERVICE and/or SECONDS_TO_START"
    exit 1
fi

function is_running {
    systemctl is-active --quiet "$1"
}

if amIthePrimary; then
    if is_running "$APP_SERVICE"; then   # no change, no log
        exit 0
    else
        echo "I AM primary, but service NOT running.  STARTING..."
        systemctl start "$APP_SERVICE"
        sleep "$SECONDS_TO_START"
        if is_running "$APP_SERVICE"; then
            echo "Verified service is STARTED without error: $APP_SERVICE."
            exit 0
        else
            echo "Service $APP_SERVICE has not yet STARTED after $SECONDS_TO_START seconds."
            exit 1
        fi
    fi
else
    if is_running "$APP_SERVICE"; then
        echo "I am NOT primary, but service IS running.  Stopping..."
        systemctl stop "$APP_SERVICE"
        sleep "$SECONDS_TO_START"
        if is_running "$APP_SERVICE"; then
            echo "Service $APP_SERVICE has not yet STOPPED after $SECONDS_TO_START seconds."
            exit 1
        else
            echo "Verified service is STOPPED: $APP_SERVICE."
            exit 0
        fi
    else   # no change, no log
        exit 0
    fi
fi

failover-manager.timer

[Unit]
Description=Timer that starts failover-manager.service
Requires=failover-manager.service

[Timer]
Unit=failover-manager.service
# every 1 minute
OnCalendar=*:0/1
AccuracySec=1s
Persistent=true


[Install]
WantedBy=timers.target

failover-manager.service

This guy is run by the timer above.

[Unit]
Description=Checks if we need to start or stop our application.

[Service]
Type=oneshot
Environment=APP_SERVICE="my-application.service" SECONDS_TO_START="5"    
WorkingDirectory=/opt/failover-manager/
ExecStart=/opt/failover-manager/failover-manager.sh

User=root
Group=root
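To wire this up on each machine (a sketch, assuming the file locations above):

chmod +x /opt/failover-manager/failover-manager.sh
systemctl daemon-reload
systemctl enable --now failover-manager.timer
# confirm the timer is scheduled and when it fires next
systemctl list-timers failover-manager.timer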

pure systemd options?

If you are looking for a pure systemd mechanism to accomplish this in a clean way, it may not be possible.

Your use-case is custom and IMO beyond the scope of systemd.

So you can "hack" it in using ExecStartPre or using requires/wants type dependency mechanisms...but all those approaches depend on a process either being in stopped state due to failure (breaks monitoring...is it an intended failure or something broken failure)... or that process being started/stopped by "something" that is aware of something outside the systemd world. The latter doesn't break monitoring but does require something beyond systemd and what I proposed is one way to do that.

alternatives

Like @anx suggested... perhaps re-engineering how your DR failover works.

This is also the approach we take. If we have a standby box/cloud/rack/etc., then we like to make sure everything is already running (services, etc.).

Then the question is just... how to make the switch-over.

There are two common ways fail-over to a standby endpoint can be accomplished...

1 - DNS failover

Set a low DNS TTL (cache time) on your critical endpoints, and update your DNS records (e.g. CNAME, A, AAAA) to point at the standby endpoint when a failure is detected.
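As a sketch with hypothetical names, the zone entry for the critical endpoint might look like this, with failover amounting to a change of the CNAME target:

; low TTL (60s) so a record change propagates quickly
app.example.com.    60    IN    CNAME    site-a.example.com.
; during failover, repoint to the DR site:
; app.example.com.  60    IN    CNAME    site-b.example.com.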

Many managed DNS providers (e.g. dnsmadeeasy, dynect) offer this as part of their service (detection and fail-over). But of course you can implement this with your own DNS or any DNS provider that lets you set a low TTL and update your DNS records easily, either manually or automatically (monitoring + DNS API).

One potential issue here is that you may worry about bots making requests to the "non-active" endpoint. It will definitely happen but if your application is well designed it won't break anything to have a few requests coming into the standby DR endpoint.

The good thing is this forces you to think about how to make your application architecture more robust in terms of multiple concurrent endpoints receiving traffic (sharing databases, replication, etc).

If it is a big deal, you can potentially add iptables rules to manage this...but then you may have the same problem as before: how to trigger the change (because now both DNS and iptables need to change for the failover to happen).

2 - load balancer failover

It is quite common to have standby servers that are not active in a load balancer and can be quickly added/swapped into the pool of active servers behind the load balancer.

In this case the load balancer, or a third component, can manage the health checks and update the load-balancer config to swap healthy servers in for unhealthy ones.
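As one hypothetical illustration, nginx can keep a standby in the upstream pool marked as backup, so it only receives traffic when the active server is unavailable:

upstream app_backend {
    server 192.0.2.10:8080;          # active
    server 192.0.2.20:8080 backup;   # standby, used only if the active server fails
}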

This doesn't work as well for a DR case, as load balancers are usually rack- or datacenter-local. So for DR you are probably better off building on a DNS-based fail-over to a different data-center/region.

mattpr
  • How would that timer cause the services to start upon detecting a state-change? Will it need to invoke commands directly, or can those other services depend on it somehow? Yes, I didn't like using the `ExecStartPre` -- because of the noise it will be generating 24x7... – Mikhail T. Nov 20 '21 at 05:14
  • The timer (and its associated one-shot service unit...that is run by the timer unit at the specified interval) are (from a systemd perspective) completely unrelated to your application service unit. The connection is that the timer's one-shot service is running a bash (or other) script that implements the logic you want (as described in my answer)...and within that logic is starting or stopping your application's service as appropriate. This allows your disaster-recovery controlling timer/service to be monitored separately from your application service unit. – mattpr Nov 20 '21 at 10:59
  • So the timer-fired command line will have to kick off all of real services explicitly, eh? I'm wondering, if it should just create (or delete) a file on the filesystem -- and all of the services will depend on that file's presence (or absence). – Mikhail T. Nov 21 '21 at 23:05
  • Depends. I try to think in terms of edge-cases. With this separate timer process...it will fail if the application should (not) be running but is not able to start (stop)... which can trigger a monitoring alert letting you know your failover-manager is having trouble. Separately you can monitor your application service. If its state is failed (rather than stopped), then that is a problem. So you are covered. If you put a file on the filesystem and try to make things depend on that file's presence...how exactly do you monitor that the system is behaving as expected? – mattpr Nov 22 '21 at 08:51
  • No, no, my follow-up question is different... Suppose this separate time process finds a state change -- say, it decides, the services need to start. _How_ will it effect that? Will it call `systemctl enable`, for example, or will it cause a removal (or creation) of some file, that's listed (using `ConditionPathExists`) as dependency for all those services? – Mikhail T. Nov 22 '21 at 19:50
  • Did you see that I updated my answer the other day to add a sample implementation? In `failover-manager.sh` you can see I am directly starting the service and verifying it started after some time, or I error. Of course you can `systemctl enable` it as well if that makes sense...it is just a sample. A couple of things bother me about file-on-disk... other processes could add/remove that magic file without you knowing, and the timer is happy while your service isn't running...so monitoring never notices. What does your standby service state look like when the file is not there? Always failed? – mattpr Nov 22 '21 at 20:19
  • Yes, I just noticed your update, thank you. I haven't used the `ConditionPathExists` myself yet, but I'd expect a service waiting a file using this mechanism to _not_ be "failed". – Mikhail T. Nov 22 '21 at 20:32
  • Yeah, I wasn't saying it wouldn't work. Just that there are some questions there. I don't have experience with ConditionPathExists but here is an answer related to that: https://serverfault.com/questions/767415/start-systemd-service-conditionally A comment points out that this condition is an "if", not a "while"...i.e. it doesn't wait for the condition and then start; it just doesn't start if the condition isn't true when starting. Most of my concerns are with the ability to reliably monitor that the various components are doing their jobs and are in the right state. – mattpr Nov 22 '21 at 21:25
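For completeness, the flag-file variant discussed in these comments would look roughly like this (a sketch; /run/site-is-primary and the drop-in path are made-up names). Keep in mind that ConditionPathExists is only evaluated when a start is attempted, so the timer script would still need to run systemctl start/stop itself in addition to creating or removing the flag:

# hypothetical drop-in: /etc/systemd/system/my-application.service.d/primary-only.conf
[Unit]
# A start attempt is skipped (condition not met), not failed, while the flag is absent.
ConditionPathExists=/run/site-is-primary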