How to avoid downtime with Linux?

Question

Software updates to Ubuntu frequently require reboots (which can have side effects such as downtime).

I see Ubuntu has Livepatch (https://www.ubuntu.com/livepatch), which allows for kernel updates without reboots; however, this is a paid service. There is also Ksplice.

Are there Linux distributions/processes where upgrades/patches never require reboots?

(I know setting up high availability (HA) servers and having disposable servers are best practices, so I'm not asking about keeping a service up, but about keeping actual servers up.)

Would an air-gapped server work as a machine that never needs rebooting? After all, if no one can access it, you never need to reboot it? ;) -- For example, a monitoring server at a nuclear power plant that simply sounds an alarm if something is wrong. (Yes, I'm aware this would likely be a dedicated system rather than a random server, but I'm using the example just to make the point that there are occasions when rebooting for 'security updates' may be an entirely fastidious idea.) — djsmiley2kStaysInside, Feb 03 '19 at 14:40
@djsmiley2k That's one of those cases where a machine that you never reboot still doesn't give you sufficient availability. Instead you need redundancy. — kasperd, Feb 03 '19 at 15:13
@djsmiley2k My answer to the question already argues why I consider a cluster of machines that are rebooted one at a time to be more reliable than one which you never reboot. — kasperd, Feb 03 '19 at 17:36
What makes you think avoiding individual system downtime is preferable? — warren, Feb 04 '19 at 20:17
@warren That is to be determined. While it seems this may not be feasible/possible, if it were, it would probably be better than the status quo. — user75126, Feb 05 '19 at 13:22
progress.. https://aws.amazon.com/blogs/apn/live-patching-linux-kernel-vulnerabilities-with-aws-and-kernelcare/ — user75126, Jul 10 '19 at 01:45

kasperd · Answer 1 · 2019-02-02T16:33:41.560

There is an important distinction between making a service highly available and making an individual machine highly available.

In most cases the goal is to make the service highly available, and availability of individual machines is only a means toward achieving that goal. However, there is a limit to how far toward the goal you can get by improving the availability of individual machines.

Even if you could eliminate all downtime caused by software updates, individual machines would still not be 100% available. Thus, to raise the availability of the service above that of individual machines, you have to design redundancy at a higher level. The last sentence of your question shows that, at least in principle, you know this.
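
To put rough numbers on this, here is a minimal sketch. It assumes machine failures are independent (real outages often are not) and that the service stays up as long as at least one replica is up. The 99.9% figure is an illustrative assumption, not a measurement.

```python
def service_availability(machine_availability: float, replicas: int) -> float:
    """Availability of a service that is up while at least one of
    `replicas` independent machines is up."""
    if not 0.0 <= machine_availability <= 1.0:
        raise ValueError("machine_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - machine_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical machines that are each up 99.9% of the time.
    for n in (1, 2, 3):
        print(n, service_availability(0.999, n))
```

Under these assumptions, a second replica does more for the service than shaving reboot time off a single machine could.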

If you do design a service to be more available than individual machines can deliver, there is no longer pressure to achieve high availability of individual machines. Thus, for highly available services there is no need to avoid reboots. Instead, you can sacrifice some reliability of individual machines to make savings, which can be put toward other areas where you can get much higher gains in reliability.

Once the high-level system is designed to be reliable when individual hardware components fail, live patching of kernels changes from being an advantage to being a risk.

It's a risk because there can be subtle differences between the behavior of a machine that was live patched and a machine that was booted with the newest kernel version. This can introduce a latent bug that causes an outage the next time a machine is rebooted. The risk is amplified because rebooting to get a clean slate is itself seen as a way to mitigate some outages.

One day you could have an outage where you think rebooting the machine might help. But as you reboot, you are hit by the latent bug, which prevents the machine from coming back in the desired state. Live patching is not the only way such a latent bug can arise. It could just as well be caused by something as mundane as a service having been started manually and never configured to start during boot, or having been configured to start too early so that it fails to come up due to unsatisfied dependencies.
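
The "running now but not set to start at boot" case can be checked for ahead of time. The following is only a sketch: `boot_drift` is a hypothetical helper, and the two inputs are assumed to be lists of unit names you have gathered yourself, for example from `systemctl list-units --type=service --state=running` and `systemctl list-unit-files --type=service --state=enabled`. Units that are started indirectly (static or socket-activated units, for instance) would show up as false positives and need manual review.

```python
from typing import Iterable, Set


def boot_drift(running: Iterable[str], enabled: Iterable[str]) -> Set[str]:
    """Return units that are running now but are not enabled to start at boot.

    Both arguments are plain unit names such as "nginx.service".
    """
    return set(running) - set(enabled)


if __name__ == "__main__":
    # Illustrative data only; replace with names collected from the machine.
    running_now = ["ssh.service", "nginx.service", "myapp.service"]
    enabled_at_boot = ["ssh.service", "nginx.service"]
    for unit in sorted(boot_drift(running_now, enabled_at_boot)):
        print(f"{unit} is running but would not come back after a reboot")
```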

For those reasons, a highly available service may actually be easier to achieve with regular reboots of individual machines, at a rate slow enough that you can detect problems and pause the sequence of reboots when problems do occur.
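
As a rough illustration of such a sequence, here is a minimal sketch. `rolling_reboot` and its parameters are hypothetical; the `reboot` and `is_healthy` callables stand in for whatever tooling you actually use (SSH, an orchestration API, a monitoring check), and the ten-minute default pause is an arbitrary assumption.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_reboot(
    hosts: Iterable[str],
    reboot: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    pause_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], Optional[str]]:
    """Reboot hosts one at a time and stop at the first unhealthy one.

    Returns (hosts that came back healthy, host that failed or None).
    """
    rebooted_ok: List[str] = []
    for host in hosts:
        if rebooted_ok:
            # Wait between hosts so that problems have time to surface.
            sleep(pause_seconds)
        reboot(host)
        if not is_healthy(host):
            # Pause the whole sequence; the remaining hosts stay untouched.
            return rebooted_ok, host
        rebooted_ok.append(host)
    return rebooted_ok, None
```

Stopping at the first failure means a latent boot problem costs you one machine rather than the whole cluster.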

I liked your description of the risk: "patched vs booted with the newest kernel". However, you didn't answer my question, which I could rephrase as: are there Linux distros which ship with 'livepatch' out of the box? — user75126, Feb 02 '19 at 18:39
@user75126 I see it as a feature which is more appropriate for client machines than for servers. Moreover asking which distributions support it sounds like a product recommendation question. To me that sounds like two reasons why rephrasing the question like that would make it off-topic for this site. — kasperd, Feb 02 '19 at 19:20
maybe. no?. let's see if the new question title gets the response I want :) — user75126, Feb 02 '19 at 21:01
@user75126 Oracle's Ksplice has a free trial, and a free tier for Ubuntu and Fedora desktops (only, but they don't really enforce this). The problem is that creating the live patches is difficult to automate, and even the parts that can be automated are also time consuming. Creating these patches is a relatively labor intensive operation, and it's reasonable for companies to charge for that. I looked into what it would take to create the live patches myself, and noped right out of there. I haven't got that kind of time in my day. — Michael Hampton, Feb 02 '19 at 21:55
@user75126 It's really bad practice on this site to change the question title and body in a way that invalidates an existing answer. If you wanted to ask a different question, then ask a different question. — Greg Schmit, Feb 03 '19 at 02:26
@MichaelHampton, you should submit your comment as an answer - it's the best response so far. — user75126, Feb 03 '19 at 22:32
@user75126 Thanks. I read your question, and I didn't think it was really an answer to it. I was merely commenting on why this is a paid service. — Michael Hampton, Feb 04 '19 at 00:28

Paul Gear · Accepted Answer · 2019-02-06T22:57:19.970

To your question, "Are there Linux distributions/processes where upgrades/patches never require reboots?", I'm not aware of any, and I'm highly doubtful that there ever will be any which are truly reboot-free. In addition to Michael Hampton's comment about why live patching is not an out-of-the-box experience anywhere, live patching also doesn't achieve the same result as rebooting.

An anecdote to illustrate this: I recently investigated a problem where one particular utility had started segfaulting on a large number of machines. I tried looking at the shared libraries which it used to see if anything recently upgraded had broken it; ldd said it wasn't an executable (even though when I pulled the same binary down to my laptop, ldd could see the shared library dependencies just fine). I tried stepping through it in gdb; it segfaulted before it even got to the first instruction.

Looking at the timing of the fault, I found that a Ksplice patch had been recently applied. I backed out the patch and the binary didn't segfault, then added it back in, and it started segfaulting again. Rebooting onto an equivalently patched kernel worked fine. It turned out to be a patch for 32-bit support which the Ksplice folks had not applied quite correctly. To their credit, they issued a fixed patch within a few hours and it was back to working correctly on our fleet without intervention.

Another example: the Meltdown/Spectre patches were so invasive that the Ubuntu kernel team decided that live patching was impractical and required people to reboot their systems into the fixed kernel before receiving live patches again.

We run a large fleet of physical and virtual servers at work, with a large number of both Ksplice and Canonical Livepatch systems. They've both been far more reliable than a lot of other software, but I would still rather see our services designed with a reboot-friendly architecture than rely on kernel live patching.
