Hot clone a living Linux service

Question

We need to hot clone a Linux service while it is running. This is not because we can't reboot; it is because of our particular scenario. (Yes, I've already read the answer to "Clone a working Linux server", but my case is a little different.)

We have a calculation node, an NLP node that runs some models. When we start the node (as a service, of course), calculation is horribly slow until we have fed it requests several times. We call this warm-up.

Unfortunately, warm-up takes a long time, and our calculation may well finish before the node has warmed up.

So the question is: is there a stable way to hot clone a Linux server so that the clone keeps the node's warmed-up performance and can be brought online in a shorter time?

Would virtualising the machine, and taking a snapshot of the "warmed-up" state be any use? — TripeHound, Jan 31 '19 at 15:40
Do you understand _why_ this warm-up happens? For instance, it might be a side-effect of the file cache. But some answers to cloning machines discard the file cache, because a cache by definition can be reconstructed from the underlying original. — MSalters, Jan 31 '19 at 17:05
fork() is one way to create more processes on a given machine while saving whatever startup overhead. — Yet Another User, Jan 31 '19 at 18:45
Thanks, folks. @TripeHound, I asked a friend of mine who works at VMware, and he said it looks impossible for them to simply snapshot the "warmed-up" state, or to mirror it. @MSalters, I'm not 100% sure what happens during warm-up, but it looks like some lazy loading takes place after the service is up, once calculation jobs are invoked — chen steven, Feb 01 '19 at 02:25
@chensteven I'm no expert on VMs (especially under Linux) but my assumption was you would have to virtualise the server first (presumably while offline), then you could spin up the VM of it, let _that_ "warm-up" and then snap-shot the running VM. In the future, you would only need to restart the VM from the "warmed-up snapshot". — TripeHound, Feb 01 '19 at 14:37
Unaware of your background setup, but this smells like a situation where your server must never go down. This suggests that your host's kernel could be ancient and that updates have never been applied. Perhaps this is an indicator of a systemic design flaw that needs to be considered. — Criggie, Feb 01 '19 at 22:53
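
The fork() suggestion in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: warm_up() and serve_from_clones() are hypothetical names, the "model" is a plain dictionary, and the clones live on the same machine as the parent, so this does not help with moving a warmed-up service to another server. Libraries that hold GPU contexts or background threads may not survive a fork cleanly.

```python
import os


def warm_up():
    # Hypothetical stand-in for the expensive warm-up: build state once.
    return {n: n * n for n in range(1000)}


def serve_from_clones(state, inputs):
    """Fork one child per input; each child inherits the warmed-up state."""
    results = []
    for value in inputs:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: the warmed-up state is already in memory (copy-on-write).
            os.close(read_fd)
            os.write(write_fd, str(state[value]).encode())
            os._exit(0)
        # Parent: read the child's answer until EOF, then reap the child.
        os.close(write_fd)
        with os.fdopen(read_fd) as reader:
            results.append(int(reader.read()))
        os.waitpid(pid, 0)
    return results
```

The point of the design is that warm_up() runs once in the parent, and every forked child starts with that state already loaded instead of repeating the warm-up.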

Michael Hampton · Answer 1 · 2019-01-31T16:01:47.047

28

Maybe you can't "hot clone" a whole server (you can, but only if it's a virtual machine), but you can freeze and restore a single process, with criu, Checkpoint/Restore in Userspace.

This allows you to save the program's internal state to disk and stop the program, and later, to restore the program to that state from the saved files.

To support your desired operation, you can copy the files representing the saved program to another server, and restore it there.

criu requires a recent kernel with various features compiled in, so older Linux distributions might not work. You can run criu check on a particular machine to determine if the prerequisites for criu are present.
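
A minimal sketch of the checkpoint, copy and restore flow described above, written as small Python helpers that only build the criu command lines. The helper names, the PID and the image directory are hypothetical; the criu subcommands and long options used (check, dump, restore, --tree, --images-dir, --leave-running, --restore-detached) are standard criu options, but whether a given service dumps and restores cleanly has to be tested on your own systems. criu normally needs root privileges.

```python
import subprocess


def criu_check_cmd():
    # Verifies that the running kernel has the features criu needs.
    return ["criu", "check"]


def criu_dump_cmd(pid, images_dir, leave_running=False):
    # Checkpoint the process tree rooted at `pid` into `images_dir`.
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        # Keep the original process running after the dump instead of killing it.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir):
    # Restore from the image files and detach so the service keeps running.
    return ["criu", "restore", "--images-dir", images_dir, "--restore-detached"]


def run(cmd):
    # Executes a prepared command; raises CalledProcessError on failure.
    subprocess.run(cmd, check=True)
```

On the warmed-up node you would run the dump command, copy the image directory to the other server by whatever means you prefer, and run the restore command there. As far as I know, the target needs the same files at the same paths and the original PIDs free, so treat that as something to verify.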

edited Jan 31 '19 at 16:01

answered Jan 31 '19 at 13:50

Michael Hampton


it looks awesome and I'll do some tests on this, thanks bro – chen steven Feb 01 '19 at 02:35
From your experience, how well does this work in practice? Looking at the limitations criu lists (which are pretty much the ones I'd expect - this is a hard problem), I get the feeling this is unlikely to work with applications that weren't designed with this use case in mind. – James_pic Feb 01 '19 at 14:08
@James_pic It's been perhaps a year since I looked at it seriously, since I don't currently have a use for it. For a daemon that's just accepting connections and doing some computation (e.g. the OP's machine learning job, or a web server) it works pretty well. – Michael Hampton Feb 01 '19 at 14:23

score 12 · Answer 2 · answered Jan 31 '19 at 15:36

It may be a bit out of scope for your current environment, but the industry-standard way of doing this is to virtualize your server. Many virtualization hosts (VMware, VirtualBox, etc.) allow "snapshots" that save the state of a server, which can then be cloned into new instances. These new instances will have exactly the same state as the original, down to running processes. Of course, you'll want to make sure that the software you're running will still perform correctly in a virtual environment (CUDA/GPU calculation springs to mind).

answered Jan 31 '19 at 15:36

cawwot


Virtualization is great, until the software (or its dependencies) requires an update, and does not provide a graceful reload mechanism. A VM snapshot or live migration is running the old code. – John Mahowald Jan 31 '19 at 21:30
Either is acceptable to me, running the project on a "real" machine or on a virtualization host, and there are several ways we can handle the "old" code issue, maybe A/B testing or rolling updates, etc. But are you sure a snapshot can completely clone the warmed-up state of my working node? – chen steven Feb 01 '19 at 02:38
When you "live-migrate" a machine, it needs to be paused. While it is paused, its memory is copied 1:1 to another machine in a cluster, where it is unpaused -- intact. This can take some time depending on how much memory is in use, and how fast the network fabric is. You may be able to use this method if the amount of downtime it invokes is low enough for your needs. – Spooler Feb 01 '19 at 03:24
@chensteven I've most recently come from a VirtualBox environment. That was some time ago, but from what I remember a running snapshot contains the exact state of the VM at the time the snapshot was taken, including running processes and the contents of the memory. This snapshot can then be cloned to a new VM, giving you two machines in exactly the same state. – cawwot Feb 01 '19 at 19:01

score 3 · Answer 3 · edited Jan 31 '19 at 23:54

The question you mention refers to a link, http://www.linuxfocus.org/English/March2005/article370.shtml, which describes all the ways I could think of to do what you ask.

The fact that these options exist says little about whether they suit what is running on the server. You have to consider that any file that changes during the cloning process could end up inconsistent on the target machine. The post you link to talks about databases, and cloning one like that gives no assurance of data integrity.

It is not exactly clear what you meant by "until we feed it several times".

But if I understood your question correctly, you have to consider that cloning a system takes time, both to copy the data and to set up resources.

To run an "ON/OFF", or better, an active/backup environment, the server has to be properly configured in a cluster.

I'm sorry if this is not the answer you expected, but those are the options you have.

It's my fault for confusing you here. The "feed" part means that, after my service starts up, we need to invoke calculation tasks several times to make sure the node is "warmed up" to top performance. So the problem is more like dynamic cloning or expansion of our live jobs: when a large number of requests hits our system, we won't have enough time to set up new calculation nodes to handle them (the warm-up takes too long). You know, it's like waves coming in — chen steven, Feb 01 '19 at 02:45

score 1 · Answer 4 · answered Feb 01 '19 at 18:40

There are many potential issues with what you are trying to do, and of course as you know it would be best to take the server offline and clone it while no data is being dynamically stored.

However, what you seek to do is entirely plausible, as I have done it before. If you use dd you can clone the full server at the block level to another drive or another server. It will however take some additional setup on the new server, and you probably won't be able to simply turn the other off and the new one on. For us to understand this, we need to know a few things about your server hardware and software.

Firstly, in order to determine the best data strategy, it would be helpful to know what is updating regularly. Do you have an SQL server which is dynamically updating, but otherwise static content? Alternatively, do you have a team of developers using a version control system like git to send constant updates to your content? What is updating will determine the best overall course of action.

If for example, it is only the SQL which is updating regularly, then you can migrate to a new server while that server is live in the following manner:

Use dd to clone all data to the new server.
Start setting up the new server. It may take some work, especially if the hardware is different, but it may still be faster than setting up from scratch.
It may also take some DNS changes, since the same DNS name can't point at the second server while you need to work on it live and the first server is still live.
After the new server is complete and running independently, take a final backup of the SQL server on the original server, and import it into the new server.
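
A minimal sketch of the first step above, assuming the copy is streamed over ssh. The device paths and the host name root@new-server are hypothetical placeholders, and the helper only builds the command line so it can be reviewed before anything is run. Keep in mind that imaging a mounted, live filesystem can produce an inconsistent copy, which is why the start of this answer recommends cloning offline where possible.

```python
import shlex


def dd_over_ssh(source_dev, target_host, target_dev, block_size="4M"):
    """Build a shell pipeline that streams a block device to a remote host."""
    # Command executed on the new server: write the incoming stream to disk.
    remote = f"dd of={shlex.quote(target_dev)} bs={block_size}"
    # Local side: read the source device and pipe it through ssh.
    return (
        f"dd if={shlex.quote(source_dev)} bs={block_size} status=progress"
        f" | ssh {shlex.quote(target_host)} {shlex.quote(remote)}"
    )


if __name__ == "__main__":
    # Print the command for inspection; nothing is copied by this script.
    print(dd_over_ssh("/dev/sda", "root@new-server", "/dev/sda"))
```

Double-check the target device before running the printed command, since dd overwrites it without asking.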

You may need to take your original server offline temporarily to ensure that you don't miss any data. Alternatively, for zero downtime, you could make the second server live, point the DNS to it, and then update any DNS entries manually on the new server. This is more hassle than accepting a few minutes of downtime to back up the SQL database and restore it to the new server, but it may be necessary if you cannot afford any downtime.

This of course is only one use case example, and depending on your configuration and several variables, you may need to create your own strategy for the migration based on your specific case.

The other issue is the server hardware configuration. Is the new server 100% identical in hardware to the old server? If so, then the setup is easier. If, on the other hand, it is a completely different hardware configuration, then you may need a different strategy: set up the second server ahead of time, then back up all your data and SQL databases on the first server and manually migrate them over, changing configuration as desired.

Server migration is by no means trivial, and in order to have a successful move, you need to have deep knowledge of servers, or staff on hand who do. In any case, it is highly recommended that you immediately take a full backup and store it in a third location, even on your local computer, so that if the worst-case scenario happens (both servers crash and die irreparably), you still have another copy of your data to rebuild your servers with.

Hope this helps, and good luck with your server move!
