
Sometimes a server starts showing hardware failures that do not stop it from working but do require hands-on attention (at a remote site, it can take days for someone to reach it).

In some such cases, the server must be kept on to preserve some internal state: it cannot be powered off or restarted, but it must be put in as idle a state as possible and kept running until a technician arrives.

Currently we manually disable all services, including databases, syslog, etc. Sometimes there are several dozen of them, and we must keep track of which were on or off.

I am aware of this, but it does not help much.

Is there a way to do this programmatically, keeping a record of what was enabled so that the services can be restarted properly if the server's condition improves?

I am interested in an answer for any OS, but ideally one that also covers illumos-based OSes running on bare metal (SmartOS / OmniOS), as this is the setup we are using.

gsl

1 Answer


Running on top of a hypervisor like Xen or VMware gives you the option of taking snapshots that include RAM, and even of suspending the VM indefinitely, which achieves what you are asking for.

The problem you describe sounds like something you could avoid by taking a different approach, such as not keeping local state on the server in question. You haven't shared anything about the environment you operate in or why you need this setup, but as described it sounds overly complicated and prone to failure.

Edit

The details you give don't elaborate on the "Why".

In some such cases, the server must be kept on to preserve some internal state: it cannot be powered off or restarted, but it must be put in as idle a state as possible and kept running until a technician arrives.

Why?

If you need this to provision a replacement, this is what configuration management is for (Puppet/Ansible/CFEngine plus something like Foreman).

If you need this to continue operation after replacement (e.g. application state), try to keep it off that box if possible.
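If you do end up quiescing the box by hand anyway, the record-keeping part of the question can be scripted. The following is only a minimal sketch, assuming an illumos host where SMF's `svcs` and `svcadm` are on the PATH. The function names, the JSON record file and the `keep` list are hypothetical choices made for illustration, not part of any existing tool. It writes the record before disabling anything, and it uses `svcadm disable -t` so the disable is temporary and does not survive a reboot.

```python
import json
import subprocess


def run(cmd):
    """Run a command and return its stdout as text (raises on failure)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def online_services(svcs_output):
    """Parse `svcs -H -o fmri,state` output into a list of online FMRIs."""
    fmris = []
    for line in svcs_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "online":
            fmris.append(parts[0])
    return fmris


def quiesce(record_path, keep=(), runner=run):
    """Record online services, then temporarily disable all not matched by `keep`.

    `keep` is a hypothetical list of substrings (for example "network/ssh")
    naming services that must stay up so the box remains reachable.
    """
    online = online_services(runner(["svcs", "-H", "-o", "fmri,state"]))
    to_stop = [f for f in online if not any(k in f for k in keep)]
    # Write the record first, so it exists even if a disable fails midway.
    with open(record_path, "w") as fh:
        json.dump(to_stop, fh, indent=2)
    for fmri in to_stop:
        runner(["svcadm", "disable", "-t", fmri])
    return to_stop


def restore(record_path, runner=run):
    """Re-enable exactly the services listed in the record file."""
    with open(record_path) as fh:
        recorded = json.load(fh)
    for fmri in recorded:
        runner(["svcadm", "enable", fmri])
    return recorded
```

A call such as `quiesce("/var/tmp/quiesced.json", keep=("network/ssh",))` would stop everything except SSH, and `restore("/var/tmp/quiesced.json")` would bring the same set back. In practice the `keep` list would need more entries than this (core system and networking services at least), and should be tested on a healthy machine first.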

fuero
  • Thanks, I added a few details. As for why we need this setup: to save time and effort. Writing down all disabled services and bringing them back up, if allowed and possible, takes time and effort. – gsl Feb 14 '21 at 16:52
  • See my update. I'd strongly recommend using some form of configuration management. – fuero Feb 14 '21 at 17:01
  • Thanks, "If you need this to continue operation after replacement (e.g. application state), try to keep it off that box if possible" is sound advice. – gsl Feb 14 '21 at 17:31