
Any ideas on how to provide complete failover for a SLES 11 installation without implementing clusters or VMs?

The applications that are running are not (and will never be) cluster-aware. There are no incoming client connections - this is not a web server. This server transforms data and produces output on disk while simultaneously driving production printers at high speed (via 1GbE, 10GbE, and/or InfiniBand).

The current solution is 2 identical servers, with one of them attached to (and booted from) external storage. In the event of primary server failure, the external storage is disconnected from the failed server and connected to the second server, which is then booted from this storage. This works fine, except it is not automatic - and if both servers are ever attached and powered on at once, the OS is destroyed.

I've been looking for a solution that allows both servers to be permanently cabled to the external storage, with some kind of automatic mechanism to remove power from a failed server and restore power to the backup server in the event of failure. This 'automatic mechanism' cannot involve a 3rd server - that just increases cost and complexity.

Clusters won't work, as licensing will not allow these applications to 'move'; plus, no matter what anyone says, clusters are not simple, and they require a dedicated admin to monitor them - you cannot install one at a customer location and just leave (try supporting this at 200 customer locations). VMs won't work, as dedicated NICs are required for communication with output devices (so no vMotion), plus the 15-25% performance hit is a non-starter.

The servers are HP ProLiant - it seems the obvious solution would be for HP to implement this functionality in iLO: have the BIOS monitor for a server fault (the BIOS can already do this and be set to automatically reboot the server), and then, via iLO, control the power of both servers for failover. That is: the running server detects a fault, iLO wakes up the failover server, then iLO removes power from the failed server - and if iLO cannot confirm power is removed from the failed server, it shuts down the failover instead. iLO can't do this today, and there may be no way to even guarantee that it would work.
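For what it's worth, something along these lines can be scripted from the standby side today, since iLO speaks IPMI over LAN. This is only a hedged sketch of the fencing logic described above - the hostnames, the `admin` user, and the password file `/etc/ilo.pass` are all made-up assumptions, and it does nothing about the storage handover itself:

```shell
#!/bin/sh
# Hedged sketch only - not a supported HP feature. Assumes the primary's
# iLO has IPMI-over-LAN enabled; hostnames and credentials are made up.
PRIMARY=${PRIMARY:-prod-server}
PRIMARY_ILO=${PRIMARY_ILO:-prod-ilo}

ilo() {
    # Talk to the failed server's iLO via IPMI over LAN
    ipmitool -I lanplus -H "$PRIMARY_ILO" -U admin -f /etc/ilo.pass "$@"
}

fence_primary() {
    # Remove power from the failed server and confirm it is really off;
    # return non-zero if that cannot be confirmed.
    ilo chassis power off || return 1
    ilo chassis power status | grep -q 'is off'
}

primary_alive() {
    ping -c 1 -W 2 "$PRIMARY" >/dev/null 2>&1
}

monitor() {
    fails=0
    while sleep 5; do
        if primary_alive; then
            fails=0
            continue
        fi
        fails=$((fails + 1))
        if [ "$fails" -ge 3 ]; then
            if fence_primary; then
                echo "primary fenced - safe to take over storage and boot"
            else
                # Mirror of the rule above: if power cannot be removed,
                # do NOT take over, or the shared OS gets destroyed.
                echo "fencing failed - NOT taking over" >&2
            fi
            return
        fi
    done
}

# Run the loop only when invoked with 'run', so the functions can be
# sourced and tested without side effects.
if [ "${1:-}" = run ]; then monitor; fi
```

The key property is that the standby refuses to proceed unless the power-off is positively confirmed - the same rule the iLO idea above depends on.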

Another possibility (already implemented) requires 2 running licensed servers containing a mirror of all output data - if the 1st server goes down, the 2nd server can complete the output work. But this requires 2 installations at double the licensing cost (many, many $1000s). Is there any other way? Remember, I'm looking for a way to do this without requiring the use of any additional servers.

I've tested the NEC fault-tolerant servers (this is the hardware that Stratus and Winchester offer), but they don't support native SLES (you have to use vSphere), and they only maintain consistency on their internal RAID 10 disks - of which there aren't enough drives per server. My current solution uses external storage - HP D2700 enclosures (up to 8 of them), each with up to 25 2.5" SAS drives, cabled (only one server at a time) to a Smart Array card in each server - because I need 400-500 MB/s sequential read speed. That's why there is no SAN or NAS in this config.

Edit:
Let me try to simplify my question.

Would it be possible to provide automatic failover of a single SLES 11 server (all data, all applications, all services) to another identical server (a cold spare), with only a single OS to install and maintain? The idea is to keep the performance that dedicated hardware provides, but to have a plan for when that hardware (memory, CPU, power supply, etc.) fails.

Is this possible without clusters, VMs, DRBD, Linux-HA, etc.? It can be done today with direct-attached external storage (dual-port, RAID 10), but someone has to power off the failed server, move the cable to the failover server, and turn it on. Anybody (with no training) can support this. But it isn't automatic.

Chris S
steve
    Pssst... http://en.wikipedia.org/wiki/Enter_key http://en.wikipedia.org/wiki/Paragraph – ceejayoz Sep 24 '13 at 19:04
    "This 'automatic mechanism' cannot involve a 3rd server - that just increases cost and complexity." Increased uptime generally involves increased cost and complexity. – ceejayoz Sep 24 '13 at 19:06
    Please fix the formatting of this. Without reading that large block... keepalived comes to mind. – Daniel Widrick Sep 24 '13 at 19:20
    -1 for complete lack of effort in making this even somewhat readable. Do you want help from us or not? – EEAA Sep 24 '13 at 19:29
    Can you explain your licensing problem? What you want is a cold standby failover cluster. Even if you do not want to name it a cluster. – Nils Sep 24 '13 at 19:58
    The licensing contains many components which are 'checked out' from a USB dongle. Unless the components are 'returned', they cannot be used elsewhere. That restricts me to one installation of the OS (which eliminates a cold standby failover cluster, I think), unless multiples of all components are licensed (multiple times more expensive than a couple of servers). – steve Sep 25 '13 at 13:45
    Pressing the 'enter' key submits my comments, it does not provide a new line. Licensing contains many components which are 'checked out' from a USB dongle. Unless the components are 'returned', they cannot be used elsewhere. For that and other reasons, I am restricted to one installation of the OS (which eliminates a cold standby failover cluster, I think). The 'other reasons' are mainly complexity - only 1 OS to update (HP drivers, patches, application software, etc.). I understand that what I am looking for may not be possible - I just want to be certain. – steve Sep 25 '13 at 13:52
    A bunch of scripts + a replicating or cluster filesystem like Ceph may work. – hookenz Sep 26 '13 at 02:17

2 Answers


The easiest commercial solution for what you're looking for is SIOS LifeKeeper. See: http://us.sios.com

I've used it in situations where I had applications that were not cluster-aware but needed some level of resiliency.

SIOS will handle your VIPs, application startup/failover/shutdown, storage (network RAID or shared) and any custom scripts you may need. Easy to work with. Roughly $2k per server.

ewwhite

First, I see no reason why the OS should be "destroyed".

The only problem you are IMHO facing is that you will have different MAC addresses on the two servers.

If they are the same model with identical PCIe-slot assignment and network hardware, there is an easy solution to this problem:

Change the NIC-name-bindings in udev from MAC-based to PCIe-slot-based.

The place to change this on SLES 11 is /etc/udev/rules.d/70-persistent-net.rules.

Example - assuming ethtool -i eth0 reports 0000:04:00.0 as the PCIe bus location (the bus-info field):

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", KERNELS=="0000:04:00.0", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

You do not have to change anything in /etc/sysconfig/network/ifcfg-* for this.
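If you do not want to edit the rule by hand, the PCIe address can be read from the bus-info field of ethtool -i and the rule line generated from it. A small sketch - the helper names here are just for illustration:

```shell
#!/bin/sh
# Sketch: emit a PCIe-slot-based udev rule for a NIC.
# Function names are illustrative only.

bus_info() {
    # PCIe address of interface $1, e.g. 0000:04:00.0
    ethtool -i "$1" | awk '/^bus-info:/ { print $2 }'
}

pci_rule() {
    # $1 = PCIe address, $2 = interface name to pin to that slot
    printf 'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", KERNELS=="%s", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="%s"\n' "$1" "$2"
}

# Example (run once per NIC, on the server whose slot layout you trust):
# pci_rule "$(bus_info eth0)" eth0 >> /etc/udev/rules.d/70-persistent-net.rules
```

Since both servers have identical slot assignment, the same generated rules file works unchanged on either machine.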

Nils