
I have several cases where I need applications to be migrated from one server to another in the event of a failure (server hang or crash).

On Solaris we do this with VCS (Veritas Cluster Server). What options are available for Linux?

Please indicate level of effort to setup/maintain or cost (if any) for each.

-- More details added --

To give an idea of the complexity level:

  • the failing server could hang or crash without notice, and may still be 'ping-able'
  • the recovery server needs to start up its applications on failover
  • once the failing server boots/power-cycles, it becomes passive so as not to interfere with the recovery server

This is a data collection or compute node, not a database, so simpler solutions could work.

-- even more details (sorry) --

Shared storage is not an option, but not much state (if any) needs to migrate from one server to the other. We keep the two servers in sync via rsync.
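
For context, the sync is nothing fancier than a periodic rsync of the application and state directories, along these lines (paths and the standby host name are illustrative):

```
# illustrative only: push the app and its (small) state to the standby
rsync -az --delete /opt/app/ standby:/opt/app/
rsync -az --delete /var/lib/app-state/ standby:/var/lib/app-state/
```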

Thank you very much for all the posts so far.

ericslaw

11 Answers


http://linux-ha.org/ for all your high-availability needs. Like the song says, the best things in life are free.

womble

I have used a variety of cluster solutions on Linux. I'm also a configuration management proponent, so I'll add a bit about that (Chef or Puppet, that is) in each description.

Veritas Cluster Server (VCS). It's been a while, but we deployed a few Linux VCS clusters on RHEL 3.0. I would hope it's available on RHEL 5.0. You should be familiar with the difficulty in setting this up, as it's familiar territory. As you may be aware, VCS is expensive. Anecdotally, VCS is not well suited to being set up by configuration management.

Speaking of RHEL, Red Hat Cluster Suite has matured a lot since its original release with RHEL 2.1. The setup/configuration phase is pretty straightforward, the documentation is complete and helpful, and, like VCS, you can purchase support from the vendor. For commercial HA products, RHCS is reasonably priced. I would only use configuration management to install the packages, and maintain them "by hand" through the web interface. Also, I've heard of some people using it on non-Red Hat platforms, though I don't have direct experience with that.

Linux-HA (DRBD/Heartbeat) is great as well, though coming from VCS the configuration may seem simplistic yet unwieldy. It is pretty easy to automate with a configuration management tool.
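
For a sense of scale, the core of a two-node Heartbeat setup is a short /etc/ha.d/ha.cf; a minimal sketch (node names and interface are illustrative), with auto_failback off so a recovered node stays passive, as the question requires:

```
# /etc/ha.d/ha.cf -- minimal sketch, node names and interface are illustrative
logfacility local0
keepalive 2          # heartbeat interval in seconds
deadtime 30          # declare the peer dead after 30s of silence
bcast eth0           # interface used for heartbeats
auto_failback off    # a recovered node stays passive until promoted deliberately
node node1
node node2
```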

As a proof of concept, I've installed a Linux cluster with IBM's HACMP, their AIX clustering software. I would not recommend this, as I recall it is more expensive than even VCS. IBM has specific procedures for installing and maintaining HACMP, so I would not use configuration management here.

jtimberman

Michael is correct that the community is a bit fractured right now, and documentation is a tad sparse.

Actually, it's all there, it's just impossible to understand. What you really want is the "Pacemaker Configuration Explained" ebook... (Link to PDF). You'll want to read it about a dozen times, and then try to implement it, and then read it another dozen times so that you can actually grok it.

The best-supported implementation of cluster services for Linux at this point is probably going to be Novell's SLES11 and its High Availability Extension (HAE). It JUST came out a month or two ago, and it comes with a nice thick 200-page manual that describes how to set it up and get things running. Novell has also been excellent about supporting Pacemaker configurations in various forms.

Beyond that, there's RHEL5's implementation, which has the same package and decent documentation, but I think it's more expensive than SLES. At least, it is for us.

I would avoid Heartbeat right now and go with Pacemaker/OpenAIS, because they're going to be much better supported going into the future. HOWEVER, the current state of the community is such that there are a few experts, there are a few people who are running it in production, and there are a whole ton of people who are completely clueless. Join the Pacemaker mailing list and pay attention to a man named Andrew Beekhof.

Edit to provide requested details:

Pacemaker/OpenAIS uses a 'monitor' operation on a 'primitive resource' (e.g. nfs-server) to keep track of what the resource is doing. If the example NFS server goes unresponsive to the rest of the cluster for X number of seconds, then the cluster will execute a STONITH (Shoot The Other Node In The Head) operation to shut down the primary node, promoting the secondary node to active. You decide in the configuration what to bring up afterward and associated actions to take. Implementation details from there depend on what service you're trying to make fail over, execution windows for certain operations (such as promoting the primary node back to master) and the whole thing's pretty much as configurable as possible.
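
As a rough illustration of what that looks like in the crm shell (a sketch only; the resource name, fencing agent and IPMI details are placeholders, not a complete configuration):

```
# sketch only -- names, addresses and credentials are placeholders
crm configure property stonith-enabled=true
crm configure primitive p_nfs ocf:heartbeat:nfsserver \
    op monitor interval=30s timeout=20s on-fail=fence
crm configure primitive st_node1 stonith:external/ipmi \
    params hostname=node1 ipaddr=10.0.0.1 userid=admin passwd=secret interface=lan
```

With on-fail=fence on the monitor operation, a failed monitor triggers the fencing device instead of just a local restart; in practice you would also add location constraints so a node never runs its own fencing resource.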

Karl Katzke

The Linux HA community is a bit partitioned at the moment.

The Tools Formerly Known As Linux-HA are currently Pacemaker and OpenAIS; these are most often run in combination with DRBD when a shared-nothing architecture is needed.

I suggest getting a good book on this topic before diving in, since this is a quite comprehensive area, and the state of the projects is not necessarily as user-friendly as some vendor solutions.

There are also Linux solutions from some of the cluster software vendors, but I can't tell you much about those since I have never used them myself.

Michael Renner

On Linux we have implemented clustering with Heartbeat and DRBD. Heartbeat checks the status of the servers, and DRBD handles data sync between them. We have an Oracle service running on one server and Apache on another. When the server running Oracle fails, Heartbeat senses it and restores the Oracle service on the server running Apache, and vice versa. We have been using this setup for many other purposes and it has been reliable to date.
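
In Heartbeat v1 terms, that active/active pairing is expressed in /etc/ha.d/haresources roughly as below (a sketch; the addresses, DRBD resources, mount points and init-script names are illustrative):

```
# /etc/ha.d/haresources -- sketch only, values are illustrative
# node1 normally owns the Oracle stack, node2 the Apache stack;
# 'oracle' and 'apache' stand for whatever init scripts start those services
node1 192.168.1.10 drbddisk::r0 Filesystem::/dev/drbd0::/u01::ext3 oracle
node2 192.168.1.11 drbddisk::r1 Filesystem::/dev/drbd1::/var/www::ext3 apache
```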

Viky

Red Hat Cluster Suite will do what you want for just about every possible application. In combination with GFS and Cluster LVM you can have solid shared storage.

Maintenance is not much more difficult than keeping the individual boxes running. The application migration actually makes it easier to patch the individual boxes.

RHCS comes with a web frontend (Luci) and a GTK frontend (system-config-cluster) to make configuration and migration clickable. It lets you configure failover domains per application, recovery policies, and fencing, all from one central, web-based management console.
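
Under the hood those frontends are editing /etc/cluster/cluster.conf; a stripped-down sketch (node, domain and service names are made up, and fence devices are omitted for brevity) looks roughly like this:

```
<?xml version="1.0"?>
<!-- sketch only: names are illustrative, fencing omitted -->
<cluster name="collect" config_version="1">
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="prefer-node1" ordered="1">
        <failoverdomainnode name="node1" priority="1"/>
        <failoverdomainnode name="node2" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <service name="collector" domain="prefer-node1" recovery="relocate">
      <script name="collector" file="/etc/init.d/collector"/>
    </service>
  </rm>
</cluster>
```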

Considering the fact that RHCS actually has a pretty solid support option, I'd go for RHCS.

Not sure how much this would cost you, but I figure it's in the range of several thousand dollars.

wzzrd

UltraMonkey; it's partly built on top of the Linux-HA framework. I've always thought of it as more of a load-balancing solution than a true cluster, but it handles failover well.

gbjbaanb
  • Is that still updated? the website says: `Copyright © 2000-2005, Horms Last Updated: Sat Mar 4 16:33:57 2006 +0900 ` – NickW May 28 '13 at 14:50

We use Linux Virtual Server and keepalived for our high availability. keepalived can either do VRRP on the hosts themselves (which I believe relies on the other server dying), or you can set it up on a separate host to do load balancing, which can have service availability checks. It may be possible to configure service checks in the first situation, but I've not checked. The second situation is particularly good if you can have both servers running at the same time; otherwise you can do a manual switchover.
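
For the VRRP-on-the-hosts case, a minimal keepalived.conf sketch might look like the following (interface, VIP, priorities and the tracked command are placeholders); a vrrp_script/track_script pair is the usual way to add a service check on the hosts themselves, so the master also releases the virtual IP if the application dies rather than only when the whole box does:

```
# /etc/keepalived/keepalived.conf -- sketch only, values are placeholders
vrrp_script chk_app {
    script "/usr/bin/pgrep -f my-collector"   # hypothetical service check
    interval 5
    fall 2
}

vrrp_instance VI_1 {
    state MASTER              # BACKUP on the standby
    interface eth0
    virtual_router_id 51
    priority 150              # lower on the standby
    advert_int 1
    virtual_ipaddress {
        192.168.1.100
    }
    track_script {
        chk_app
    }
}
```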

David Pashley

I wrote a software-based load balancer for TCP which does not require a separate machine. It shares a single IP address by announcing it on a multicast link-level address and negotiating between machines to avoid two machines serving the same TCP connection.

The downside is that it is not really production-ready, but if you want to test it on your test network I'd be pleased.

Fluffy cluster is here

It doesn't necessarily detect an "alive but sick" situation, but it does do load balancing between the member servers (if the userspace process dies, the other nodes will notice and remove the failed node).

MarkR

It's not free, but for those who don't have the time or expertise to install their own HA solution on Linux, the answer is at www.rapidscaleclusters.com. Within minutes you are up and running; it's easy to install and run, and it's also supported.

  • not sure why someone marked this down... this looks like a viable solution (though there are always technical gotchas... at least this doesn't look like a 'service' which was my first impression). – ericslaw Feb 04 '11 at 22:23

I'm working on an open-source failover cluster manager written in shell script. It's in good shape, even if it may be missing some integration you will need. Check it out and let me know if there are any missing features that you would like to see and use: https://github.com/nackstein/back-to-work/

If you are good at shell programming (POSIX shell), you are welcome to join the project development :D

Luigi