
Are there any major alternatives for automatic failover on Linux besides the typical Heartbeat/Pacemaker/CoroSync combinations? In particular, I'm setting up failover on EC2 instances, which only supports unicast - no multicast or broadcast. I'm specifically trying to handle the few pieces of software we have which don't already have automatic failover and don't support multi-master environments. This includes tools like HAProxy and Solr.

I have Heartbeat+Pacemaker working, but I'm not thrilled with it. Here are some of my issues:

  • Heartbeat - By itself, limited to two nodes. I'd like to have 3+.
  • Pacemaker - Impossible to configure automatically. The cluster has to be running with quorum, and even then it still requires manual configuration.
  • CoroSync - Does not support unicast.

Pacemaker works very well, although its power makes it difficult to set up. The real problem with Pacemaker is that there is no easy way to automate the configuration. I really want to launch an EC2 instance, install Chef/Puppet and have the entire cluster launch without my intervention.

quanta
organicveggie

9 Answers

I prefer to use keepalived for high availability. I find it simpler to set up (one daemon and one config file) than Heartbeat and company. The only drawback I run into is that keepalived doesn't have a unicast option by default and only uses VRRP for communication (the author of HAProxy has, however, written a unicast patch for keepalived).

JimB
  • Unicast is a must, but I'll take a look at the patch. – organicveggie Jun 03 '11 at 14:52
  • +1 I had been used to using Heartbeat in all "failover" situations, until I read a post (somewhere) by the author of HAProxy on why I'd been doing it wrong (or at least inefficiently) and should use keepalived instead. It all depends on whether the important thing is failing over a network path (e.g. moving an IP to a different server - keepalived) or ensuring only single access to a resource (e.g. a SAN connection - Heartbeat). – Coops Jun 17 '11 at 07:59
  • This is the mail @Coops is referring to, I believe: http://www.formilux.org/archives/haproxy/1003/3259.html – Henrik Jun 23 '12 at 21:46
  • Since release 1.2.8 (2013-08-05), Keepalived supports unicast (http://www.keepalived.org/changelog.html). – Dynom Aug 21 '14 at 13:58
  • Introductory article: https://opentodo.wordpress.com/2012/04/29/load-balancing-with-ipvs-keepalived/ – Vadzim Feb 27 '15 at 13:34
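Since keepalived has supported unicast since 1.2.8, here is a minimal Python sketch that renders a `vrrp_instance` block using the `unicast_src_ip` and `unicast_peer` directives instead of multicast, so the peer list could come from Chef/Puppet. The instance name, interface, router id, priority and all addresses are hypothetical placeholders; check the directives against the keepalived.conf man page for your version. Note that on EC2 a VRRP address change alone typically won't reroute traffic; you would usually pair it with a notify script that moves the address through the AWS API, which is not shown here.

```python
"""Render a keepalived vrrp_instance that talks to its peers over unicast.

Sketch only: every name and address passed in is a placeholder.
"""


def render_vrrp_instance(
    name: str,
    interface: str,
    router_id: int,
    priority: int,
    src_ip: str,
    peers: list[str],
    vip: str,
) -> str:
    # The local node goes in unicast_src_ip; every other node is a peer,
    # one address per line inside the unicast_peer block.
    peer_lines = "\n".join(f"        {p}" for p in peers if p != src_ip)
    return (
        f"vrrp_instance {name} {{\n"
        f"    state BACKUP\n"
        f"    interface {interface}\n"
        f"    virtual_router_id {router_id}\n"
        f"    priority {priority}\n"
        f"    unicast_src_ip {src_ip}\n"
        f"    unicast_peer {{\n"
        f"{peer_lines}\n"
        f"    }}\n"
        f"    virtual_ipaddress {{\n"
        f"        {vip}\n"
        f"    }}\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Hypothetical three-node HAProxy pair/trio sharing one address.
    nodes = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
    print(render_vrrp_instance("VI_HAPROXY", "eth0", 51, 100, nodes[0], nodes, "10.0.0.100"))
```

Passing the full node list on every host and filtering out the local address keeps one template usable across the whole cluster.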

I am actually working on something very similar to what you described (a failover cluster on EC2), and after trying out Heartbeat, settled on Corosync as my messaging layer. Corosync will run on multiple servers and it does support unicast (UDPU) as of version 1.3.0 (from November 2010). I have set up and tested Corosync on Amazon's EC2 cloud (using Amazon's Linux AMI) and can confirm it works without issue.

A sample udpu file is installed to /etc/corosync.

Add one member block to the interface section for each node, and specify the transport as udpu. (I have used the same port as Heartbeat in the example below, but you can change it as desired.)

e.g.:

totem {
        version: 2
        secauth: off
        interface {
                member {
                        memberaddr: 10.xxx.xxx.xxx
                }
                member {
                        memberaddr: 10.xxx.xxx.xxx
                }
                ringnumber: 0
                bindnetaddr: 10.xxx.xxx.xxx
                mcastport: 694
        }
        transport: udpu
}
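Because the only per-cluster part of that file is the list of member blocks, it is easy to generate from an inventory, which is what the question asks for with Chef/Puppet. A minimal Python sketch, assuming the node addresses are supplied by your configuration management tool; the addresses, bindnetaddr and port below are placeholders, and the layout simply mirrors the example above.

```python
"""Generate the corosync 1.x totem section for UDPU (unicast) transport."""


def render_totem(members: list[str], bindnetaddr: str, port: int = 694) -> str:
    lines = [
        "totem {",
        "        version: 2",
        "        secauth: off",
        "        interface {",
    ]
    # One member block per cluster node, including the local node.
    for addr in members:
        lines += [
            "                member {",
            f"                        memberaddr: {addr}",
            "                }",
        ]
    lines += [
        "                ringnumber: 0",
        f"                bindnetaddr: {bindnetaddr}",
        f"                mcastport: {port}",
        "        }",
        "        transport: udpu",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical three-node cluster on a 10.0.0.0 network.
    print(render_totem(["10.0.0.11", "10.0.0.12", "10.0.0.13"], "10.0.0.0"))
```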

(Heartbeat is supposed to support clusters of 3+ nodes in versions 1.2.3+, although I have never tried it personally and don't know whether it would work with unicast.)

cyberx86
  • I have setup a cluster of 3 machines using udpu, and that worked fine. You just keep adding member blocks to them. – devicenull Sep 24 '11 at 18:50

Sorry, but the part about Pacemaker is not true. The Pacemaker regression and release tests make extensive use of automation.

To configure without an active cluster, prefix all commands with CIB_file=/var/lib/heartbeat/crm/cib.xml or set it in your environment. Just be sure you remove the .sig file before starting the cluster.

For clusters without quorum, most if not all tools should support -f or --force which will instruct the cluster to accept the change anyway. If you find a tool that does not - please file a bug.
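The offline workflow above can be scripted so it runs unattended from Chef/Puppet. A minimal Python sketch: it sets `CIB_file` for each command it runs and removes the signature file before the cluster is started. The default path is the one given in the answer (newer Pacemaker releases may keep the CIB elsewhere, e.g. under /var/lib/pacemaker/cib); the `cib.xml.sig` file name is an assumption about what "the .sig file" refers to, and the Pacemaker commands you pass in are your own.

```python
"""Run Pacemaker CLI tools against an on-disk CIB before the cluster starts."""
import os
import subprocess
from pathlib import Path

CIB_PATH = Path("/var/lib/heartbeat/crm/cib.xml")


def run_offline(cmd: list[str], cib: Path = CIB_PATH) -> subprocess.CompletedProcess:
    # With CIB_file set, the tools read and write this file instead of
    # talking to a running cluster. Raises CalledProcessError on failure.
    env = dict(os.environ, CIB_file=str(cib))
    return subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)


def drop_signature(cib: Path = CIB_PATH) -> None:
    # Assumption: the signature sits next to the CIB as "<cib name>.sig".
    sig = cib.with_name(cib.name + ".sig")
    if sig.exists():
        sig.unlink()
```

Call `run_offline` once per configuration command, then `drop_signature` as the last step before starting the cluster.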

quanta

In the open source world, there's Red Hat Cluster Suite. It's been several years since I last implemented RHCS, so I don't have many relevant things to say about it today.

Commercially, there is Veritas Cluster Server. No experience with it.

A much simpler and open source HA tool is UCARP. UCARP doesn't provide nearly the same kind of "infrastructure" that Heartbeat/Pacemaker/CoroSync does but you can build HA solutions around it.

You can also build highly available infrastructure with virtualization technologies but these solutions tend to focus on host-level availability as opposed to application level availability.

rthomson
  • Thanks. I'll take a look at RHCS, VCS and UCARP. I have updated my question to reflect the fact that I'm using Amazon EC2, so host-level availability isn't something I have much control over... hence why I'm looking at application level availability. – organicveggie Jun 03 '11 at 14:55

There is Oracle Clusterware for Oracle Unbreakable Linux, though I've not used it.

Kendall

If you are already using EC2, why not use Elastic Load Balancing? It will let you achieve application level availability without having to configure failover yourself.

manku
  • There are several reasons ELB doesn't fit. First, ELB only works for requests coming in from the public Internet - it cannot be used for internal requests, unless you route your requests out to the public address of the ELB and then pay for all the traffic. Second, ELB is a very simple balancer - you can't apply any rules or patterns to how it works and you can't have stand-by servers. For example, you don't want two separate HAProxy instances actively pointing at the same web server because they won't have any idea of the actual load on the target web server. – organicveggie Jun 26 '11 at 14:10

Veritas Cluster is great (compared to Linux-Heartbeat, AIX-hacmp, HP-Serviceguard and Sun cluster), but it costs lots of money. The last time I looked at it, its price was based on the number of CPU cores in the cluster. The current vendor is Symantec.

Nils

opensvc (https://www.opensvc.com) supports multiple heartbeat drivers:

  • unicast
  • multicast
  • shared disk
  • 3rd site relay

and also has quorum mechanisms in case of split brain.

I managed to automatically set up a 4-node cluster made of 2 Google Cloud instances + 2 Amazon instances with Terraform + Ansible.
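To see why quorum matters for split brain, and why the question's wish for 3+ nodes helps, here is a small Python sketch of a plain majority-vote rule. It is a generic illustration of the idea, not necessarily how opensvc (or any particular cluster manager) computes quorum.

```python
"""Majority-vote quorum: a generic illustration of split-brain protection."""


def quorum_size(nodes: int) -> int:
    # A strict majority of the configured nodes.
    return nodes // 2 + 1


def partition_has_quorum(reachable: int, nodes: int) -> bool:
    # A partition may keep running services only if it still sees a majority.
    return reachable >= quorum_size(nodes)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        # Split the nodes as evenly as possible; this is the larger side.
        larger_side = n - n // 2
        print(n, quorum_size(n), partition_has_quorum(larger_side, n))
```

With 2 nodes, neither side of a network split holds a majority, so the cluster cannot safely decide which node keeps the service; with 3 nodes, the side with 2 nodes does. An even-sized cluster split down the middle has the same problem, which is why odd node counts are usually preferred.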

Chaoxiang N

I wrote a failover cluster manager in POSIX shell: https://github.com/nackstein/back-to-work

Take a look at it; I'm looking for someone who wants to try it and help with development.

Luigi