2

I'm currently trying to sell "DevOps" to my management, and one of the things I'm investigating is configuration management tooling. One of the big requirements for us is a system with high availability and good failover behavior.

  • For CF-Engine this isn't an issue, as every node can be configured to run as a server and runs will continue if the server isn't available.
  • For Puppet you have a choice of master/masterless modes, each with its own pros and cons.
  • For Chef the initial run requires the master server to fetch the policy, but after that any run will continue with the current policy if the master isn't available.
  • For Salt, if the master server goes down then configuration is not enforced, as all actions are driven by the master.
  • For Ansible (like Salt), if the master server goes down then configurations are no longer enforced, as again all actions are driven by the master server.

I'm not including Windows PowerShell DSC in this list, as my current use case is to use PowerShell DSC to assist in managing Windows systems, with Puppet, Chef, Ansible, Salt or CF-Engine as the overall managing tool.

I want to know if my understanding of each of these tools is correct, and if it isn't, why?

Bicker x 2
  • Ansible has no notion of "master server". You may run your playbooks from any computer with the Ansible binaries installed. I'd also argue that your use of the term "policy" is ambiguous. – jscott Sep 27 '16 at 11:29
  • @jscott: You're right, I should phrase that better; I should say "configuration". I'll change it now. Am I right in thinking that the behavior is a lot like git in that respect, in that you have no actual "master" but a notional one, or have I got it completely wrong? – Bicker x 2 Sep 27 '16 at 11:37
  • Chef server is required for subsequent runs too. – Jason Martin Jan 24 '17 at 15:42

4 Answers

3

I will comment only on the ones I have experience with, namely Puppet and Ansible, and I'm omitting some details.

Both can be set up to run agentless or local-only if needed. To use them local-only you obviously need some way to transfer the required manifests / playbooks to the target machines and run them there.
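As an illustrative sketch of those local-only runs (the manifest path and playbook name are placeholders, not from the original answer):

```shell
# Puppet masterless: apply a manifest directly on the node, no master needed
puppet apply /etc/puppetlabs/code/environments/production/manifests/site.pp

# Ansible local-only: run a playbook against the local machine only
# (note the trailing comma, which makes "localhost," an inline inventory)
ansible-playbook --connection=local -i "localhost," playbook.yml
```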

Talking about Puppet usage with masters, you can get redundancy by putting the actual masters behind a load balancer.

Ansible, by contrast, has no master concept: any machine that can connect to the managed machines via SSH / PowerShell can do the job, provided you have a way to access the playbooks. Perhaps you meant Ansible Tower, which uses a database for its operation and can be clustered if needed.

This brings us to the real redundancy concern in both cases: the actual scripts. In nearly all cases I have seen, those live in a git repository, so they are inherently redundant; just clone it and you can have as many "master" copies as you wish.
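A minimal sketch of that idea (the repository URL and playbook name are placeholders):

```shell
# Any machine with a clone of the repo holds a full, usable "master" copy
git clone https://git.example.com/ops/config-repo.git /opt/config-repo

# Keep the local copy current and apply it locally, so enforcement
# survives the outage of any single server
cd /opt/config-repo && git pull && \
  ansible-playbook --connection=local -i "localhost," site.yml
```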

Hope this helps.

Fredi
  • Thank you for that, it was very helpful (it is a shame you can't have multiple answers). I had indeed confused Ansible and Ansible Tower. Thanks – Bicker x 2 Sep 27 '16 at 12:08
2

If you look at Salt, the only information that makes up a working connection between master and minions is:

  • the fact that the minion can resolve the master IP somehow
  • the minions' public keys in the /etc/salt/pki/master directories

If your Salt master dies, the systems will keep running unaffected. But you are right: you cannot roll out any changes to your configurations while the master is gone. So the question is: how fast can you get it back?

You can simply reinstall the master and start it up; you can accept your minions' keys again (or restore a backup of them) and you are back where you left off with your old master. If you cannot reuse the same machine, then you would need to point the minions to the new master somehow.
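A hedged sketch of that recovery (the hostname is a placeholder):

```shell
# On the rebuilt master: list pending minion keys, then accept them
salt-key -L
salt-key -A            # accept all pending keys (or salt-key -a <minion-id>)

# If the master moved to a new host, repoint each minion by setting
# the "master:" line in /etc/salt/minion, e.g.:
#   master: newmaster.example.com
# then restart the minion service
systemctl restart salt-minion
```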

There is no state data in a database that might be corrupted or gone; that, for me, is the beauty of it. It's an overlay; it does not squeeze itself in. Unlike, as a counter-example, Juju, where once your database is gone your systems act like they are beheaded and you have to reinstall.

There are also multi-master and Syndic setups in Salt; high availability is a long-standing topic in its development.
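For illustration, a multi-master minion configuration is just a list of masters in /etc/salt/minion (hostnames are placeholders; this is a sketch, and it assumes no conflicting "master:" line already exists):

```shell
# Add a multi-master section to the minion config;
# the minion will maintain a connection to each listed master
cat >> /etc/salt/minion <<'EOF'
master:
  - master1.example.com
  - master2.example.com
EOF
systemctl restart salt-minion
```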

user378016
  • 21
  • 1
  • Thank you for your answer, it's a shame I can't mark multiple answers as correct – Bicker x 2 Oct 17 '16 at 15:18
  • It should also be noted that with Salt-SSH you can easily have several master servers, and/or you can use a HA setup such as pacemaker to apply policies from a different server if the primary is down, you just have to have a current copy of the master data. – Josip Rodin Apr 10 '17 at 20:03
1

To round things out with the above: Chef (when using chef-client; chef-solo is purely local and has no server component that could fail) requires the server on every run. There are ways to use the cached data in the event of an outage, but it's definitely not the default behavior, or even easy. We recommend you run Chef Server as a redundant/clustered system with one cluster per failure zone. Check out the chef-backend product for clustering and Facebook's Grocery Delivery for multi-server sync.
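For illustration, the server-free modes mentioned above look roughly like this (the config and cookbook names are placeholders):

```shell
# chef-solo: converge entirely from local cookbooks, no Chef Server at all
chef-solo -c solo.rb -j node.json

# chef-client local mode: run against a transient in-memory server
# backed by a local chef-repo, still without a real Chef Server
chef-client --local-mode --runlist 'recipe[mycookbook]'
```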

coderanger
0

So first, I want to thank jscott, Fredi, user378016, and coderanger for all their answers.

To answer my own question:

  • For CF-Engine this isn't an issue as every node can be configured to run as a server and the runs will continue if the server isn't available.

This is all well documented on the CF-Engine website; an example can be found here: https://cfengine.com/learn/how-cfengine-works/

  • For Puppet you have a choice of Master/Masterless modes and their pro's and cons.

Puppet has a variety of modes and, as Fredi has indicated, you pick one or the other. However, after doing more digging, Puppet is very flexible and has good features supporting both master and masterless modes.

  • For Chef the initial run requires the master server to fetch the policy but after that any run will continue with that current policy if the master isn't available.

So this wasn't quite correct: when running in server mode (not using chef-solo), every run requires a connection to the master. As has been mentioned, there are ways to do fallback caching that have some interesting potential and may be worth looking into some more.

  • For Salt if the master server goes down then configuration is not enforced as all actions are done on the master

So thanks to user378016 for confirming; I think the answer provided covers this quite nicely (permalink: https://serverfault.com/a/805791/225383)

  • For Ansible (like salt) if the master server goes down then configurations are no longer enforced, as again all actions are done by the master servers

So Ansible is a tricky one (again, thanks to Fredi for his answer). It gives the strong benefit of only having to install the Ansible software on one server. The playbooks stored on this "master" don't necessarily run on the master but can be applied to other machines via SSH. This of course requires that all the boxes you wish to configure are accessible via SSH and meet certain preconditions (as outlined in a playbook). In certain cases this is not desirable.

Edit: I should add that Ansible can run in a way similar to masterless Puppet or chef-solo, by installing Ansible on the node to be managed and having it pull the configuration from somewhere like git and then apply it locally.
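That pull-based mode is what the ansible-pull command provides; a sketch (the repository URL is a placeholder):

```shell
# On the managed node itself: fetch the repo and run the playbook
# locally, no central controller required (local.yml is the
# conventional default playbook name for ansible-pull)
ansible-pull -U https://git.example.com/ops/ansible-repo.git local.yml

# Typically scheduled via cron so policy keeps being enforced
# even if every other server is down
echo '*/30 * * * * root ansible-pull -U https://git.example.com/ops/ansible-repo.git local.yml' \
  > /etc/cron.d/ansible-pull
```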


For those who are interested, the direction I'm going in is recommending CF-Engine, Chef or Puppet. While Ansible and Salt are both interesting, for my use case they are not the optimal solution: I need to be able to ensure that policies are enforced no matter what, with good reporting metrics, and a reliance on SSH always being available is not really an option (and yes, while we could install the server components on every box, or use a scheduler of some sort to force a configuration, this seems counterintuitive to their original architecture).

All of these products are very good and have different strengths and weaknesses; in this case I felt that Ansible and Salt are not suitable, not only for the reason above but for various other reasons as well.

Bicker x 2
  • It's not quite clear what your requirements really are. So you have an extended outage of the central nodes (the source of truth), and you want the managed machines to keep enforcing the latest known policy prior to the outage. But if that in turn starts failing, what then? Does anything or anyone get notified, through the CM systems or otherwise? Where does the requirement for high availability end - on the backend provisioning systems or on the critical production systems that they manage? – Josip Rodin Apr 10 '17 at 20:52
  • @JosipRodin, so in this case I was as specific as I could be; when I was looking at this, it was part of a larger "decision matrix" for my employer on which CM tool I would recommend. I struggled to find concise information on HA and redundancy (hence the question). – Bicker x 2 Apr 19 '17 at 13:08