I'm trying to choose a configuration management system for 500-2000 very-geographically-distributed hosts. Due to varying network reliability, it's possible that a number of hosts may be temporarily unavailable at any given time. For this reason, my initial choice was Chef, since it uses a "pull" model, and when hosts come online and check in, they'll immediately get current configuration.

However, if my hosts only poll the Chef server for new configuration every 30 minutes, rapid deployments are impossible. Also, I am not a Rubyist. I would prefer to use a push-based model, where I can push configuration to hosts as rapidly as possible. So, the natural choices seem to be Ansible or SaltStack (probably SaltStack). But my question is: How do Ansible and SaltStack handle failed or down hosts? Is there some way to keep retrying a push forever until a host comes back online? Are there existing patterns for properly handling eventual consistency of down-hosts with either of these tools? Thanks!

Will

3 Answers

I can only answer this for Ansible.

Ansible itself does not handle hosts which are not reachable. It will try to connect once, and if that fails, the host is dropped from the current play. But Ansible gives you some tools to deal with this yourself.

First, there is the wait_for module. With a very high timeout, you can use it to wait until a host becomes available.

- wait_for:
    port: 22
    delay: 10
    timeout: 3600
    host: "{{ inventory_hostname }}"
  delegate_to: localhost

This alone though would be a problem when you run the play, because by default Ansible will not process any further tasks until all hosts have passed this task, which is counter-productive in this case. Going by your description, the first hosts could be unavailable again by the time the last host finally becomes reachable.

To solve this you need Ansible 2, which has a new feature called strategies. strategy: free allows every host to run through the play as fast as it can, which means each host proceeds with its tasks as soon as it is available.
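Putting the two together might look like the following sketch (an assumed playbook; the task list after the wait is a placeholder):

```yaml
# Sketch: free strategy plus a long wait_for, so each host starts its
# configuration the moment it becomes reachable (requires Ansible >= 2.0).
- hosts: all
  strategy: free
  gather_facts: no          # the host may not be up yet, so gather later
  tasks:
    - wait_for:
        port: 22
        delay: 10
        timeout: 3600
        host: "{{ inventory_hostname }}"
      delegate_to: localhost

    - setup:                # gather facts once the host is reachable

    # ... your actual configuration tasks here ...
```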

Still, a connection could go down, and in that case there is no built-in way to automatically retry. If the SSH connection cannot be established, a fatal error is thrown for that host, and as of Ansible ~1.9 there is no way to catch this kind of connection error. That does not affect the other hosts though; they will all play fine.

You can retry manually though. Failed hosts are stored in a file <playbook-name>.retry next to the playbook itself. To retry only the failed hosts you can then run:

ansible-playbook ... --limit @<playbook-name>.retry
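To keep retrying until every host has converged, you could wrap this in a small loop: ansible-playbook exits non-zero while any host fails and rewrites the .retry file, so loop until a clean run. This is only a sketch; the playbook name, retry path, and back-off interval are assumptions.

```shell
# Re-run the playbook against only the failed hosts until it succeeds.
retry_until_converged() {
  playbook=$1; retry=$2; pause=${3:-600}
  ansible-playbook "$playbook"                      # first full run
  while [ -f "$retry" ]; do                         # failures remain
    sleep "$pause"                                  # back off before retrying
    ansible-playbook "$playbook" --limit "@$retry" || continue
    rm -f "$retry"                                  # clean run: we're done
  done
}

# Usage: retry_until_converged site.yml site.retry 600
```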
udondan
  • Very very informative, thanks! I suppose a script or daemon could just keep trying to run against `hosts.retry`... – Will Feb 24 '16 at 07:09

Salt runs in a pull model from the nodes to the master. You can issue global commands from the master like

salt 'api*.domain.com' state.highstate

That will run a highstate on all hosts that have an id (hostname) matching api*.domain.com. A highstate is like a full Chef run.

Usually, people will either have the master schedule highstate runs on the minions, or they will configure the schedule on the minions themselves, e.g. to run a highstate every 10 minutes.

So if a node is down and you run a state from the master, Salt will report the node as down in its output, which can be formatted in many different ways for you to ingest. It can even be logged to MySQL, for example.
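As a quick illustration (the target pattern is taken from the command above), test.ping shows which minions respond, and the output formatter can be switched for machine ingestion:

```shell
# Ping all matching minions; down minions are reported as not responding.
salt 'api*.domain.com' test.ping

# Same run, formatted as JSON so another system can ingest the results:
salt --out=json 'api*.domain.com' test.ping
```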

So, for example, if you ran the above command on the master to run a highstate on all api*.domain.com nodes, and 2 of the 5000 were currently rebooting, then once salt-minion came back online they would get the event from the master via the message bus and run the highstate.

Salt also has a concept of proxy nodes to reduce the load on a master. You could have a single master somewhere and a proxy node in each datacenter; all commands sent from the master go through the proxy nodes, and the minions in those datacenters talk to their proxy node and never to the master.
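Salt ships this as the Syndic. A minimal configuration sketch (the master address is an assumption) looks like:

```yaml
# /etc/salt/master on the per-datacenter node, which runs both
# salt-master and salt-syndic and forwards to the top-level master:
syndic_master: master.example.com

# /etc/salt/master on the top-level master:
order_masters: True
```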

Mike
  • Small addition: you can also add Salt Startup states to a minion (https://docs.saltstack.com/en/latest/ref/states/startup.html). This way you can schedule a pull once a node comes up, including a `state.highstate`. More complex setups are possible with Salt Reactor (https://docs.saltstack.com/en/latest/topics/reactor/index.html) where the Master receives an event once a Minion is started and can then trigger any command on hosts of your choice as a result. – ahus1 Feb 28 '16 at 16:40
  • Thanks! I'm going to accept this one, although the other answers were great too, this is a viable solution for my project. I also very much appreciate @udondan's answer for Ansible. – Will Mar 17 '16 at 01:01

To extend Mike's answer, you can do push and pull simultaneously with Salt. Pushing is as easy as

salt 'api*.domain.com' state.highstate

At the same time, your minions can do a scheduled pull every X minutes or hours via the built-in scheduler. My preferred method is to configure it via pillar, but adding it to the minion config works too. Something like:

schedule:
  highstate:
    function: state.highstate
    maxrunning: 1
    hours: 1
    splay: 600
savamane
  • Thanks so much for this addition, very helpful. I accepted @Mike's answer, but your extension helps a lot. Thanks!!! – Will Mar 17 '16 at 01:02