
I maintain two datacenters, and as more of our important infrastructure comes under Puppet's control, it is important that the Puppet master at the second site keep working should our primary site fail.

Even better would be some sort of active/active setup, so the servers at the second site are not polling over the WAN.

Are there any standard methods of multi-site puppet high availability?

Kyle Brandt
  • Did I understand your question right? You are looking for a way to have a redundant puppet master in case the puppet master is unavailable? – Hrvoje Špoljar Oct 01 '12 at 15:56
  • It kinda depends on how you are using puppet. There is a lot of flexibility. For example, are you using stored configs? – Zoredache Oct 01 '12 at 15:57
  • Have you looked at "masterless puppet"? The essence of it is that each agent has a checkout of the manifests and they apply them locally. You end up with `git` or `svn` or `rsync` or whatever version control system you use being what you need to scale out rather than the puppet master. – Ladadadada Oct 01 '12 at 16:08
  • Just a hint solving the active-active question: You could use anycast to announce the same (_"virtual"_ / _"Service-"_) IP from both datacenters. We do this for our resolving DNS Servers. In each datacenter our loadbalancers announce the same anycast IP. Our routing prefers the local loadbalancer but falls back to the other DCs in case of failure (~"no longer announcing the anycast IP"). – Michuelnik Oct 01 '12 at 16:09
  • @HrvojeŠpoljar The way I understand it, that's part of a compound goal - basically it sounds like Kyle's trying to maintain two synchronized puppetmasters - one at a primary site, and one at a DR site. Should a Disaster occur at the primary site it's important that the Recovery at the remote site include a working configuration management system :) – voretaq7 Oct 01 '12 at 16:53
  • I see one of the new features for puppet 3.0 is [SRV record support](http://projects.puppetlabs.com/issues/3669), something Windows people are well familiar with and could help with Site stuff. – sysadmin1138 Oct 02 '12 at 00:29

2 Answers


Puppet actually lends itself pretty well to multi-master environments, with caveats. The main one? Lots of parts of Puppet like to be centralized. The certificate authority, the inventory and dashboard/report services, filebucketing and stored configs - all of them are at their best in (or simply require) a setup where there's just one place for them to talk to.

It's quite workable, though, to get a lot of those moving parts working in a multi-master environment, if you're ok with the graceful loss of some of the functionality when you've lost your primary site.


Let's start with the base functionality to get a node reporting to a master:

Modules and Manifests

This part's simple. Version control them. If it's a distributed version control system, then just centralize and sync, and alter your push/pull flow as needed in the failover site. If it's Subversion, then you'll probably want to svnsync the repo to your failover site.
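If you're on Subversion, the mirroring can be as simple as svnsync; the repository URLs here are placeholders for your own:

    # One-time setup: point the failover mirror at the primary repository
    svnsync initialize https://svn-failover.example.com/puppet https://svn.example.com/puppet
    # Then run on a schedule, or from a post-commit hook on the primary
    svnsync synchronize https://svn-failover.example.com/puppet

Treat the mirror as read-only; svnsync owns its revision history, and committing to it directly will break the sync.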

Certificate Authority

One option here is to simply sync the certificate authority files between the masters, so that all share the same root cert and are capable of signing certificates. This has always struck me as "doing it wrong";

  • Should one master really see its own cert presented in client auth for an incoming connection from another master as valid?
  • Will that reliably work for the inventory service, dashboard, etc?
  • How do you add additional valid DNS alt names down the road?

I can't honestly say that I've done thorough testing of this option, since it seems horrible. However, it seems that Puppet Labs are not looking to encourage this option, per the note here.

So, what that leaves is to have a central CA master. All trust relationships remain working when the CA is down since all clients and other masters cache the CA certificate and the CRL (though they don't refresh the CRL as often as they should), but you'll be unable to sign new certificates until you get the primary site back up or restore the CA master from backups at the failover site.

You'll pick one master to act as CA, and have all other masters disable it:

[main]
    ca_server = puppet-ca.example.com
[master]
    ca = false

Then, you'll want that central system to get all of the certificate related traffic. There are a few options for this;

  1. Use the new SRV record support in 3.0 to point all agent nodes to the right place for the CA - _x-puppet-ca._tcp.example.com
  2. Set up the ca_server config option in the puppet.conf of all agents
  3. Proxy all traffic for CA-related requests from agents on to the correct master. For instance, if you're running all your masters in Apache via Passenger, then configure this on the non-CAs:

    SSLProxyEngine On
    # Proxy on to the CA.
    ProxyPassMatch ^/([^/]+/certificate.*)$ https://puppet-ca.example.com:8140/$1
    # Caveat: /certificate_revocation_list requires authentication by default,
    # which will be lost when proxying. You'll want to alter your CA's auth.conf
    # to allow those requests from any device; the CRL isn't sensitive.
    

And, that should do it.
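For reference, the SRV record from option 1 would look something like this in a BIND-style zone file (the hostname is assumed to match the ca_server above):

    _x-puppet-ca._tcp.example.com. IN SRV 0 5 8140 puppet-ca.example.com.

The four values are priority, weight, port, and target; 8140 is the standard master port.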


Before we move on to the ancillary services, a side note;

DNS Names for Master Certificates

I think this right here is the most compelling reason to move to 3.0. Say you want to point a node at "any ol' working master".

Under 2.7, you'd need a generic DNS name like puppet.example.com, and all of the masters need this in their certificate. That means setting dns_alt_names in their config, re-issuing the cert that they had before they were configured as a master, re-issuing the cert again when you need to add a new DNS name to the list (like if you wanted multiple DNS names to have agents prefer masters in their site).. ugly.

With 3.0, you can use SRV records. Give all your clients this;

[main]
    use_srv_records = true
    srv_domain = example.com

Then, no special certs needed for the masters - just add a new record to your SRV RR at _x-puppet._tcp.example.com and you're set, it's a live master in the group. Better yet, you can easily make the master selection logic more sophisticated; "any ol' working master, but prefer the one in your site" by setting up different sets of SRV records for different sites; no dns_alt_names needed.
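As a sketch of what that looks like in the zone (hostnames and site names are made up): agents at a given site set srv_domain to their site's subdomain, and records with a lower priority value are preferred:

    ; Org-wide default: either master, equal preference
    _x-puppet._tcp.example.com.       IN SRV 0 5 8140 puppet01.example.com.
    _x-puppet._tcp.example.com.       IN SRV 0 5 8140 puppet02.example.com.
    ; Site one prefers its local master, falls back to the other
    _x-puppet._tcp.site1.example.com. IN SRV 0 5 8140 puppet01.example.com.
    _x-puppet._tcp.site1.example.com. IN SRV 10 5 8140 puppet02.example.com.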


Reports / Dashboard

This one works out best centralized, but if you can live without it when your primary site's down, then no problem. Just configure all of your masters with the correct place to put the reports..

[master]
    reports = http
    reporturl = https://puppetdash.example.com/reports/upload

..and you're all set. Failure to upload a report is non-fatal for the configuration run; it'll just be lost if the dashboard server's toast.

Fact Inventory

Another nice thing to have glued into your dashboard is the inventory service. With the facts_terminus set to rest as recommended in the documentation, this'll actually break configuration runs when the central inventory service is down. The trick here is to use the inventory_service terminus on the non-central masters, which allows for graceful failure..

facts_terminus = inventory_service
inventory_server = puppet-ca.example.com
inventory_port = 8140

Have your central inventory server set to store the inventory data through either ActiveRecord or PuppetDB, and it should keep up to date whenever the service is available.
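On the central server itself, assuming PuppetDB as the backend (with the puppetdb-terminus package installed), the corresponding setting would be along these lines:

    [master]
        facts_terminus = puppetdb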


So - if you're ok with being down to a pretty barebones config management environment where you can't even use the CA to sign a new node's cert until it's restored, then this can work just fine - though it'd be really nice if some of these components were a bit more friendly to being distributed.

Shane Madden
  • +1 for the CA stuff. Note that you can sync/version control all the CA goodies and simply not activate any of it on the "standby" puppetmasters until a failover situation arises (at which point you turn up the CA bits on your new "master" and update the `SRV` record accordingly -- `SRV` records strike me as the most elegant solution here despite my general ambivalence toward them...) – voretaq7 Oct 02 '12 at 19:21
  • @voretaq7 That's a good point - a purely fail-over setup would be a lot less legwork than this kind of active/active deployment. – Shane Madden Oct 03 '12 at 21:34
  • As an addendum, I've also contributed an update to the multi-master scaling guide in the puppet docs which has good information as well: http://docs.puppetlabs.com/guides/scaling_multiple_masters.html – Shane Madden Mar 23 '13 at 19:08

The "masterless puppet" approach Ladadadada describes is the one I'm most familiar with (it's basically what we do with radmind at my company). I guess more accurately it's "Multiple masters synchronized by an external process", where any given server could (theoretically) serve our entire universe in an emergency.

In our case, because of the nature of radmind, we simply rsync the transcripts and data files from an approved master to each remote site's radmind server, and clients pull their updates from the server with the short hostname radmind (through the magic of resolv.conf this evaluates to radmind.[sitename].mycompany.com - always the local radmind server. If the local server is down it's easy enough to override and point to any other site's server).
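The resolv.conf part of that trick is just the search list; each site's clients carry something like the following (domain names and addresses are placeholders for our real ones):

    # /etc/resolv.conf at site1 - the bare name "radmind" resolves via the search list
    search site1.mycompany.com mycompany.com
    nameserver 10.1.0.53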

This sort of rsync process would probably work in your situation as well, but it's probably sub-optimal compared to a version-control based solution.


For puppet or chef a version-control-based system makes more sense than simple rsync for a few reasons - the big one being that you're version-controlling puppet scripts (rather than entire OS images as you would with radmind).
As added benefits of version-control based management you can have multiple people working on the repository at once (great win for parallelism), you get revision history essentially for free, and if someone breaks the Puppet environment you have easy rollback (presuming you're using git you also have git blame which does what it says on the tin).
Creative branching and merging even lets you handle a major OS upgrade or other transition within the version control framework - Once you get it right simply switch to the new branch and (hopefully) the production push will Just Work.

Were I implementing this here I'd probably take advantage of pre-commit and post-commit hooks in git to ensure that the puppet configurations being committed are sane (client-side pre) and push them out to the rest of the universe if they are (server-side post -- possibly also triggering an environment update if your deployment policies allow such behavior).
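A sketch of the client-side half, assuming your manifests end in .pp and the puppet binary is on the PATH (adapt to taste - this is illustrative, not a hook we actually run):

    #!/bin/sh
    # .git/hooks/pre-commit - reject the commit if any staged manifest fails to parse
    rc=0
    for f in $(git diff --cached --name-only --diff-filter=ACM | grep '\.pp$'); do
        puppet parser validate "$f" || rc=1
    done
    exit $rc

The server-side post-receive hook would then push the validated tree out to the masters via rsync, mco, or whatever fits your environment.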

In terms of bringing up new puppetmaster servers at each site, you can simply check out the puppet environment to each remote puppetmaster, and use either the resolv.conf/hostname hackery I described above or anycast service IPs redirected to local systems like Michuelnik suggested (the latter is handy if you want automatic fail-over if one site's puppetmaster blows up) to handle making sure each site sees the "right" puppetmaster and doesn't clog your WAN links trying to get updates.


The folks at Braintree Payments have apparently combined the version control and rsync solutions along with some custom Capistrano tasks - their solution seems to be half-baked in the sense that it still relies on manual workflow elements, but it could be adapted and automated without too much work.
The paranoid compulsive tester in me has a fondness for their noop sanity-check step - the hater-of-manual-processes in me wishes for some level of automation around it...

voretaq7