We've set up a server that's running the infrastructure for a small association. So far, we've tried to manage the configuration with Ansible, but that has not been a great success. Perhaps we're doing it wrong.

In principle, the idea is that this server will be left alone most of the time, with people adding or changing things once in a blue moon. This makes it crucial that whatever is configured and running on the server is well-documented and clear, as people who do not admin the system frequently are bound to lose overview (let alone remember the details). Additionally, over time, the composition of the group of people who will admin this server will change (as people leave and join the 'committee').

We started out with a clean installation, adding roles in ansible whenever we wanted to set something up (nginx, phpfpm, postfix, firewall, sftp, munin, ..). Perhaps due to our inexperience, we're of course never able to type out a set of ansible tasks exactly the way we need it to be in one go, also because configuration is a bit of a trial and error process. That means that in practice, we would typically first configure whatever service we wanted to run on the server, and then translate to ansible tasks. You can see where this is going. People forget to then test the task, or are afraid to do so at the risk of breaking things, or worse: we forget or neglect to add things to ansible.

Today, we have very little confidence that the ansible configuration actually reflects what is configured on the server.

Currently I see three main problems:

  • It is hard to (read: we don't have a good way to) test ansible tasks without risking breaking things.
  • It adds extra work to first figure out the desired configuration, and then figure out how to translate this to ansible tasks.
  • (Ideally,) we do not use it frequently enough to build up familiarity and routine.

An important consideration here is that for whatever we end up doing, it should be easy for newcomers to learn the ropes without a ton of practice.

Is there a viable alternative that still provides some guarantees and checks (comparable to merging Ansible files to some master) that "configure things and write down what you did" fails to provide?

EDIT: We've considered committing /etc to git. Is there a reasonable way to protect secrets (private keys, etc) that way, but still have the configuration repository available outside the server somehow?

Joost

3 Answers

Just spin up a test/staging VM that you can use to validate your changes. Your current method of performing changes manually first is hopelessly broken and doomed to failure. You and your team need to commit to using CM properly and part of that is having a test system available. Even just a local vagrant VM would be sufficient.

Not only will this help with testing new changes, but it will also serve as a test bed for new employees (or older employees who haven't used the system in a while) to familiarize themselves with your ansible setup.
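
As a rough sketch of how a staging machine could sit next to the live server, the YAML inventory below defines two groups. The file name and both hostnames are hypothetical placeholders, not taken from the question.

```yaml
# inventory.yml (hypothetical file name; both hostnames are placeholders)
all:
  children:
    staging:
      hosts:
        staging.example.org:
    production:
      hosts:
        server.example.org:
```

Assuming a playbook called `site.yml`, running `ansible-playbook -i inventory.yml site.yml --limit staging` would apply it to the staging group only. The same command with `--limit production` would follow once the result looks right.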

Regarding keeping /etc/ in git: no, don't do this. That directory is only a tiny portion of what ansible is changing, and having git in place there will only encourage people to make local changes.

Keep your ansible playbooks in git. Consider restricting permissions such that only you can apply ansible changes to the live server. Others can submit pull requests with their changes, which you can review and merge into master if appropriate.

EEAA
  • Right, that's the ideal scenario. I get that. The thing is, we're not a company, and we don't have people working on this full-time. Perhaps I made the scale of this insufficiently clear. Every additional part (such as a Vagrantfile) adds complexity that would need to be passed on, and running two configurations (i.e. one testing system where things like letsencrypt automation need to be mocked) does not aid simplicity. – Joost Nov 03 '16 at 12:49
  • Well, you asked for how to solve your problems and I gave my answer. The above is exactly how we do things at my company, and it works very well. Yes, there is additional cost in terms of server space and time required to test, but those are well worth it because we have a very high level of assurance that within minutes, we could rebuild any of our servers if needed. – EEAA Nov 03 '16 at 12:52
  • At the core, this is really a cultural and resourcing problem, not a technical problem. You haven't committed to using configuration management. Whether or not you are a company is irrelevant. You're asking for help on how to do things properly, and having a staging environment is part of that. – EEAA Nov 03 '16 at 12:54
  • Keeping a well-written README in your ansible repo goes a long way to helping your colleagues remind themselves how to make and apply changes. – EEAA Nov 03 '16 at 12:58
  • I very much agree that this is a cultural problem - it's just not one I see a clear solution to, given the limited resources. Our situations are inherently different: we're managing one server, and the goal is not to be able to rebuild it in minutes if it breaks down - the goal is to track and manage changes to its configuration. Indeed, we have not committed to using CM, as the question at hand is precisely whether we should, or whether there's a more 'lightweight' (most importantly: easy to get into) alternative as opposed to fully committing and investing in learning to use it properly. – Joost Nov 03 '16 at 13:02
  • IMHO, yes, you should commit to it. Whether or not you can convince your colleagues is another question, though. There is no lightweight way to do this that doesn't require some level of intentionality from those managing the server. Of the modern CM systems, ansible is by far the easiest to come up to speed on. You *do* want to track server changes over time. The only way to do this reliably is to use CM. – EEAA Nov 03 '16 at 13:05
  • Perhaps you're right, though - making a one-time investment to lay a proper foundation together with a thorough README on how to further build upon the ansible files can be a robust enough set-up that can be passed on to others. Maybe the only way to see is by actually sitting down and writing it up. – Joost Nov 03 '16 at 13:06
  • Virtual machines are not enough though. We've got stuff to set up like certificate issuing and mail servers that simply won't work in a VM. This adds a lot of complexity to your vagrant roles because you need to branch based on that - e.g. nginx won't run without certificates, so you need to run ssl conditionally, etc. – Thom Wiggers Nov 03 '16 at 14:28
  • Also mind that we're really talking about like 1 hour a month here in terms of touching the server, after the initial setup. – Thom Wiggers Nov 03 '16 at 14:28
  • @ThomWiggers I'm going to presume you two are on the same team since you used "we". OK, you asked how to do this properly. I gave an answer. Either you want to do it properly or you don't. Doing CM properly takes time, money, and intentionality. If you have requirements like procuring and deploying certs via LE, then stand up a $5US/month virtual machine with Digital Ocean and use that for testing. Heck, you could even just deploy it on demand when you want to test changes and then kill it. – EEAA Nov 03 '16 at 15:23

> Perhaps due to our inexperience, we're of course never able to type out a set of ansible tasks exactly the way we need it to be in one go, also because configuration is a bit of a trial and error process. That means that in practice, we would typically first configure whatever service we wanted to run on the server, and then translate to ansible tasks.

While there are other issues (like not having a testing environment), you can have a big improvement by not doing this.

One of Ansible's core design goals is to be idempotent, which means that running your playbook multiple times shouldn't change anything (unless you've changed the plays). Thus, when I'm configuring a new piece of software, my steps are:

  1. Make changes to the Ansible tasks.
  2. Run the playbook.
  3. Examine the system, and if it's not correct, return to step 1.
  4. Commit my changes.

If you don't think you'll write the correct thing the first time in Ansible, write it anyways and iterate on it until it's right, just like any other code. This greatly reduces the chance of forgetting to Ansiblize some change you made, since every change you made was already in Ansible at some point during your development process.
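
A minimal sketch of what one pass through this loop might look like, assuming a Debian-style server (hence the `apt` module) and an nginx site. The file names, template path, and tag name are illustrative assumptions, not details from the question.

```yaml
# site.yml (hypothetical playbook name)
- hosts: all
  become: true
  tasks:
    # Declares the desired state; a second run should report "ok", not "changed".
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
      tags: [nginx]

    # The template path and destination are placeholders for illustration.
    - name: Deploy nginx site configuration
      ansible.builtin.template:
        src: templates/site.conf.j2
        dest: /etc/nginx/sites-available/site.conf
        mode: "0644"
      notify: Reload nginx
      tags: [nginx]

  handlers:
    # Only runs when a task that notifies it has reported a change.
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

Running `ansible-playbook site.yml --check --diff` first previews what would change without touching the server, which may help when people are wary of breaking things. After a real run, a second run that reports no changed tasks is a reasonable sign that the tasks are idempotent. Adding `--tags nginx` restricts a run to the tagged tasks, which can shorten each iteration.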

Xiong Chiamiov
  • Yep, this is great advice. Doing this, and ensuring that you can *always* get your server back into a known-good state is very freeing - if things go south, just nuke the server and re-deploy. – EEAA Nov 04 '16 at 02:36
  • Right, I agree that this is a very solid middle-ground between where we are now and where we should be. Of course, this is how we started out. I suppose that the main reason we drifted to where we are now is that step 2 was making the whole cycle take too long. It could be that we were doing playbooks wrong. Now that we've gotten a bit more versed at writing Ansible tasks it may be worthwhile to try again, though. In your experience, how long would a full cycle take and how often would one iterate? I realise any numbers are going to be based on all sorts of assumptions. – Joost Nov 04 '16 at 08:40
  • A different problem I experienced with this iterative process happens when you write a task that makes changes, make the changes to the server, discover that the changes are wrong, update your task and re-apply the playbook. Now the server contains a mix of two sets of changes: the ones from the first iteration of the task, and the ones from the second. Usually the second iteration will overwrite the first, but not necessarily always. Is there a reasonable way to 'clean up' rather than 1) manually SSH'ing in to undo, or 2) starting from a clean installation every time? – Joost Nov 04 '16 at 08:57
  • Additionally, nuking the server is often not trivial *if you only have one* – Thom Wiggers Nov 04 '16 at 16:29
  • "In your experience, how long would a full cycle take and how often would one iterate?" -- I started using Ansible in January; by about June, I got to the point where I'm *faster* doing the entire process in Ansible than by hand, for most tasks. The specific time of course depends on the project, from a few minutes to a few weeks (for some particularly cantankerous software). If you find that the running of the playbook itself is slowing you down, you may want to look into using [tags](https://docs.ansible.com/ansible/playbooks_tags.html) to only run a subset during your iteration loops. – Xiong Chiamiov Nov 04 '16 at 17:53
  • Re: mixed changes: It's a difficult problem. Often you can reverse your ansible task to undo what you've done (for instance, specifying that a package be uninstalled, now that you've realized you don't need it after all), and much of this happens naturally as you change your playbooks (an old version of a configuration file gets updated to the new version). There's no *guarantee*, though, and that's just a downside you get by using an idempotent system instead of **immutable infrastructure**. (cont.) – Xiong Chiamiov Nov 04 '16 at 17:58
  • The immutable infrastructure approach solves this by never modifying an existing server, but bringing up a new one and swapping it in. The primary downsides to this are that it's time consuming and only works well where you have disposable machines, like AWS. – Xiong Chiamiov Nov 04 '16 at 18:00
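
The "reverse your task" idea from the comments above could look roughly like the tasks below. This is a sketch only: the package name and file path are hypothetical, and `apt` again assumes a Debian-style server.

```yaml
# An earlier iteration installed a package and dropped a config file that
# turned out to be wrong. Instead of just deleting those tasks, flip them to
# "absent" for one run so the server is cleaned up, then remove them.
- name: Remove package installed by an earlier iteration
  ansible.builtin.apt:
    name: apache2   # hypothetical package name
    state: absent

- name: Remove config file left behind by an earlier iteration
  ansible.builtin.file:
    path: /etc/nginx/conf.d/old-site.conf   # hypothetical path
    state: absent
```

This only cleans up what the tasks name explicitly, so, as the comment says, there is no guarantee that every leftover is caught.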

Ansible has a ramp-up time before you exceed your prior level of productivity, but once you do, your system state is easy to assure. Your practices appear to be out of sync with your end goals. You can be productive with a CM toolset while maintaining solid engineering practices, but it takes time to structure it correctly. You are essentially trading efficiency and ease of implementation for stability and enterprise scalability. It is the same reason an experienced professional programmer doesn't write ugly hacks: the consequences always outweigh the benefits.

For starters, you may have too many cooks without clear ownership; if so, expect a tragedy of the commons. Each business priority will trump the system engineering concerns every time, unless it is widely diffused and what remains reflects directly on the responsible engineer.

A CM toolset is not capable of being engineered by admins; this is what I've just come to realize. They can reuse existing work, or POSSIBLY extend a sound base, but even then it would require a burdensome amount of practice enforcement. What an engineer can do is simply NOT what an administrator can do. Many concepts in Ansible are almost the same as in a codebase. Can you teach an admin Python and expect competent results? No, most certainly not. I'd expect a hack job, so you need to make the task structured enough that a hack job is bearable.

So you need to set things up for success: engineer solutions for points of unnecessary administration. Trade low-level systems complexity for things an admin could actually do successfully. A CM toolset will NOT save you from architectural or design mismatches.

The order below is subject to modification, because implementation depends on which path is least disruptive for your present state.

  1. Move any system work related to business workflows to a dedicated Rundeck.

  2. Split out tasks on the box; you may have two or more boxes in one right now.

  3. Reimplement your CM in a more structured manner and follow better Ansible practices, with playbooks representing objects, NOT functions or roles. Each system should be described in one play.

J. M. Becker