
I've been using Puppet for deployment of infrastructure, and most of the work I do is with Web 2.0 companies who are heavily into test-driven development for their web applications. Does anyone here use a test-driven approach to developing their server configurations? What tools do you use to do this? How deep does your testing go?

Jon Topper

5 Answers


I don't think you could use test-driven development as such, but you could certainly apply unit testing to new servers.

Basically, you would deploy the servers, start the services in a test mode, run tests against them from another server (or series of servers), and only then put them into production.

Python scripts that connect to your databases, web pages, and SSH services and then return a PASS/FAIL would be a good start for you (see the sketch below).
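
A minimal sketch of that kind of smoke-test script, assuming plain TCP and HTTP checks are enough to start with; the hostname, ports and URL below are placeholders for your own services:

    #!/usr/bin/env python3
    # Minimal smoke-test sketch: host, ports and URL are placeholders.
    import socket
    import sys
    import urllib.request

    HOST = "test-server.example.com"   # hypothetical test host

    def check_tcp(host, port, name):
        # PASS if we can open a TCP connection to host:port.
        try:
            with socket.create_connection((host, port), timeout=5):
                return (name, True)
        except OSError:
            return (name, False)

    def check_http(url, name):
        # PASS if the URL answers with HTTP 200.
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return (name, resp.status == 200)
        except OSError:
            return (name, False)

    results = [
        check_tcp(HOST, 22, "ssh"),
        check_tcp(HOST, 5432, "database"),
        check_http("http://%s/" % HOST, "webserver"),
    ]

    failed = False
    for name, ok in results:
        print("%-10s %s" % (name, "PASS" if ok else "FAIL"))
        failed = failed or not ok

    sys.exit(1 if failed else 0)

Run it from another machine against the freshly built server and gate the move to production on a zero exit status.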

Or you could just roll this up into a monitoring solution like Zenoss, Nagios, or Munin: test during deployment, then keep monitoring in production.

Joseph Kern

I believe the following links could be of interest:

  1. cucumber-nagios - a project which lets you turn your Cucumber suite into a Nagios plugin, and which comes with step definitions for SSH, DNS, Ping, AMQP and generic "execute command" types of tasks
    http://auxesis.github.com/cucumber-nagios/
    http://www.slideshare.net/auxesis/behaviour-driven-monitoring-with-cucumbernagios-2444224
    http://agilesysadmin.net/cucumber-nagios

  2. There is also some effort on the Puppet/Python side of things: http://www.devco.net/archives/2010/03/27/infrastructure_testing_with_mcollective_and_cucumber.php

dolzenko

I think Joseph Kern is on the right track with the monitoring tools. The typical TDD cycle is: write a new test that fails, then update the system so that all existing tests pass. This would be easy to adapt to Nagios: add the failing check, configure the server, re-run all checks. Come to think of it, I've done exactly this sometimes.

If you want to get really hard-core, you would write scripts to check every relevant aspect of the server configuration. A monitoring system like Nagios might not be relevant for some of them (e.g., you might not "monitor" your OS version), but there's no reason you couldn't mix and match as appropriate; a sketch of such a check follows.
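
As a sketch of what one of those scripts could look like when written as a Nagios-style check (exit code 0 for OK, 2 for CRITICAL, 3 for UNKNOWN), assuming a Linux host with /etc/os-release; the expected values are placeholders for whatever your configuration is supposed to enforce:

    #!/usr/bin/env python3
    # Nagios-style check sketch: verify the OS release matches what the
    # configuration is expected to enforce. EXPECTED_* are placeholders.
    import sys

    EXPECTED_ID = "debian"       # hypothetical
    EXPECTED_VERSION = "12"      # hypothetical

    def os_release():
        info = {}
        with open("/etc/os-release") as fh:
            for line in fh:
                if "=" in line:
                    key, _, value = line.strip().partition("=")
                    info[key] = value.strip('"')
        return info

    try:
        info = os_release()
    except OSError as err:
        print("OS VERSION UNKNOWN - %s" % err)
        sys.exit(3)

    if info.get("ID") == EXPECTED_ID and info.get("VERSION_ID") == EXPECTED_VERSION:
        print("OS VERSION OK - %s %s" % (info["ID"], info["VERSION_ID"]))
        sys.exit(0)

    print("OS VERSION CRITICAL - got %s %s, expected %s %s"
          % (info.get("ID"), info.get("VERSION_ID"), EXPECTED_ID, EXPECTED_VERSION))
    sys.exit(2)

Because it follows the plugin exit-code convention, the same script works as a one-off test during deployment and as a scheduled check once the box is in production.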

Zac Thompson
    I skipped a step in the canonical TDD cycle: refactoring. For server admin this is analogous to migrating or redistributing services to achieve better configurations after each change: I think this is pretty much the job description for most admins these days already – Zac Thompson Jul 07 '09 at 05:21
  • This approach is largely what I'm already doing (though s/Nagios/Zabbix/) however, these changes go directly into production, and it feels like I could do better. – Jon Topper Jul 07 '09 at 10:50
  • How much better do you want to get? If you want to avoid doing test-first in production, you need a test environment that adequately mimics your production config. By "adequately", I mean sufficient to test your puppet automation in the test environment, and only apply to production once you are sure it's correct. Of course, this will cost a non-zero amount of money for hardware. I didn't suggest this as part of the answer because it's independent from the "test-driven" part. – Zac Thompson Jul 10 '09 at 05:33

While I haven't been able to do TDD with Puppet manifests yet, we do have a pretty good cycle to prevent changes from going into production without testing. We have two puppetmasters set up: one is our production puppetmaster and the other is our development puppetmaster. We use Puppet's "environments" to set up the following:

  • development environments (one for each person working on Puppet manifests)
  • testing environment
  • production environment

Our application developers do their work on virtual machines which get their Puppet configurations from the development Puppetmaster's "testing" environment. When we are developing Puppet manifests, we usually set up a VM to serve as a test client during the development process and point it at our personal development environment. Once we are happy with our manifests, we push them to the testing environment where the application developers will get the changes on their VMs - they usually complain loudly when something breaks :-)

On a representative subset of our production machines, there is a second puppetd running in noop mode and pointed at the testing environment. We use this to catch potential problems with the manifests before they get pushed to production.

Once the changes have passed (i.e. they don't break the application developers' machines and they don't produce undesirable output in the logs of the production machines' "noop" puppetd processes), we push the new manifests into production. We have a rollback mechanism in place so we can revert to an earlier version.
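
For anyone wanting to replicate this, a rough sketch of the environment setup in puppet.conf follows; the paths and environment names are illustrative rather than our exact configuration:

    # puppet.conf on the puppetmaster (paths/names are illustrative)
    [development]
        manifest   = /etc/puppet/environments/development/manifests/site.pp
        modulepath = /etc/puppet/environments/development/modules

    [testing]
        manifest   = /etc/puppet/environments/testing/manifests/site.pp
        modulepath = /etc/puppet/environments/testing/modules

    [production]
        manifest   = /etc/puppet/environments/production/manifests/site.pp
        modulepath = /etc/puppet/environments/production/modules

    # On the "canary" production machines, the second puppetd runs against
    # the testing environment without applying anything, e.g.:
    #   puppetd --test --noop --environment testing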

Paul Lathrop

I worked in an environment that was in the process of migrating to a TDD operations model. For some things, like monitoring scripts, this worked very well. We used buildbot to set up the testing environment and run the tests. In this case you approach TDD from the perspective of "Legacy Code". In TDD, "Legacy Code" is existing code that has no tests, so the first tests don't fail; they define correct (or expected) operation.

For many configuration jobs the first step is to test whether the configuration can be parsed by the service, and many services provide facilities to do just this: Nagios has a preflight mode, cfagent has a "no action" mode, and Apache, sudo, BIND, and many others have similar facilities. This is basically a lint run for the configurations (see the sketch below).
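
A sketch of what such a lint run can look like when scripted, using the usual syntax-check commands; the exact commands and config paths vary by distribution, so treat these as placeholders:

    #!/usr/bin/env python3
    # Config "lint run" sketch: each command only parses/validates a
    # configuration and exits non-zero on errors. Paths are placeholders.
    import subprocess
    import sys

    CHECKS = [
        ("apache", ["apachectl", "configtest"]),
        ("sudoers", ["visudo", "-c"]),
        ("bind", ["named-checkconf", "/etc/bind/named.conf"]),
        ("nagios", ["nagios", "-v", "/etc/nagios/nagios.cfg"]),
    ]

    failed = False
    for name, cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        status = "PASS" if result.returncode == 0 else "FAIL"
        failed = failed or result.returncode != 0
        print("%-10s %s" % (name, status))

    sys.exit(1 if failed else 0)

Hook something like this into buildbot and a broken config never makes it past the first step.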

For example, if you use Apache with separate config files for the differing parts, you can test those parts individually by wrapping them in a different httpd.conf for your test machine, and then check that the webserver on the test machine gives the correct results.

At every step along the way you follow the same basic pattern: write a test, make the test pass, refactor the work you've done. As mentioned above, when following this path the tests may not always fail first in the accepted TDD manner.

Rik

Rik Schneider