We're trying to cut down our deploy times and are looking for suggestions and shortcuts that you or your team use to get out of the data center and back in the terminal. We're looking at the entire process, from ordering gear to end of life.
3 Answers
If you are large enough to worry about big deployments you are large enough to use some sort of database for machine information. It should contain info about IP addresses, MAC addresses, and machine names and roles as well as the normal model and vendor info. Use this to populate configuration and installation tools.
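As a rough illustration of the idea above, here is a minimal sketch of such a database using Python's built-in `sqlite3`. The table layout, host names, addresses, and vendor/model strings are all made up for the example; a real inventory would carry more fields. It shows one table feeding two consumers: DHCP host stanzas in ISC dhcpd style, and a role lookup.

```python
import sqlite3

# Hypothetical sample data: (name, mac, ip, role, vendor, model)
SAMPLE = [
    ("web01", "00:16:3e:aa:bb:01", "10.0.0.11", "web", "ExampleVendor", "X100"),
    ("db01", "00:16:3e:aa:bb:02", "10.0.0.21", "db", "ExampleVendor", "X200"),
]


def build_inventory(rows):
    """Create an in-memory inventory table and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE machines ("
        "name TEXT PRIMARY KEY, mac TEXT UNIQUE NOT NULL, "
        "ip TEXT UNIQUE NOT NULL, role TEXT NOT NULL, "
        "vendor TEXT, model TEXT)"
    )
    conn.executemany("INSERT INTO machines VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def dhcp_host_entries(conn):
    """Render one dhcpd-style host stanza per machine, ordered by name."""
    query = "SELECT name, mac, ip FROM machines ORDER BY name"
    return [
        f"host {name} {{ hardware ethernet {mac}; fixed-address {ip}; }}"
        for name, mac, ip in conn.execute(query)
    ]


def machines_with_role(conn, role):
    """Return the names of all machines assigned to a role."""
    query = "SELECT name FROM machines WHERE role = ? ORDER BY name"
    return [row[0] for row in conn.execute(query, (role,))]


if __name__ == "__main__":
    inventory = build_inventory(SAMPLE)
    print("\n".join(dhcp_host_entries(inventory)))
    print(machines_with_role(inventory, "web"))
```

The `UNIQUE` constraints are there so that a duplicate MAC or IP address is rejected when the record is entered rather than discovered during installation.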
If it is just a few machines, the sys admins may be the best choice for rack and stack. If the deployment is more than 5-10 machines, hire a contractor to do the unpacking, racking, and cabling. They do this frequently enough that they can accomplish the task faster and for less money than 1-2 sys admins and some volunteers (read: interns).
Have an automated installation setup. For Linux this means something like FAI (Debian & Ubuntu) or Kickstart (RHEL & CentOS). Solaris uses JumpStart and Windows uses WDS. x86/x86_64 hardware almost always supports DHCP and PXE. You may need to use BOOTP for other servers. Use the database mentioned above to feed the configuration. Test that the installation configuration scripts do what you expect. Then have your machines turned on as the final step of the hardware installation.
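To make "feed the configuration from the database" concrete, here is a small sketch for the PXE case, assuming PXELINUX is the boot loader. PXELINUX looks under `pxelinux.cfg/` for a file named after the client's MAC address: `01-` (the ARP hardware type for Ethernet) followed by the address in lowercase hex, separated by dashes. The kickstart URL and the kernel/initrd file names below are placeholders, and `ks=` is the boot option older RHEL/CentOS installers use (newer releases spell it `inst.ks=`).

```python
import re


def pxelinux_config_name(mac):
    """Return the per-host file name PXELINUX looks for in pxelinux.cfg/."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    pairs = [digits[i:i + 2] for i in range(0, 12, 2)]
    # "01" is the ARP hardware type for Ethernet.
    return "01-" + "-".join(pairs)


def pxelinux_config_body(kickstart_url):
    """Return a minimal boot menu that starts an unattended install.

    The kernel and initrd names are placeholders for whatever your
    TFTP server actually serves.
    """
    lines = [
        "default install",
        "label install",
        "  kernel vmlinuz",
        f"  append initrd=initrd.img ks={kickstart_url}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host record, as it might come out of the inventory.
    mac = "00:16:3E:AA:BB:01"
    url = "http://install.example.com/ks/web01.cfg"
    print(pxelinux_config_name(mac))
    print(pxelinux_config_body(url))
```

Generating these files from the inventory means a new machine only has to be powered on; nobody picks an install profile by hand.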
Have a configuration management system that has definitions for the types of machines and services you use. Puppet and cfengine are popular but there are many others. Have server roles come from the database mentioned above. This is vital as you grow. The configuration management tools will ensure that all servers have the right version of software and all of the needed configuration for the services they provide. Call this on initial boot after your install. Run through a few iterations with a fresh install to ensure that everything is right.
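This is not how Puppet or cfengine are implemented, but the core idea of role-driven configuration can be sketched in a few lines: each role declares what it needs, the tool compares that with what is on the machine, and it applies only the difference. The role names and package lists here are invented for the example.

```python
# Hypothetical role definitions; "base" applies to every machine.
ROLE_PACKAGES = {
    "base": {"openssh-server", "ntp"},
    "web": {"nginx"},
    "db": {"postgresql"},
}


def desired_packages(roles):
    """Return the union of packages required by "base" and the given roles."""
    unknown = [role for role in roles if role not in ROLE_PACKAGES]
    if unknown:
        raise KeyError(f"no definition for roles: {unknown}")
    wanted = set(ROLE_PACKAGES["base"])
    for role in roles:
        wanted |= ROLE_PACKAGES[role]
    return wanted


def plan_changes(roles, installed):
    """Return the sorted list of packages that still need installing."""
    return sorted(desired_packages(roles) - set(installed))


def converge(roles, installed):
    """Simulate applying the plan and return the resulting package set."""
    return set(installed) | set(plan_changes(roles, installed))


if __name__ == "__main__":
    state = {"ntp"}
    print(plan_changes(["web"], state))   # what a first run would install
    state = converge(["web"], state)
    print(plan_changes(["web"], state))   # a second run has nothing to do
```

The second run producing an empty plan is the property worth checking when you iterate on a fresh install: a correct definition converges and then stays quiet.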
If possible, once everything is installed, give everything a few days of burn-in before you start throwing user traffic at it. Set your monitoring to email you if there is a problem but not to page anyone during the burn-in. If a burn-in isn't possible, be prepared for more problems than normal until you have sorted out any early failures.
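The email-but-don't-page rule is easy to encode if your monitoring lets you script alert routing. A minimal sketch, assuming a three-day burn-in (the answer only says "a few days") and hypothetical channel names:

```python
from datetime import datetime, timedelta

# Assumption: three days stands in for "a few days" of burn-in.
BURN_IN = timedelta(days=3)


def alert_channel(deployed_at, now, burn_in=BURN_IN):
    """Route alerts to email during burn-in and to the pager afterwards."""
    if now - deployed_at < burn_in:
        return "email"
    return "page"


if __name__ == "__main__":
    deployed = datetime(2009, 8, 17, 9, 0)
    print(alert_channel(deployed, datetime(2009, 8, 18, 9, 0)))
    print(alert_channel(deployed, datetime(2009, 8, 21, 9, 0)))
```

Storing the deployment date in the machine database keeps this automatic: hosts leave burn-in on their own instead of waiting for someone to flip a flag.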
After each deployment have a retrospective. Identify what went well and what didn't. Determine what needs to be improved and make the improvements. This can be as formal or informal as you want. The retrospective is as important as the other steps. It is how you improve the process.
I cannot stress enough that you will need to test the process before and during any deployment. The tests should be as automated as you can make them. As you become more familiar with the gotchas for your deployments you should improve your testing.
This is how a small team (2 people) can add 50 or more machines in less than 12 man hours not including the time spent unpacking, racking, cabling, etc.
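On the testing point, automated checks can start very small: compare what the database says a machine should be with what the freshly installed machine reports. The sketch below assumes you already have some way to collect the observed facts (that part is not shown), and the field names are hypothetical.

```python
CHECKED_FIELDS = ("name", "ip", "mac", "role")


def verify_host(expected, observed):
    """Return a list of human-readable mismatches; empty means the host passed."""
    failures = []
    for field in CHECKED_FIELDS:
        want = expected[field]
        got = observed.get(field)
        if got != want:
            failures.append(f"{field}: expected {want!r}, got {got!r}")
    return failures


if __name__ == "__main__":
    expected = {"name": "web01", "ip": "10.0.0.11",
                "mac": "00:16:3e:aa:bb:01", "role": "web"}
    # Hypothetical facts gathered from the machine after its first boot.
    observed = {"name": "web01", "ip": "10.0.0.12",
                "mac": "00:16:3e:aa:bb:01"}
    for problem in verify_host(expected, observed):
        print(problem)
```

Each gotcha you hit in a deployment can become one more field or check here, which is how the tests improve over time.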
+1 for point 6. Retrospective and process improvement you can't buy nor outsource. – slovon Aug 20 '09 at 11:10
There's a lot to be learned from Henry Ford if you are looking to deploy numerous identical (or almost identical) items, of anything.
If you have say 100 brand new servers, all sitting on their packing crates, just begging to be configured, it makes a lot of sense to set up a production line. Obviously before your production line starts work you will need to set up one of them so you know exactly what needs to be done (and often to create your master image that will be deployed onto the other 99). But I digress.
Henry Ford proved that if you want to speed up your production of anything, get one person to do just the one job, but have a lot of people all doing different jobs. E.g.
- One person takes the box and opens it
- One person takes out the styrofoam, puts the manuals and cables somewhere the person later on will use them, and prints a label for the front of the server
- One person takes the server and stacks it in front of the appropriate rack
- One person takes the server, installs the rack kit, mounts it into the rack, plugs in the cables and turns it on.
- One person sets the server up to PXE boot, or inserts the installation media, or whatever
- One person monitors or conducts the installation process
- One person (a different person to who installed it) verifies the installation and makes the minor changes needed (computer name, etc)
Obviously this is going to require more than just one person, but even with two people this can be highly effective. As soon as someone finishes their job they assume the next available job. It's also only useful if you have a lot of the same item, and they're all pretty much identical.
By the time they reach the 100th iteration of their job they will be very efficient at it.
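For a back-of-the-envelope feel for the production line, here is a small sketch that compares elapsed time for one person doing every step on every server against a line with one person per step. The step durations are invented, and the model ignores the speed-up people gain from repetition, so treat it as a rough planning aid only.

```python
def serial_minutes(step_minutes, servers):
    """Elapsed time when one person does every step for each server in turn."""
    return sum(step_minutes) * servers


def pipeline_minutes(step_minutes, servers):
    """Elapsed time with one person per step and servers flowing down the line.

    The first server takes the full sum of the steps; after that one
    finished server comes off the line every max(step_minutes) minutes.
    """
    if servers <= 0:
        return 0
    return sum(step_minutes) + (servers - 1) * max(step_minutes)


if __name__ == "__main__":
    # Hypothetical minutes per step: unbox, unpack and label, stage,
    # rack and cable, set boot device, install, verify.
    steps = [1, 2, 1, 10, 2, 15, 5]
    print(serial_minutes(steps, 100))
    print(pipeline_minutes(steps, 100))
```

The slowest step sets the pace of the whole line, which is the argument for having whoever finishes early pick up the next available job rather than wait.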
Things to note: Don't keep one person on the same job for too long. On a car assembly line we're talking about 3 days in a row, but in server deployment it could be more like 3 hours.
Also, a lot of these tasks are fairly unskilled (unpacking, screwing in bolts, etc.), which means that if you can grab a work experience kid or a brand new intern, it frees your own team for the more skilled tasks (cabling, OS, etc.).
It really depends on what you are trying to provision. I have previously used a setup whereby we have corporate standard configs for servers available in Dell Premier. So we can just log in and order 1 new web server and the config will be already pre-specified.
Once the hardware has arrived, we plug it in and boot from a USB thumb drive. The pre-boot environment asks what role the new server is going to perform. Once the selection is made, the server is imaged, configured, and up and running.
This works well but took a reasonable amount of effort to set up. It also works because the whole environment is set up to scale horizontally, with servers assigned to very specific roles.
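The role prompt itself can be very simple. This is not the actual tool described above, just a sketch of the idea: a catalog maps each role to an image and a list of post-install steps, and the operator's menu choice selects an entry. All role, image, and step names are hypothetical.

```python
# Hypothetical catalog: role -> image file and post-install steps.
ROLE_CATALOG = {
    "web": {"image": "web-base.img", "post_steps": ["set-hostname", "join-pool"]},
    "db": {"image": "db-base.img", "post_steps": ["set-hostname", "attach-storage"]},
}


def menu_lines(catalog):
    """Return numbered menu lines, with roles in alphabetical order."""
    return [f"{i}) {role}" for i, role in enumerate(sorted(catalog), start=1)]


def role_from_choice(choice, catalog):
    """Map menu input such as "2" to a role name, or None if it is invalid."""
    roles = sorted(catalog)
    try:
        index = int(choice) - 1
    except ValueError:
        return None
    if 0 <= index < len(roles):
        return roles[index]
    return None


if __name__ == "__main__":
    print("\n".join(menu_lines(ROLE_CATALOG)))
    role = role_from_choice(input("Role for this server: "), ROLE_CATALOG)
    if role is None:
        print("Unknown selection, nothing done.")
    else:
        plan = ROLE_CATALOG[role]
        print(f"Would write {plan['image']} then run {plan['post_steps']}")
```

Keeping the catalog small is what makes this workable; it relies on servers having very specific roles, as noted above.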
I have to be honest and say it didn't always happen (usually because the extra capacity was required at the last minute). But when it did happen we would fire up the machines in the dev/test environment (the same deployment process applied to dev and test as well as prod) and leave them running there. Once we were happy, they would be shut down, relocated, and then re-imaged. – Sam Aug 20 '09 at 05:46