17

I've had discussions in the past with other people in my department about documentation, specifically its level of detail and requirements. In their view, documentation is a simple checklist of Y things to do when X things go wrong.

I disagree. This presumes that every issue in IT can easily be boiled down to a simple checklist of recovery procedures, and it completely ignores the complexity of the situation. The other people in the department don't always have a deep understanding of the issue (which is why I'm writing the document: so they have something to refer to), so I think the documentation should include some basic background material, such as:

  • Purpose of the (sub)system in question
  • Why it is configured in that manner
  • Expectations of events to occur when the settings/procedures are implemented
  • Potential issues that can cause procedures to fail

However, I'm rather outvoted on this, so my documentation is required to be re-written into a form that says "Steps A-B-C applied in order will resolve problem X". I often hear the lament that it needs to fit onto a single page of paper. Try explaining the configuration of Squid ACLs to someone in this manner, including troubleshooting, through a single-page document. That's just one of a half-dozen documents that are "waiting to be written" as recovery checklists.

Is the method I'm advocating really going overboard? Or are they right, and I should just mind my business here and just write them a simple checklist? My concern is that, no matter how well you write a procedure checklist, it really doesn't solve an issue that requires a SysAdmin to think things through. If you're spending time doing a checklist of recovery procedures that ends up not resolving the issue (because there are additional factors that aren't a part of the document, due to the narrow focus of the document), and the purpose of the document was to avoid re-reading man pages and wikis and websites all over again, then why am I going through the motions? Am I just worrying too much, or is this a real issue?

EDIT:

There is currently no helpdesk position in the department. The audience for the documentation would be the other admins or the department head.

Avery Payne
  • 1
    Regarding your edit: If your department head wants to understand every bit of information, he's probably doing vast amounts of micromanagement. You should talk to him about streamlining some process to make sure that at least one person on site can work with the given documentation at any time, not that he understands all of it. – Martin M. Jun 14 '09 at 17:16

9 Answers

14

Actually neither: we use Documentation As-a-Testcase.

That being said, we do have written documentation in the Documentation As-a-Manual style. We had checklists in place, but as we grew we found them error-prone and failing when applied to the system as a whole.

We do, however, have a kind of "Documentation As-a-Checklist" in place: as mentioned above, we extensively monitor our services. There's a saying:

If you aren't monitoring it you aren't managing it

That is so totally true, but another one should be:

If you aren't monitoring it you aren't documenting it

Whenever we need to migrate services, we keep the relevant "Service Group" or "Host Group" view open (we use Nagios, as you can guess from the vocabulary), and the migration isn't considered done as long as there's a single red point on any of the services.

The tests provide a much better checklist than any hand-written checklist could. It's actually self-documenting: as soon as we have a failure that wasn't monitored yet, a test for it will at least be added to Nagios, and with that we get two things for free:

  • we know when it fails
  • another point on the checklist
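To illustrate the "no red points means done" idea, here is a minimal Python sketch. It does not talk to Nagios; the `ServiceCheck` class, the host and service names, and the status data are all made up for the example. The only real-world detail is the standard Nagios plugin return codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The sketch also assumes that any non-OK state blocks the migration, which is stricter than "red" alone.

```python
from dataclasses import dataclass

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass(frozen=True)
class ServiceCheck:
    """One monitored service on one host (hypothetical structure)."""
    host: str
    service: str
    state: int


def open_items(checks):
    """Return the checks that are not OK, i.e. the unfinished checklist items."""
    return [check for check in checks if check.state != OK]


def migration_done(checks):
    """The migration counts as done only when no check is left open."""
    return not open_items(checks)


# Made-up status data standing in for a real Nagios host group.
checks = [
    ServiceCheck("web01", "HTTP", OK),
    ServiceCheck("web01", "Disk /var", CRITICAL),
    ServiceCheck("db01", "MySQL", OK),
]

for check in open_items(checks):
    print(f"still open: {check.host} / {check.service}")
print("migration done:", migration_done(checks))
```

Every new test added after an unmonitored failure becomes one more entry in `checks`, so the checklist grows without anyone editing a document.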

The "real" documentation is kept in a Wiki that covers the odds and ends of each specific service or test. If something is missing, people will add it as soon as we need to do a migration or some other work with that service. So far that approach has worked very well.

Erroneous documentation also gets ironed out really fast. Every time something fails, people look up the documentation and try to resolve the issue with the HowTos in there; if one is wrong, you just update it with your findings.

Just think of the errors one could possibly create by following a checklist without having any tests installed that give you a green checkbox after applying the steps. I don't think it's possible to separate Documentation from Monitoring.

Martin M.
11

When writing mine I've always devolved into writing three sets: the get-er-done checklist, followed by a list of probable problems and their resolutions, followed by a MUCH LONGER appendix about the architecture of the system. That appendix covers how the system works, why things are done the way they are, probable sticking points when coming online, abstract design assumptions, and other information useful for pointing people in the right direction should something unique happen.

At my last job we were required to write doc so that even level-1 helpdesk people could bring things back up. This required checklists, which generally became out of date within 3 months of the writing. We were strongly urged to write troubleshooting guides whenever possible, but when the contingency tree gets more than three branches in it, you just can't write that doc without going abstract.

When leaving my last job, I turned in a 100 page 'how to do my job' manual before I left. It had the abstract stuff in it, design philosophy, as well as integration points. Since I was presumably writing for another sysadmin who was going to replace me, I aimed it at someone who could take abstract notions and turn them into concrete actions.


Five years have passed and I find my opinion on this has shifted somewhat. Both Document as Manual and Document as Checklist have very valuable places in the hierarchy of documentation and both need to be produced. They target very different audiences, though.

Document as Checklist

The target market for this kind of documentation is coworkers who want to know how to do a thing. It covers two cases:

  • Coworkers who just want to know how to do a thing and don't have time to thumb through a fifteen page manual and figure out the steps for themselves.
  • Procedures that are fairly complex in steps, but only need to be run once in a while.

Impatience is the driver for the first kind. Maybe your coworker doesn't actually want to know why the output has to be piped through a 90 character perl regex, just that it has to be in order to close the ticket. Definitely include a statement like, "For a detailed explanation for why this workflow looks like this, follow this link," in the checklist for those that do want to know why.

The second point is for procedures that aren't run often but contain pitfalls. The checklist acts as a map to avoid the Certain Doom of just winging it. If the checklist is kept in a documentation repo, it saves having to search email for the time the old admin sent out a HOWTO.

In my opinion good checklist-documentation also includes sections on possible failure points, and responses to those failures. This can make the document rather large and trigger TL;DR responses in coworkers, so I find that making the failure-modes and their responses a link from the checklist rather than on the page itself makes for an unscary checklist. Embrace hypertextuality.

Document as Manual

The target market for this kind of documentation is people who want to learn more about how a system works. The how-to-do-a-thing style documentation should be derivable from this documentation, but more commonly I see it as a supplement to checklist-style documentation to back up the decisions made in the workflow.

This is the documentation where we include such chewy pieces as:

  • Explaining why it's configured this way.
    • This section may include such non-technical issues as the politics surrounding how the whole thing was purchased and installed.
  • Explaining common failure modes and their responses.
  • Explaining any service-level-agreements, both written and de facto.
    • De facto: "if this fails during finals week it's a drop-everything problem. If during summer break, go back to sleep and deal with it in the morning."
  • Setting out upgrade and refactoring goals.
    • The politics may be different later, so why don't we fix some of the bad ideas that were introduced in the beginning?

All of these are very useful for obtaining a comprehensive understanding of the whole system. You don't need a comprehensive understanding to run simple human-automation tasks; you need it to figure out why something broke the way it did and to have an idea of where to change things so it doesn't do that again.


You also mentioned Disaster Recovery documentation that has to be a checklist.

I understand, you have my sympathies.

Yes, DR documentation does need to be as checklist-like as possible.
Yes, DR documentation is the most resistant to checklisting due to how many ways things can break.

If your DR checklist looks like:

  1. Call Dustin or Karen.
  2. Explain the problem.
  3. Stand back.

You have a problem. That is not a checklist, that is an admission that the recovery of this system is so complex it takes an architect to figure out. Sometimes that's all you can do, but try to avoid it if at all possible.

Ideally DR documentation contains procedure checklists for a few different things:

  • Triage procedures to figure out what went wrong, which will help identify...
  • Recovery procedures for certain failure cases, which are supported by...
  • Recovery scripts written well beforehand to help minimize human error during recovery.
  • Manual-style documentation about the failure cases, why they occur and what they mean.

Triage procedures are sometimes all the DR documentation you can make for some systems. But having it means the 4am call-out will be more intelligible and the senior engineer doing the recovery will be able to get at the actual problem faster.

Some failure cases have straight-forward recovery procedures. Document them. While documenting them you may find cases where lists of commands are being entered in a specific order, which is a great use-case for scripting; it can turn a 96 point recovery procedure into a 20 point one. You'll never figure out if you can script something until you map the recovery procedure action by action.
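As a rough sketch of what "mapping the procedure action by action, then scripting it" can look like, the Python below runs an ordered list of steps and stops at the first failure, reporting which step broke. The step descriptions and commands are placeholders invented for this example (each one just runs a trivial Python command so the sketch works anywhere); a real script would carry the commands from the written recovery procedure.

```python
import subprocess
import sys

# Placeholder steps: hypothetical descriptions paired with harmless commands.
# Real steps would be the commands taken from the written procedure.
RECOVERY_STEPS = [
    ("stop the service", [sys.executable, "-c", "print('stopped')"]),
    ("restore the config", [sys.executable, "-c", "print('restored')"]),
    ("start the service", [sys.executable, "-c", "print('started')"]),
]


def run_recovery(steps):
    """Run steps in order. Return 0 on success, or the number of the failed step."""
    for number, (description, command) in enumerate(steps, start=1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Stop immediately so later steps never run on a broken state.
            print(f"step {number} FAILED: {description}")
            print(result.stderr.strip())
            return number
        print(f"step {number} ok: {description}")
    return 0


failed_step = run_recovery(RECOVERY_STEPS)
```

The human-facing checklist then shrinks to "run the script; if it reports a failed step, go to the triage section for that step", which is where the reduction in checklist points comes from.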

Manual-style documentation for failure cases is the last ditch backstop to be used when there ARE no recovery procedures or the recovery procedures failed. It provides the google-hints needed to maybe find someone else who had that problem and what they did to fix it.

sysadmin1138
  • This sounds very similar to the environment I'm in (minus the helpdesk). +1 for "why I did it that way" – Avery Payne Jun 14 '09 at 05:33
  • @sysadmin1138, You stated *"When leaving my last job, I turned in a 100 page 'how to do my job' manual before I left"*. Why did you do that? Is this actually required? Otherwise, why bother with a 100 pages since you are already leaving? – Pacerier Aug 04 '15 at 05:08
  • 1
    @Pacerier That was... 12 years ago. And I was the *sole admin* covering certain things. Also, I liked that company so wanted a clean hand-off. Other companies have locked me into 2 weeks of 'knowledge sharing sessions' which were intended to do the same kind of thing. Only less reliable, since human memory is really bad. – sysadmin1138 Aug 04 '15 at 18:28
  • `don't have time` could be *`don't have time`*. Overall, reusable experience! – n611x007 Sep 22 '15 at 09:42
5

It depends on the target audience for your documentation.

For helpdesk (level 1) types, a checklist is the correct way to go; of course, this presumes that there are higher levels of support with the deeper knowledge you describe.

If the documentation is for the systems group, I always err on the side of more documentation. It's hard enough to have adequate documentation just to get by. If someone (yourself) wants to write more extensive docs with the requisite background information, no sane individual should stand in your way!

Joe
  • +1 Documentation should always be written with the target audience in mind. They are reading the document to get something out of it: is that knowledge, or is it how to do something? If it's how to do something that may require a bit of extra knowledge, I've found putting the extra knowledge in an Appendix is a good way to go. – Paul Rowland Aug 14 '09 at 01:39
5

Personally I try and keep documentation as simple as possible. It tends to include:

  • What [X] is supposed to do (requirements).
  • How [X] has been structured at a high level (design).
  • Why I implemented [X] in the way I did (implementation considerations).
  • How the implementation of [X] is non-standard (workarounds).
  • Common issues with [X] and how to resolve them (issues).

So admittedly my common issues section is likely to be a brief description of what has happened and dot point walkthroughs on how to fix it.

I usually assume some knowledge from the reader of the system in question (unless it is particularly arcane). I don't have time to make most of my technical documentation level 1 or management readable - but a cluey level 3 should be fine.

Neobyte
4

I think it obviously depends on the topic. Not everything can be reduced to a simple checklist, and not everything needs a detailed user manual.

It certainly sounds like you're talking about internal documentation, and in my experience you can't just give a list of steps, because there are too many choices. Even creating a user account has some options (in our environment) so our "New Account" document lists the basic steps in order, but for each step has a description of what the variations are.

On the other hand, we never got around to writing much of a document for "How to patch in an office," but our very sketchy document also wasn't a checklist - it mentioned the convention we used for the colours of cables, emphasized that you had to update the spreadsheet that listed the connections, and that was about it.

So, now that I've written this, I totally agree: step-by-step checklists just won't cut it for lots of processes.

Ward - Reinstate Monica
3

I strongly agree with you on this (in favor of exhaustive documentation) in part because I'm used to having predecessors who did NOT have much interest in docs at all. As has been said in related posts, writing it out is not only good for others, but helps you to more fully understand your environment and solidify it in your own mind. It's an end unto itself.

As an aside, I find that a lot of the pushback comes from an odd belief that crappy/nonexistent documentation = job security. That kind of thinking just seems dishonest and shady.

Kudos to you for bucking the status quo.

Kara Marfia
3

I document quite a lot; I even have a document priority checklist :-). However, I will not document stuff that can be considered generic knowledge (i.e. a reasonably good description of the problem gives an answer within the first page of Google).

For anyone interested here is my doc prio checklist (works for me, might not for you, comments and suggestions are highly appreciated):

  1. Create a personal log/diary in which you write down everything you did, work- and knowledge-wise
  2. Services: which service is where, what does it do and for whom (any SLAs? should one be created?)
  3. Servers: which server is where, how old is it and when is it written off
  4. As above but for network equipment, UPS and the like
  5. As above but for all user machines
  6. Layout of the physical network (which cable goes where, how long is it and what is the measured quality)
  7. Configuration overview of software and licenses for servers
  8. Configuration overview of software and licenses for user machines
  9. Configuration overview of switches, routers and other dedicated hardware
  10. Contracts and SLAs of all external parties on which my service directly depends (ISP, domain etc.)
  11. Set up a ticket system with an integrated wiki to put all the above stuff in.
  12. For every incident create a ticket (even if you only use it for yourself).
  13. Create a script that on Sunday gathers tickets and groups them by problem type.
  14. On Monday create an automatic solution or end-user howto document for the most frequently occurring problem
  15. If time permits, do the next one.
  16. If nothing more to do, look for a new job and give the person who replaces you the log ;-)
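For steps 13 and 14, a minimal Python sketch of the grouping could look like the following. The ticket records and the `problem_type` field are invented for illustration; a real script would pull the week's tickets from whatever export or API the ticket system offers.

```python
from collections import Counter

# Made-up tickets standing in for one week's export from the ticket system.
tickets = [
    {"id": 101, "problem_type": "password reset"},
    {"id": 102, "problem_type": "printer offline"},
    {"id": 103, "problem_type": "password reset"},
    {"id": 104, "problem_type": "vpn"},
    {"id": 105, "problem_type": "printer offline"},
    {"id": 106, "problem_type": "password reset"},
]


def most_common_problems(tickets, top=3):
    """Count tickets per problem type and return the most frequent ones first."""
    counts = Counter(ticket["problem_type"] for ticket in tickets)
    return counts.most_common(top)


for problem_type, count in most_common_problems(tickets):
    print(f"{count:3d}  {problem_type}")
```

The first line of the output is Monday's candidate for an automatic solution or a howto.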
1

A checklist is fine, as long as it's not pretending to be complete documentation. The last few documents I wrote came in two parts: XYZ for Users, which included pretty screenshots on how to use it; and XYZ for System Administrators, which included how it was set up and why (possibly even including the business requirement for it to exist), traps to avoid, and a section on troubleshooting. Troubleshooting is where I'd expect to find the checklists. Perhaps writing the checklists as XYZ for Tech Support might help make a point.

Now, reading between the lines, focusing only on checklists indicates to me a lack of "Big Picture" thinking and long term planning that I'd expect from someone who: has only ever done tech support; or a junior admin just starting out; or is so swamped with work they have no time to think about it; or has never been pushed to think about it; or just plain doesn't care. I don't know which of these, if any, apply in your case.

pgs
  • The override is primarily from the department head. But others agree. I still stick to my guns and type up as much as I can with what little time I have left in the day. – Avery Payne Jan 23 '14 at 00:39
1

I'm pretty big on documentation. The company where I work now feels that documentation is important, but some people in the company feel they don't have time to write documentation. This can make life difficult for anybody besides the person that originally did it.

For certain things, I've written step-by-step instructions. You can call this a checklist if you want, but it's not really intended to be physically checked off. I call my documentation style the "kindergarten steps". It's patterned after an MCSE exercise book I had for one of the Windows 2000 exams. The steps are very detailed, but the use of bold/italics/typeface makes it easy to gloss over and just pick out the important parts so you don't need to read everything after you've learned the steps.

Some things don't lend themselves well to step-by-step instructions, so I try to provide as much configuration data as I can. Some technically inclined person who ends up maintaining the system down the road will have a better idea of what they are up against, and hopefully it will make their life a little easier when something goes wrong.

Scott