How do I handle post-mortems/incident reporting with Nagios?

Question

I just started using Nagios and I like that my team can acknowledge problems, but I haven't yet found a way to log the solutions that are used to correct the problems. Is there a tool that logs Nagios alerts and provides a way to complete post-mortems and log solutions so that when someone encounters similar problems, they can reference the logged data?

score 3 · Accepted Answer · answered Aug 22 '11 at 06:16


Honestly, I don't think trying to capture this information at fault time is useful. You're stressed, possibly still sleepy, and at the very least in a "fight or flight" mode that isn't conducive to writing good documentation. Nagios can already record quick notes against a service, either as part of the ack or as a separate note you attach to the service or host. Use those as input to the post-mortem you should be doing at leisure after the emergency, then turn them into a more structured, useful, and better-written piece of documentation that lives in a wiki and is linked from the service itself in Nagios (via the `notes_url` field).


womble


I agree with your statement about doing post-mortems the next day and I like the idea of using the notes_url field to tie a service to another URL for extra info. Do you have any recommendations for tools that handle this? I don't think a Wiki is the right choice as I need tools for developer discussions and eventual resolutions. A basic forum might work. – GregB Aug 22 '11 at 21:36
You don't want to link to discussions from a `notes_url` -- you want to link to **the right answer**. That is best recorded in a wiki. That discussions may occur in whatever way you feel is appropriate is orthogonal to the information you want available when you're trying to fix a problem. – womble Aug 22 '11 at 22:53

quanta · Answer 2 · 2011-08-22T05:59:08.357

2

Take a look at event handlers. All you have to do is write a script that handles the event and logs your solution in an issue-tracking system (I like Redmine).
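As a rough illustration of the event-handler idea, here is a minimal sketch in Python. It assumes the handler command is configured to pass the standard macros `$HOSTNAME$`, `$SERVICEDESC$`, `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` as arguments, in that order. The function names, the issue fields and the "print JSON" step are all hypothetical; submitting to Redmine or any other tracker is left out on purpose.

```python
import json
import sys


def build_issue(host, service, state, state_type, attempt):
    """Return a draft issue for a confirmed problem, or None if nothing should be logged."""
    # SOFT states are still being retried by Nagios; only HARD, non-OK states
    # are treated as confirmed problems here (an assumption of this sketch).
    if state_type != "HARD" or state == "OK":
        return None
    return {
        # Hypothetical field names; adapt them to your tracker.
        "subject": f"[{state}] {service} on {host}",
        "description": (
            f"Nagios reported {service} on {host} as {state} "
            f"after {attempt} check attempt(s).\n\n"
            "Root cause:\n\nResolution:\n"
        ),
    }


def main(argv):
    """Entry point: argv[1:6] are the five macro values passed by Nagios."""
    if len(argv) != 6:
        print("usage: HOST SERVICE STATE STATETYPE ATTEMPT", file=sys.stderr)
        return 2
    issue = build_issue(*argv[1:6])
    if issue is not None:
        # Placeholder for the real submission step (tracker API, mail, etc.).
        print(json.dumps(issue))
    return 0
```

To use it as a script, call `sys.exit(main(sys.argv))` under the usual `__main__` guard. The empty "Root cause" and "Resolution" headings are there so that whoever does the post-mortem has somewhere to write the solution afterwards.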

edited Aug 22 '11 at 05:59

answered Aug 22 '11 at 05:45

quanta


score 0 · Answer 3 · answered Aug 22 '11 at 23:05

Where I work we do it the other way around.

We use a ticketing system called 'TopDesk' (which one doesn't really matter). Whenever there is an alert in Icinga (a Nagios fork), it creates a ticket via an HTTP request to the TopDesk server.

So I think it's easier to let Nagios send out warnings and errors via mail, SMS and a ticketing system than to use Nagios itself to keep track of the actions taken.
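For anyone wanting to try the "alert creates a ticket over HTTP" approach, a minimal sketch might look like the following. The URL and the JSON field names are made up for illustration; this is not TopDesk's actual API, so check your ticketing system's documentation for the real endpoint, payload and authentication.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with whatever your ticketing system exposes.
TICKET_URL = "https://tickets.example.com/api/incidents"


def build_ticket_request(host, service, state, output, url=TICKET_URL):
    """Build (but do not send) a POST request describing one alert."""
    payload = {
        # Hypothetical field names, not taken from any real ticketing API.
        "summary": f"{state}: {service} on {host}",
        "details": output,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_ticket(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Building the request separately from sending it makes the payload easy to inspect or test without touching the network. The notification or event-handler command would then call `send_ticket(build_ticket_request(...))` with the macro values it receives.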
