3

I am looking for a tool (or advice) that would allow me to track and log all incidents that happen on my infrastructure.

We have 50+ servers, and that number is going to increase, so I want a better overview of what goes wrong or could go wrong over a period of a month or so, to help me improve the parts of the system or the services that are prone to failure.

For example - if a web server has failed on some instance, or a backup has not finished because there was no space available on the backup server, or there was a DDoS attack, I would like to note that (when, why, where, how we fixed it and so on).

We have central monitoring systems (Check_MK, Logstash + Kibana, network flow analyzers...) and alerting in place, and I can generate availability reports directly from Check_MK, but that report is not accurate and we share it with our customers. What I am asking about is for internal use only.

I have researched a little and have not found much - there seems to be no real standard or tool for this. So I need advice from someone who already deals with this: what tool should I use? Or, if there is no tool (we are quite capable of developing one ourselves), what is the best practice for logging things like this? What do you log?

Igor Hrcek

2 Answers

0

The old answer was off-topic due to a misunderstanding. Keeping it for reference:


There are in fact multiple tools which allow what you want.

For example:

  • logstash (which you already know)
  • graylog
  • Prometheus

Every one of them requires you to define triggers in some way, on which you would then be notified. Diving into this for multiple tools is way too much for this platform, though.

There are multiple major areas that one would need to consider while building a really helpful monitoring and alerting system.

Gathering/Monitoring/Aggregation of:

  • Availability of systems (hardware, software, services)
  • Errors during operation of those systems (logs, correct responses)
  • Changes over time (metrics of system parameters, e.g. disk space and load, response times of services, rollout of newer versions)

Then one would need to define levels for alerting:

  • Host/Service Up/Down
  • Process Running
  • Load over x.xx,x.xx,x.xx
  • Disk space under x.xx
  • Data growing rate bigger than x.xx MB/day
  • http 500 responses > x/second
  • etc
  • We already have all that. I need a tool that would allow me to write down reports as described here: http://serverfault.com/questions/243828/network-incident-report-template – Igor Hrcek Oct 14 '16 at 10:37
0

We use our ticket system (Atlassian Jira) for this:

  • we created a project "operation incidents" with recipients (watchers) enforced at the project level
  • and a new task type "incident" in which each of those items (when, why, where, how it was fixed and so on) has its own form field.

So if some incident happens, we open a new ticket, fill out what we know and keep it current and updated over the length of the incident. After the incident has been fixed and post-processing (root cause analysis mostly) is finished, we close the issue.
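As a rough illustration of that lifecycle (open, keep updated, resolve, close after root cause analysis), here is a minimal Python sketch of what one incident record might hold. This is not how Jira stores tickets; the class, its field names, the host name `backup-01` and the timestamps are all made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """One incident record; every field name here is illustrative."""
    title: str
    where: str                       # affected host or service
    started_at: datetime             # when the incident began
    cause: str = "unknown"           # filled in after root cause analysis
    fix: str = ""                    # how the incident was resolved
    resolved_at: Optional[datetime] = None
    work_log: List[str] = field(default_factory=list)
    status: str = "open"             # open -> resolved -> closed

    def log(self, note: str) -> None:
        """Append a note while the incident is being worked on."""
        self.work_log.append(note)

    def resolve(self, when: datetime, fix: str) -> None:
        """Record that service was restored and how."""
        self.resolved_at = when
        self.fix = fix
        self.status = "resolved"

    def close(self, cause: str) -> None:
        """Close the record once root cause analysis is finished."""
        if self.status != "resolved":
            raise ValueError("resolve the incident before closing it")
        self.cause = cause
        self.status = "closed"

    def time_to_restore(self) -> Optional[timedelta]:
        """Duration from start to resolution, or None while still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at


# Hypothetical incident, loosely based on the backup example in the question.
incident = Incident(
    title="Nightly backup did not finish",
    where="backup-01",
    started_at=datetime(2016, 10, 14, 2, 0),
)
incident.log("Backup job aborted: no space left on device")
incident.resolve(datetime(2016, 10, 14, 3, 30), fix="Pruned old archives")
incident.close(cause="Retention policy was not enforced on backup-01")
print(incident.status, incident.time_to_restore())
```

Keeping `resolve` and `close` as separate steps mirrors the process above, where the ticket stays open until post-processing is done.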

Pros:

  • every stakeholder is involved (or at least informed) from the start
  • customer support has a central point to look for information when customers complain
  • ticket system allows for work log and discussion
  • we have an archive for future reference
  • we can use the built-in reporting functions of Jira to get reports on KPIs such as "time-to-restore"
  • So, basically you have a process very similar to the one I had in mind. I did a little more research on this topic and decided to use Confluence with a specially made template for this purpose. After we log 20-30 incidents, I will create a flexible web tool with its own database and a few interesting features (sorting, search, graphs, piping to logstash/elastic/kibana). – Igor Hrcek Oct 14 '16 at 18:03
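
For a home-grown tool of the kind described in the comment above, a small database already covers sorting, search and a "time-to-restore" report. The following is only a sketch under assumptions: it uses Python's standard `sqlite3` module with an in-memory database, and the table layout, column names and sample rows are invented for illustration.

```python
import sqlite3

# In-memory database for the example; a real tool would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE incidents (
        id          INTEGER PRIMARY KEY,
        service     TEXT NOT NULL,
        summary     TEXT NOT NULL,
        started_at  TEXT NOT NULL,   -- 'YYYY-MM-DD HH:MM:SS'
        resolved_at TEXT             -- NULL while the incident is open
    )
    """
)

# Hypothetical sample data modelled on the examples in the question.
rows = [
    ("web", "Web server crashed on one instance",
     "2016-10-01 08:00:00", "2016-10-01 08:20:00"),
    ("backup", "Backup failed, no space on backup server",
     "2016-10-03 02:00:00", "2016-10-03 03:30:00"),
    ("web", "DDoS attack on the public site",
     "2016-10-09 14:00:00", "2016-10-09 15:00:00"),
]
conn.executemany(
    "INSERT INTO incidents (service, summary, started_at, resolved_at) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def search(term):
    """Return incident summaries containing `term`, oldest first."""
    cur = conn.execute(
        "SELECT summary FROM incidents WHERE summary LIKE ? ORDER BY started_at",
        (f"%{term}%",),
    )
    return [row[0] for row in cur]


def mean_minutes_to_restore():
    """Average time-to-restore in minutes per service, resolved incidents only."""
    cur = conn.execute(
        """
        SELECT service,
               AVG((julianday(resolved_at) - julianday(started_at)) * 24 * 60)
        FROM incidents
        WHERE resolved_at IS NOT NULL
        GROUP BY service
        ORDER BY service
        """
    )
    return {service: round(minutes, 1) for service, minutes in cur}


print(search("backup"))
print(mean_minutes_to_restore())
```

Storing timestamps as ISO-style text keeps them sortable as plain strings and lets SQLite's `julianday` function compute durations directly.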