I am looking for a tool (or advice) that would let me track and log all incidents that happen on my infrastructure.
We have 50+ servers and that number is going to increase, so I want a better overview of the things that go wrong (or could go wrong) over a month or so, which would help me improve the parts of the system or the services that are prone to failure.
For example, if a web server has failed on some instance, or a backup has not finished because there was no space left on the backup server, or there was a DDoS attack, I would like to record that (when, why, where, how we fixed it, and so on).
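To make this concrete, here is a rough Python sketch of the kind of record I have in mind. Every field name, the `Incident` class and the sample values are my own assumptions for illustration, not taken from any existing tool or standard.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Incident:
    """One incident entry; all field names are hypothetical."""
    started_at: datetime           # when it happened
    host: str                      # where it happened
    service: str                   # what was affected
    summary: str                   # what went wrong
    cause: str                     # why it went wrong
    resolution: str                # how we fixed it
    resolved_at: Optional[datetime] = None
    tags: List[str] = field(default_factory=list)

    def duration_minutes(self) -> Optional[float]:
        """Minutes from start to resolution, or None if still open."""
        if self.resolved_at is None:
            return None
        return (self.resolved_at - self.started_at).total_seconds() / 60

    def to_json_line(self) -> str:
        """Serialize to a single JSON line, e.g. for an append-only log file."""
        data = asdict(self)
        for key in ("started_at", "resolved_at"):
            if data[key] is not None:
                data[key] = data[key].isoformat()
        return json.dumps(data, sort_keys=True)


# Made-up example entry for the "backup server out of space" case.
example = Incident(
    started_at=datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc),
    resolved_at=datetime(2024, 1, 15, 3, 30, tzinfo=timezone.utc),
    host="backup01",
    service="nightly-backup",
    summary="Backup did not finish",
    cause="No free space on backup server",
    resolution="Removed old archives and re-ran the job",
    tags=["backup", "disk-space"],
)
print(example.to_json_line())
```

Writing one JSON object per line would in principle let us ship these records into the Logstash + Kibana setup we already have, but that is only one possible design.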
We have central monitoring systems (Check_MK, Logstash + Kibana, network flow analyzers...) and alerting in place, and I can generate availability reports directly from Check_MK, but that report is not accurate and it is the one we share with our customers. What I need here is something for our internal use.
I have researched this a little and have not found much. There seems to be no real standard or tool for this, so I need advice from someone who is already dealing with it: what tool should I use? Or, if there is no tool (we are quite capable of developing one ourselves), what is the best practice when it comes to logging things like this? What do you log?