
I am monitoring my server infrastructure using Icinga2 with some master/satellite configurations.

On Linux and Windows hosts I am monitoring the default system metrics like CPU usage and free system memory. On worker nodes, these values can often reach 100% CPU usage (or 5% free RAM), so I receive many CRITICAL alarms that are not really a cause for concern.

So, would it be better to:

  • simply avoid monitoring free memory and CPU usage
  • set critical alarms on 0% for free memory and 100% for CPU usage
  • continue to monitor them but without receiving any alerts
  • simply discard alerts
  • what else?
Sven
Mat
  • We don't like "let's discuss this topic" questions, but there is a definitive answer for the core of your question. – Sven Aug 08 '18 at 11:40

1 Answer


You need to adapt your monitoring thresholds to values that make sense for your specific environment.

As an example, on a computing node, we want to have a CPU utilization of 100%, so this is not a usable threshold for alerts. Having a load average that is permanently greater than the number of cores, or high I/O wait times, might be an indication of trouble though, so observe these values in that case and set alerts accordingly.
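To make the load-versus-cores idea concrete, here is a minimal sketch in Python. It is not an Icinga2 plugin and not the only way to do this. The function name `classify_load` and the factors 1.5 and 2.0 are assumptions chosen for illustration, and the 15-minute load average stands in for "permanently" high load. Pick values that match your own nodes.

```python
import os


def classify_load(load_avg, cores, warn_factor=1.5, crit_factor=2.0):
    """Classify a load average relative to the number of cores.

    warn_factor and crit_factor are illustrative assumptions:
    load per core at or above them yields WARNING or CRITICAL.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    per_core = load_avg / cores
    if per_core >= crit_factor:
        return "CRITICAL"
    if per_core >= warn_factor:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    _, _, load15 = os.getloadavg()
    cores = os.cpu_count() or 1
    status = classify_load(load15, cores)
    print(f"{status} - load15={load15:.2f} cores={cores}")
```

With these assumed factors, a fully busy 8-core node with a load of 8 stays OK, and an alert only fires once work is queueing up well beyond what the cores can handle.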

That aside: if you don't use a value as an alert threshold, you don't need to monitor it, though you might do so anyway to keep usage statistics if you need those. Again, it depends on your environment.

Oh, and never have alerts that you discard. This leads to alert fatigue, and at some point you might ignore an important alert because it drowns in all that noise. If you would not act upon an alert, remove it.

Sven