Does it make sense to monitor free memory and CPU usage on servers?

Question

I am monitoring my server infrastructure using Icinga2 with some master/satellite configurations.

On Linux and Windows hosts I am monitoring the default system metrics like CPU usage and free system memory. On worker nodes, these values can often reach 100% CPU (or 5% free RAM), so I receive many CRITICAL alarms that do not indicate a real problem.

So, would it be better to:

simply avoid monitoring free memory and CPU usage
set critical alarms on 0% for free memory and 100% for CPU usage
continue to monitor them but without receiving any alerts
simply discard alerts
what else?

We don't like "let's discuss this topic" questions, but there is a definitive answer for the core of your question. — Sven, Aug 08 '18 at 11:40

Sven · Accepted Answer · 2018-08-08T11:48:11.503

You need to adapt your monitoring thresholds to values that make sense for your specific environment.

As an example, on a computing node, we want to have a CPU utilization of 100%, so this is not a usable threshold for alerts. A load average that is permanently greater than the number of cores, or high I/O wait times, might be an indication of trouble though, so observe these values in that case and set alerts accordingly.
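To illustrate the load-average idea, here is a minimal Python sketch of such a check. It is only an assumption of how one might do it, not an Icinga2 feature: the function names `load_status` and `read_load_and_cores` and the ratios 1.0 and 2.0 are made up for this example and need tuning per environment. The 15-minute load average stands in for "permanently" high load, and the return values follow the usual monitoring plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import os

# Conventional monitoring plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def load_status(load_avg, cores, warn_ratio=1.0, crit_ratio=2.0):
    """Classify a load average relative to the number of CPU cores.

    warn_ratio and crit_ratio are illustrative defaults, not recommendations.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    per_core = load_avg / cores
    if per_core > crit_ratio:
        return CRITICAL
    if per_core > warn_ratio:
        return WARNING
    return OK


def read_load_and_cores():
    """Return (15-minute load average, core count) for the local host.

    os.getloadavg() is only available on Unix-like systems.
    """
    _, _, load15 = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return load15, cores
```

A wrapper script could call `load_status(*read_load_and_cores())` and exit with the returned code. With these assumed ratios, a fully busy 8-core node with a load of 8.0 stays OK, while a sustained load of 12.0 raises a WARNING.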

That aside: if you don't use a value as an alert threshold, you don't need to monitor it, but you might do so anyway to keep usage statistics if you need those. Again, it depends on your environment.

Oh, and never have alerts that you discard. This leads to alert fatigue, and at some point you might ignore an important alert because it drowns in all that noise. If you would not act upon an alert, remove it.
