3

I use Zabbix to monitor about 10 Linux servers, each with 4 CPU cores.
Lately I have been receiving far too many false alarms from the "Processor load is too high" trigger.
The trigger expression was:

{Template OS Linux:system.cpu.load[percpu,avg1].avg(5m)}>5 

which is the default.

I then raised the threshold from 5 to 12 to get fewer alarms, but I doubt this is the best way to deal with it. So I did some Googling and constructed a new trigger:

{Template OS Linux:system.cpu.util[,user].max(5m)}>75

I'd like to ask the community:

  1. Will the new expression reflect real CPU overload better than the original one?
  2. Would you do it differently, or in a better or more optimized way?
  3. How would you compose an expression that does the following?
    The trigger fires if:

    • the 5-minute average number of processes waiting in the per-CPU queue is more than 3
      AND
    • the maximum CPU utilization during the last 5 minutes is higher than 75%

I followed the examples in an article and tried:

({Template OS Linux:system.cpu.load[percpu,avg1].avg(5m)}>3
&
{Template OS Linux:system.cpu.util[,user].max(5m)}>75)

but it failed.
The Zabbix server returned this error:
Incorrect trigger expression. Check expression part starting from " & {Template OS Linux:system.cpu.util[,user].max(5m)}>75)".
Since I'm not a Zabbix expert (yet), any comments will be greatly appreciated. Thanks.

HopelessN00b
Reb

2 Answers

5

Why is "Processor load is too high" a false alarm in your case? To me it is a real symptom: the CPU is saturated.

IMHO: use only

{Template OS Linux:system.cpu.load[percpu,avg1].avg(5m)}>5 

The threshold, however, depends on your server and on what it is doing and how. A value above 5 still looks suspicious to me. For example, CPU usage can be low while CPU load is high; that can be a symptom of slow disk I/O (you will need to check metrics such as CPU iowait, disk queue length, ...). Your new combined trigger expression doesn't catch this case.
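
To illustrate the point about the combined trigger missing an I/O-bound server, here is a minimal Python sketch. It only mimics the trigger logic on made-up samples; the function names and the sample values are hypothetical and are not anything Zabbix provides.

```python
from statistics import mean

def load_trigger(load_samples, threshold=5.0):
    """Mimics avg(5m) > threshold on per-CPU load samples."""
    return mean(load_samples) > threshold

def combined_trigger(load_samples, util_samples,
                     load_threshold=3.0, util_threshold=75.0):
    """Mimics: avg(5m) of load > 3 AND max(5m) of user CPU util > 75."""
    return (mean(load_samples) > load_threshold
            and max(util_samples) > util_threshold)

# Hypothetical I/O-bound server: many processes queued, CPU mostly idle.
io_bound_load = [6.2, 6.8, 7.1, 6.5, 6.9]       # per-CPU load, one sample per minute
io_bound_util = [12.0, 15.0, 10.0, 14.0, 11.0]  # user CPU utilization in percent

print(load_trigger(io_bound_load))                     # load-only trigger fires
print(combined_trigger(io_bound_load, io_bound_util))  # combined trigger stays silent
```

With these samples the load-only trigger fires while the combined one does not, because the utilization half of the condition is never true.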

I recommend this article on utilization and saturation by a Senior Performance Architect at Netflix: http://www.brendangregg.com/usemethod.html

Jan Garaj
3

I would suggest something like this:

{Template OS Linux:system.cpu.load[percpu,avg15].avg(15m)}>1.8

The idea is to make your alarms respond more slowly rather than to raise the threshold. Often a burst of activity that clears after 5 or 10 minutes isn't much of a problem and might be perfectly normal, depending on what the server is doing. If heavy load persists for a significant length of time, however, that's when you want to know about it. Tweak the 1.8 threshold up or down a bit depending on your typical workload.
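
A rough Python sketch of why a longer averaging window damps short bursts. It is a simplification: it averages raw one-minute samples rather than the kernel's avg15 value, and the series, the baseline of 0.5 and the burst level of 4.0 are made up for illustration.

```python
from statistics import mean

def fires(samples, window, threshold):
    """True if any trailing window of `window` samples averages above threshold."""
    return any(
        mean(samples[end - window:end]) > threshold
        for end in range(window, len(samples) + 1)
    )

baseline = 0.5
# Hypothetical per-CPU load, one sample per minute.
burst = [baseline] * 20 + [4.0] * 5 + [baseline] * 20  # 5-minute burst
sustained = [baseline] * 20 + [4.0] * 25               # load that stays high

threshold = 1.8
print(fires(burst, 5, threshold))       # short window reacts to the burst
print(fires(burst, 15, threshold))      # long window averages the burst away
print(fires(sustained, 15, threshold))  # long window still catches sustained load
```

With these numbers the 15-minute window peaks at about 1.67 during the burst, just under 1.8. A longer or heavier burst would still cross it, which is why the threshold needs tuning to your workload.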

As for your expression:

{Template OS Linux:system.cpu.util[,user].max(5m)}>75

I would not recommend using the max() function in this context because it is sensitive to even a momentary burst of high activity. If that's really what you want, fine, but then don't complain about getting many alerts.
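
A small Python sketch of the difference, using made-up utilization samples for one 5-minute window (the values are hypothetical):

```python
from statistics import mean

# Hypothetical user CPU utilization in percent, one sample per minute,
# with a single momentary spike.
util = [20.0, 22.0, 95.0, 21.0, 19.0]
threshold = 75.0

max_fires = max(util) > threshold   # one spike is enough to trip max()
avg_fires = mean(util) > threshold  # the average stays well below 75

print(max_fires, avg_fires)
```

Here max() trips on the single 95% sample, while the average over the same window is 35.4% and stays quiet.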

Finally, yes, you can use boolean expressions, and there's a documentation page to help you. Check this out:

https://www.zabbix.com/documentation/3.2/manual/config/triggers/expression#operators
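
As far as I know, from Zabbix 2.4 onwards the logical operators are written as the words `and` and `or` rather than `&` and `|`, which would explain the error at " & ". Assuming that applies to your version, a tiny Python helper (hypothetical, not part of Zabbix) can assemble the combined expression from the two conditions in the question:

```python
def combine(conditions, operator="and"):
    """Join trigger conditions with a word operator (assumed Zabbix >= 2.4 syntax)."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    return f" {operator} ".join(conditions)

# The two conditions taken from the question.
load_cond = "{Template OS Linux:system.cpu.load[percpu,avg1].avg(5m)}>3"
util_cond = "{Template OS Linux:system.cpu.util[,user].max(5m)}>75"

expression = combine([load_cond, util_cond])
print(expression)
```

Paste the printed expression into the trigger form and check it against the documentation page for your own Zabbix version before relying on it.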

Tel