sar -u 1 | awk '{print $9}'

This gives me the "CPU Idle" value every second. I'd like to get an email when the value is "0" 10 times in a row.

What would be the appropriate way to do it?

I found a preliminary solution:

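sar -u 1 | awk '$9 ~ /^[0-9]/ {   # only lines whose 9th field is a number
    if (int($9) == 0) {
        i = i + 1
        print i, $9
    } else {
        i = 0                     # a non-zero idle value breaks the streak
    }
    if (i >= 10) print "sending email"
}'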

but in the last line, where I print "sending email", I can't put a call to mutt like this:

sar -u 1 | awk '$9 ~ /^[0-9]/ {
    if (int($9) == 0) {
        i = i + 1
        print i, $9
    } else {
        i = 0
    }
    if (i >= 10) mutt -s "VPNC Problem" test@test.com < /home/semenov/strace.output
}'

The problem is that it reports a syntax error at the mutt command call. Any ideas?

  • Monitoring anything every second is borderline abusive unless you're specifically debugging a problem (in which case you really want the whole record over time anyway) -- How would *you* feel if your boss was tapping you on the shoulder every second asking how you were doing? Why aggravate your poor server this way? :-) – voretaq7 Sep 27 '12 at 18:57
  • Yes, I'm debugging a vpnc process that eats 100% CPU time: http://serverfault.com/questions/432428/vpnc-eats-100-cpu-and-strange-output-by-strace I put debug and strace on the process and simply want to catch the moment it first happens. Otherwise I miss the high CPU utilization and the server suffers from it. It's not about micromanagement :) - just about helping the SERVER breathe :) – DmitrySemenov Sep 27 '12 at 19:03
  • In that case email isn't likely to help you much (by the time it gets sent out and your client notifies you, the spike is probably gone); you're probably better off running a one-shot `sar` every second and logging the time / value. If you still want to do it via email you can wrap my template below in a `while true; do . . . sleep 1 ; done` and run `sar` as a one-shot each time to get the value. **Beware of inbox overload**: It will kick out a mail any time the condition is met. You could quickly regret asking for such behavior :-) – voretaq7 Sep 27 '12 at 19:08
  • The problem with that process is that once it hits 100% CPU, it stays at 100% CPU for hours, and only a restart helps. I want to fix the root of the problem; that's why I want a notification when CPU idle is 0 for 10 seconds in a row. – DmitrySemenov Sep 27 '12 at 20:36
  • In that case, modify the condition in the script as needed (there are many ways to accomplish that goal. Server Fault is not a "write my scripts for me" site -- it is assumed you can generalize from an example. "I can has teh c0dez?" type questions are a poor fit here.) – voretaq7 Sep 27 '12 at 20:38

The appropriate way to do it is to NOT do it.

CPU Utilization (either %used or %idle) is a bogus value to monitor - it can (and SHOULD) be 100% at various times during normal operation. Do you really want a bunch of alerts because you happened to get 5-10 web requests at the same time your monitoring system checked CPU utilization? I'm betting the answer is no.

Instead you should monitor Load Average (reported by uptime among other tools), which is a measure of the number of processes which want to run right now (the length of RunQ in OS scheduling terms).
The value is usually reported as three values, 1-minute load average ("now"), 5-minute load average, and 15-minute load average.
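
As a rough sketch of reading those three values in a script: the helper below (the name `load_averages` is made up for this example) pulls them out of one line of `uptime` output. It assumes the line ends in `load average: a, b, c` (typical on Linux) or `load averages: a b c` (typical on BSD/macOS), so adjust the pattern if your `uptime` prints something else.

```shell
#!/bin/sh
# load_averages: read one line of `uptime` output on stdin and print the
# 1-, 5- and 15-minute load averages separated by single spaces.
# (Hypothetical helper name; the two input formats handled are assumptions.)
load_averages() {
    sed -E 's/.*load averages?: *//; s/,//g; s/  */ /g'
}

# Usage: force the C locale so decimals are printed with "." rather than ","
# LC_ALL=C uptime | load_averages
```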

Load averages below 1 indicate an "unloaded" system (lots of free CPU time, no programs waiting around to execute).
High load averages ("high" being relative to the number of CPUs you have and your system's interactive performance under load) are a cause for concern, and should be investigated.
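
Since "high" is relative to the number of CPUs, one way to make the number comparable between machines is to divide it by the CPU count. This is only a sketch: `load_per_cpu` is a name invented here, and `getconf _NPROCESSORS_ONLN` is a common but not universal way to get the CPU count.

```shell
#!/bin/sh
# load_per_cpu LOAD NCPU: print LOAD divided by NCPU with two decimals.
# awk does the arithmetic because plain sh has no floating-point math.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# Usage (assumes getconf supports _NPROCESSORS_ONLN on your system):
# load_per_cpu 6.00 "$(getconf _NPROCESSORS_ONLN)"
```

A result well above 1 means more processes want the CPU than there are CPUs to run them.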

I typically use 10 as my threshold for load average alarms -- a value high enough that you shouldn't typically see it in production, but low enough that you should have time to respond to the situation once the alarm trips.

The script to monitor in either case is trivial:

# Get your value and stuff it into $value
# Pick an appropriate threshold and stuff it into $threshold
# Note: [ -gt ] compares integers only, so truncate a fractional value first
if [ "$value" -gt "$threshold" ]; then  # (-gt or -lt as appropriate)
    echo "$(hostname) needs attention!" | \
         mail -s "$(hostname) monitoring alert" user@host
fi

The getting-and-stuffing part is left as an exercise for the reader.
If you really want to Do It Right you should investigate some monitoring systems and SNMP...
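
For fractional values such as load averages, the comparison itself needs a little care, because the shell's `-gt` only understands integers. A minimal sketch, assuming `awk` is available; the function name `exceeds` is made up for this example:

```shell
#!/bin/sh
# exceeds VALUE THRESHOLD: succeed (exit status 0) when VALUE > THRESHOLD.
# awk does the comparison because [ -gt ] rejects values like 10.5.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Usage, with $value holding the 1-minute load average:
# if exceeds "$value" 10; then
#     echo "$(hostname) needs attention!" | mail -s "$(hostname) monitoring alert" user@host
# fi
```

The `+ 0` forces a numeric comparison, so `9.5` is treated as less than `10` rather than being compared as text.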

  • It's worth noting that "load average" has different meanings across different systems. What may be true for the Linux kernel is not true for OpenBSD. Make sure to understand what you are monitoring. – Alex Holst Sep 27 '12 at 19:00
  • @AlexHolst true - it's not directly comparable across kernels/kernel families. The biggest difference you often see is that some kernels count processes *currently on the CPU* and some don't - so a 4-CPU system with 8 processes that want the CPU may report a load average of 8 (4 running, 4 in the queue) in the former case, or 4 (just the ones waiting in the queue) in the latter. Also the impact of high load varies depending on how CPU scheduling is handled: Some systems can run with a load average over 100, others can't handle more than about 20. – voretaq7 Sep 27 '12 at 19:03

OK, the correct command is this:

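sar -u 1 | awk '$9 ~ /^[0-9]/ {   # only lines whose 9th field is a number
    if (int($9) == 0) {
        i = i + 1
        print i, $9
    } else {
        i = 0                     # a non-zero idle value breaks the streak
    }
    if (i >= 10) {
        print "Sending email";
        cmd = "mutt -s \"test\" email@domain.com < /home/semenov/strace.output";
        system(cmd);              # awk cannot run mutt inline; system() hands cmd to sh
        i = 0                     # start over so each streak sends one mail
    }
}'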
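
To check the "10 in a row" counting without waiting for a real CPU spike, the streak logic can be pulled out and fed canned values. This is a sketch only; `zero_streak` is a hypothetical helper, and it expects one %idle value per line on stdin.

```shell
#!/bin/sh
# zero_streak N: read one %idle value per line on stdin and print
# "ALERT <line number>" each time N consecutive values truncate to 0.
zero_streak() {
    awk -v n="$1" '
        int($1) == 0 {
            run++
            if (run == n) { print "ALERT", NR; run = 0 }
            next
        }
        { run = 0 }
    '
}

# Dry run with canned values (three zeros in a row trigger the alert):
# printf "0.00\n0.25\n0.00\n" | zero_streak 3
```

Once the counting behaves as expected, the same function can sit at the end of the `sar` pipeline, with the alert line replaced by the `mutt` call.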
