I need to debug sudden load peaks automatically. We already monitor with Nagios-like check scripts, but the load peaks are rare and short.

I am looking for a daemon that checks the load every N seconds and, if there is trouble, reports something like the output of ps aux --forest (and iotop --batch).

Graphs created with e.g. munin don't help here, since I need to identify the processes which cause the load.

guettli

2 Answers

Among the many options for local process monitoring (choose your poison) is monit. I use something like this in /etc/monit.d/system.conf on CentOS machines:

check system localhost
    if loadavg (1min) > 6 then alert
    if loadavg (5min) > 6 then alert
    if memory usage > 90% then alert
    if cpu usage (user) > 90% then alert
    if cpu usage (system) > 75% then alert
    if cpu usage (wait) > 75% then alert

You might want to be more aggressive with the checks and have the daemon run them more often, maybe every 30 seconds, until you have found the problem. In that case, use an /etc/monit.conf something like this:

set daemon  30
set mailserver localhost
#set alert user@gmail.com but not on { instance }
set alert user@gmail.com
include /etc/monit.d/*
set httpd port 2812
        allow 127.0.0.1

If monit does not provide enough information in the default mail alert, you can have monit execute custom scripts on alert conditions, like so:

check system localhost
    if loadavg (1min) > 6 then exec "/bin/bash -c '/usr/bin/top -n1 -b  | /bin/mail -s top-output userXXX@gmail.com'"
    if loadavg (5min) > 6 then exec "/bin/bash -c '/usr/bin/top -n1 -b  | /bin/mail -s top-output userXXX@gmail.com'"
    if cpu usage (user) > 90%  then exec "/bin/bash -c '/usr/bin/top -n1 -b  | /bin/mail -s top-output userXXX@gmail.com'"

(This obviously relies on the mail command being set up, but you can mail to the local root account instead and just check it manually.)
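As a rough sketch of the custom-script idea, and of the kind of poller the question asks for, the following Python script checks the 1-minute load average and, when it is above a threshold, writes the output of ps aux --forest and top to a timestamped file. The threshold, interval, output directory, and all function names are assumptions chosen for illustration; none of them come from monit.

```python
#!/usr/bin/env python3
"""Poll the load average and save a process snapshot when it is too high."""
import os
import subprocess
import time
from datetime import datetime

# Assumed values for illustration; tune them for the machine in question.
THRESHOLD = 6.0                          # 1-minute load that counts as trouble
INTERVAL = 30                            # seconds between checks
OUTPUT_DIR = "/var/tmp/load-snapshots"   # hypothetical location
COMMANDS = [
    ["ps", "aux", "--forest"],
    ["top", "-b", "-n", "1"],
]


def run_command(argv):
    """Return the combined output of argv, or a note if it could not run."""
    try:
        result = subprocess.run(
            argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            universal_newlines=True, timeout=20,
        )
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "could not run {}: {}\n".format(" ".join(argv), exc)


def take_snapshot(load, output_dir=OUTPUT_DIR, commands=COMMANDS):
    """Write the output of every command to one timestamped file."""
    os.makedirs(output_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = os.path.join(output_dir, "load-{}.txt".format(stamp))
    with open(path, "w") as handle:
        handle.write("1-minute load average: {:.2f}\n".format(load))
        for argv in commands:
            handle.write("\n$ {}\n".format(" ".join(argv)))
            handle.write(run_command(argv))
    return path


def check_once(threshold=THRESHOLD, get_load=os.getloadavg, **snapshot_args):
    """Take a snapshot if the 1-minute load exceeds threshold."""
    load = get_load()[0]
    if load > threshold:
        return take_snapshot(load, **snapshot_args)
    return None


def watch(interval=INTERVAL, threshold=THRESHOLD):
    """Check forever; meant to run under a supervisor or in screen/tmux."""
    while True:
        path = check_once(threshold)
        if path:
            print("load above {}: wrote {}".format(threshold, path))
        time.sleep(interval)
```

Calling watch() at the bottom of the file, under an if __name__ == "__main__": guard, would turn this into a standalone poller. The iotop --batch output from the question could be collected the same way by appending something like ["iotop", "--batch", "-n", "1"] to COMMANDS, bearing in mind that iotop usually needs root.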

Tom
  • Thank you for your answer. I will try it soon and post my experience here. – guettli May 24 '12 at 08:20
  • I did a quick test, and it seems that the alert email does include the offending pid of the process, so I have updated the answer with a debugging example to include the top output. – Tom May 24 '12 at 08:59
perf is the way to go; it's commonly installed by default (the linux-tools package on Debian).

Use perf top to look at your problem interactively, then use perf stat -p PID to narrow it down to a single process. See the wiki to find out more: https://perf.wiki.kernel.org/index.php/Main_Page

Shadok
  • Thank you for the answer. I read the `perf` tutorial. It is a great tool for debugging on Linux. But I need an automated way to check for the load peaks. I can't inspect them live (by logging in via ssh and calling `perf`). Please correct me if I am wrong, but AFAIK `perf` is not made for running in daemon mode and watching all processes. – guettli May 24 '12 at 08:19