
I am having trouble configuring the monit daemon to wake up every few hours and resume monitoring processes that were put into the "Not Monitored" state.

PROBLEM: When monit unmonitors a process, its status changes to "not monitored" and the monit daemon will NEVER try to monitor that process again, even after the PID file is updated with the correct PID. Monitoring stays off for this process forever unless the monit daemon is awakened for it manually, as shown further below.

Can this re-awakening be configured per process in the monit config, at a set timeout interval, to avoid the pitfall of a process ending up in the "not monitored" state forever?

Something like: if 2 restarts within 3 cycles then timeout {X hours} monitor restart

Thank you.

I have the config below for an SNMP process.

# Check for cmaeventd process
check process cmaeventd with pidfile /var/run/cmaeventd.pid
group snmp-agents
start program = "/opt/hp/hp-snmp-agents/storage/etc/cmaeventd start"
stop program = "/opt/hp/hp-snmp-agents/storage/etc/cmaeventd stop"
if 2 restarts within 3 cycles then timeout

For some reason, if the PID file is NOT populated correctly (I am working on fixing that), monit keeps trying to restart the process using the empty PID file, logs the errors below, and finally unmonitors it once the restarts fail within 3 cycles, as configured.

log messages:

[PST Feb  3 11:43:23] error    : monit: Error reading pid from file '/var/run/cmaeventd.pid'
[PST Feb  3 11:43:24] error    : monit: Error reading pid from file '/var/run/cmaeventd.pid'

[PST Feb  3 11:45:25] error    : 'cmaeventd' service restarted 2 times within 2 cycles(s) - unmonitor

Monit status for that process after unmonitor:

Process 'cmaeventd'
  status                            not monitored
  monitoring status                 not monitored
  data collected                    Tue Feb  3 12:10:25 2015

Manually awakening the daemon for this process to start the monitoring again:

>monit monitor cmaeventd 

This awakens the monit daemon for this process. It starts reading the PID file again and, if that succeeds, resumes monitoring.
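As a stopgap, the manual step above could be scripted and run from cron. This is only a minimal sketch, not a monit feature: the function names `is_unmonitored` and `remonitor_if_needed` are hypothetical, and it assumes the `monit status` output looks like the blocks shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: re-enable monitoring for a service that monit has
# put into the "not monitored" state. Intended to be run from cron.

# Reads `monit status` output on stdin. Succeeds (exit 0) only if the block
# for process $1 has a "monitoring status ... not monitored" line.
# Assumes the status layout shown in this post.
is_unmonitored() {
    awk -v svc="$1" '
        # \047 is a single quote; track whether we are inside the wanted block
        /^Process / { in_block = ($0 == "Process \047" svc "\047") }
        in_block && /monitoring status/ && /not monitored/ { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Run the same command as the manual step, but only when monit gave up.
remonitor_if_needed() {
    if monit status | is_unmonitored "$1"; then
        monit monitor "$1"
    fi
}

# When called with a service name, act on it; otherwise only define functions.
if [ "$#" -gt 0 ]; then
    remonitor_if_needed "$1"
fi
```

Saved under a path of your choice (for example a hypothetical /usr/local/bin/monit-remonitor.sh), it could be scheduled with a cron entry such as `0 */3 * * * /usr/local/bin/monit-remonitor.sh cmaeventd` to retry every three hours. It does not fix the empty PID file, so monit will unmonitor the process again until that is resolved.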

Before awakening the monit daemon for this process:
---------------------------------------------------
logbash-3.1# ls -l /var/run/cmaeventd.pid
-rw-r--r-- 1 root root 1 Feb  3 00:00 /var/run/cmaeventd.pid
logbash-3.1# cat /var/run/cmaeventd.pid

logbash-3.1# ps -ef|grep cmaeventd |grep -v grep
root     13066     1  0 00:00 ?        00:00:00 cmaeventd -p 15 -l /var/log/hp-snmp-agents/cma.log
logbash-3.1# echo "13066" > /var/run/cmaeventd.pid
logbash-3.1# cat /var/run/cmaeventd.pid
13066

logbash-3.1# monit monitor cmaeventd

From log:

[PST Feb  3 12:20:54] info     : monitor service 'cmaeventd' on user request
[PST Feb  3 12:20:54] info     : monit daemon at 23515 awakened
[PST Feb  3 12:20:54] info     : Awakened by User defined signal 1
[PST Feb  3 12:20:54] info     : 'cmaeventd' monitor action done

Monit status:

Process 'cmaeventd'
  status                            initializing
  monitoring status                 initializing
  data collected                    Tue Feb  3 12:20:54 2015

After some time it changes to the following:

Process 'cmaeventd'
  status                            running
  monitoring status                 monitored
  pid                               13066
  parent pid                        1
  uptime                            12h 21m
  children                          0
  memory kilobytes                  2160
  memory kilobytes total            2160
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Tue Feb  3 12:21:54 2015
gowin09
  • I'd like to help with the Monit side of this, but as an HP ProLiant expert, I'd like to understand why you're using Monit to monitor the cmaeventd process. I wouldn't advise that here. Is there a problem with your HP agents? – ewwhite Feb 03 '15 at 20:52
  • We use SNMP agents to detect hardware issues on the server. Sometimes these processes terminate abruptly for some reason, which hurts our ability to proactively detect hardware failures on the servers. So we implemented monit to monitor them, keep them healthy and work around this issue. What option other than monit would you recommend? I would also appreciate help with the monit side, since implementing a system change might take time and a monit solution would be a quick fix. – gowin09 Feb 03 '15 at 21:08
  • You definitely should not be using Monit to manage the HP Management agents. Can you please give the following: Models/generations of servers, OS distribution and version, and versions of the HP Management Agents. Also please explain how you install the agents. – ewwhite Feb 03 '15 at 21:32
  • Thank you for your insight. Replacing monit is not a quick option right now, but I will try to gather the info you requested and look into a more robust solution based on your suggestion. Meanwhile, could you help with a monit config that solves the problem I raised? – gowin09 Feb 03 '15 at 22:09
  • I can't endorse using Monit for this. I'm extremely familiar with both Monit and HP ProLiant workflow. It would be better to determine why your agents are failing. If you can get me **any** version information, we can address the real issue. – ewwhite Feb 03 '15 at 22:20
  • Model: HP ProLiant DL380 G7 OS Version: Linux 2.6.18-194.3.1.el5 HP tools version: hp-health-8.7.0.22-11 – gowin09 Feb 03 '15 at 22:30

1 Answer


It's not necessary to monitor individual HP agents with Monit. Plus, they're all tied together with the wrapper service, hp-snmp-agents. Restarting one independently of the rest will have undesirable effects.

While it's possible to debug the HP agent logs, I think you may have an issue with your old kernel (looks like RHEL/CentOS 5.5) and possibly old HP management agents. The HP agents you should be using are available from HP's Software Delivery Repository (SDR).

For the ProLiant DL3xx G7 platform, you'll need the newest version of the following packages:

hp-snmp-agents, hpssa, hp-health, hp-smh-templates, hpsmh, hpssacli, hponcfg

ewwhite
  • Thanks for the response. I am aware that we are running a somewhat old version here, but keeping versions up to date is out of my hands. What undesirable effects are you referring to when starting the agents independently? When I checked, the wrapper does the same thing anyway, with no difference. – gowin09 Feb 03 '15 at 23:39
  • I wouldn't waste the time working on the *what-ifs* of the situation. We've already established that you're running a toxic combination of kernel, OS and HP agents. All you need to do is update to newer HP management agents. I'm sorry if that part is not within your control, but running updates is probably the best option. – ewwhite Feb 04 '15 at 03:57
  • Thank you for your suggestion, but it would be nice if you could provide any info on what the impacts of running the agents with this combination are, so that I can justify an upgrade of our fleet to the team that owns it. Is there a way I can prove that this particular release of hp-snmp-agents doesn't work well with the corresponding kernel and OS? We cannot do blind upgrades without finding the root cause of the issues on the existing systems, so any info you could provide in that regard would be helpful. Thank you. – gowin09 Feb 04 '15 at 04:40
  • Your OS/kernel is from 2010, and I suspect that your agents are also fairly old. I work extensively with HP servers and can confirm that the HP management agents are stable if you're using recent editions of the software. I'm not sure it's useful to debug your existing problems since we *know* there's newer software available. Your team can test newer HP software in a limited fashion and can make their own risk assessment. – ewwhite Feb 04 '15 at 04:51