OS: CentOS release 5.7 (Final)
Net-SNMP: net-snmp- (from RPM)

Periodically my NMS notifies me that SNMP has gone down on this machine. The service recovers on its own within 10 to 30 minutes. My NMS also pings the host and checks SSH, and those services are not affected during the SNMP outage.

The snmpd log file shows that the daemon is running and apparently receiving packets (either from local agents or from my NMS). However, attempting to snmpwalk locally or from the NMS system fails with a timeout.

I have 7 of these servers running a mixture of CentOS 5.7 and RHEL 5.7 with this specific version of Net-SNMP installed from RPM. None of them have this issue except this one. 5 of the machines (including the NMS system and this problem server) are in the same rack, connected through a single switch.

Restarting snmpd does not fix the issue; it clears up by itself eventually. Any suggestions on where I can begin diagnosing it? It's a closed subnet, so iptables is not used. The snmpd config is below:

# Following entries were added by HP Insight Management Agents at
#      Tue May 15 10:58:17 CLT 2012
dlmod cmaX /usr/lib64/libcmaX64.so
rwcommunity public
rocommunity public
rwcommunity 3adRabRu
rocommunity 3adRabRu
rwcommunity 3adRabRu
rocommunity 3adRabRu
trapcommunity callmetraps
trapsink callmetraps
trapsink callmetraps
syscontact Lukasz Piwowarek
syslocation Santiago, Chile
# ---------------------- END --------------------
agentAddress udp:161
com2sec rwlocal default public
com2sec rolocal default public
com2sec subnet  default 3adRabRu
group   rwv2c   v2c             rwlocal
group   rov2c   v2c             rolocal
group   rov2c   v2c             subnet
view    all     included        .1
access  rwv2c   ""      any             noauth          exact   all     all     none
access  rov2c   ""      any             noauth          exact   all     none    none
  • You want to grep for :161, where snmpd listens - not 162, which is where traps are sent. – rnxrx May 31 '12 at 21:13
  • Derp. Wow my brain is on the slideshow. Thanks for pointing that out. – Lukasz Jun 01 '12 at 16:37
  • Can you share which NMS you're using? – ewwhite Jun 01 '12 at 16:48
  • OpenNMS 1.10 - like I said - I cannot believe this is NMS related since other nodes do not have this issue despite nearly identical configuration (OS installed and setup at the same date). – Lukasz Jun 01 '12 at 16:54
  • Well, the monitoring system *does* matter. It can help us help you debug... See more in my answer below. – ewwhite Jun 01 '12 at 17:10

3 Answers


There are a few issues to address on this one.

Looking at your config, I see OpenNMS as the monitoring solution, HP ProLiant server hardware, possible package version and driver issues, and a couple of tweaks you could possibly make to your snmpd options.

Are you on the most recent version of OpenNMS? The current revision is 1.10.3. Is the machine you're polling the NMS system itself, or an unrelated host? Was this a problem with an older version of OpenNMS, or is this a new installation?

I also see a module for the HP ProLiant Management Agents loaded in the first line of your snmpd.conf. That feeds the ProLiant Support Pack and HP health agents. Is this the only HP server you're monitoring? To test the HP SNMP config, can you access the System Management Homepage at https://server.ip:2381 ? Do the system sensors (temperature, storage, iLO) show up properly? If they don't, there's a problem with your SNMP setup.

On the OpenNMS side, there are incredibly flexible logging options available for the poller. We can help you get the info you need, but I don't think this is a general OpenNMS problem if it's only affecting one node. You could remove the node from the database and rediscover it to test this theory.

For the host in question, you may want to edit /etc/sysconfig/snmpd.options to reduce log verbosity in case that's an issue.
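A minimal sketch of what that edit might look like, assuming the stock Red Hat layout where the snmpd init script reads an `OPTIONS` variable from this file. The `-LS0-4d` form limits syslog output to priorities 0 through 4 on the daemon facility. The exact spacing accepted by `-LS` can differ between Net-SNMP releases, so check `man snmpcmd` on the host first. The pid-file path is the usual default and is also an assumption here.

```shell
# /etc/sysconfig/snmpd.options (assumed layout; sourced by the snmpd init script)
# Log only priorities 0-4 (emergency through warning) to syslog's daemon facility,
# and send file logging to /dev/null.
OPTIONS="-LS0-4d -Lf /dev/null -p /var/run/snmpd.pid"
```

Restart snmpd afterwards (`service snmpd restart`) for the change to take effect.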

My guess is that it's an OpenNMS polling/DB issue or that it's the interaction of the HP agents and snmp on the single problem system.

  • If you don't care about the node's historical data, I'd recommend just deleting it from OpenNMS, waiting 10 minutes and rescanning to discover it again. It'll get a new node ID at that point, but it's the easiest thing to check given what you described. – ewwhite Jun 01 '12 at 19:23
  • 1. OpenNMS is currently monitoring a mixture of 9 HP ProLiant DL360 G7 servers, 4 switches and 4 network tap systems. All of these are monitored via SNMP, ping, HTTP checks, SSH checks, etc. All of them work fine except this one server. 2. When the issue is reported by OpenNMS I cannot snmpwalk locally on the affected server. So this is NOT an OpenNMS issue. 3. If there is no problem I can do snmpwalks from any permitted machine to get the temperature values / statuses etc. from the HP SNMP Agents without issues. When there is a problem I cannot snmpwalk anything locally or remotely. – Lukasz Jun 01 '12 at 19:31
  • 4. I don't have any way to access the SMH because all systems are remote to me (they are on another continent) and do not have any UI installed. I will see if I can tunnel to that port somehow and do a test. – Lukasz Jun 01 '12 at 19:32

Have you tried increasing the SNMP timeout and retries on the NMS? It could be that your server is not answering fast enough sometimes or that your network loses packets.
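Before changing the NMS, you could check by hand whether a longer timeout makes any difference. This is only a sketch: `walk_system` is a hypothetical helper name, and the community and host in the usage comment are placeholders taken from the posted config. The `-t` (timeout in seconds) and `-r` (retries) flags are standard Net-SNMP command-line options.

```shell
# Hypothetical helper: walk the system group with an explicit timeout and retry count.
#   $1 = community, $2 = host, $3 = timeout in seconds, $4 = retries
walk_system() {
    snmpwalk -v2c -c "$1" -t "$3" -r "$4" "$2" system
}

# Example, run on the affected server during an outage:
#   time walk_system public localhost 10 1
```

If the walk eventually answers with a generous timeout, the agent is slow rather than dead, which points at load on the host and not at the network.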

And, as @rnxrx already pointed out, you need to look for port 161 to see if snmpd is listening.
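One way to run that check, sketched under the assumption that `netstat` is installed and prints the usual `Proto Recv-Q Send-Q Local-Address ...` columns. `snmp_sockets` is a hypothetical filter name, not a standard tool.

```shell
# Keep only UDP sockets bound to port 161 from `netstat -lnu` style output.
# Column 1 is the protocol, column 4 is the local address (for example 0.0.0.0:161).
snmp_sockets() {
    awk '$1 ~ /^udp/ && $4 ~ /:161$/'
}

# Only run the live check when netstat is actually available.
if command -v netstat >/dev/null 2>&1; then
    netstat -lnu | snmp_sockets
fi
```

If a line shows up, snmpd has the socket open. A Recv-Q value on that line that keeps growing during an outage would suggest requests are arriving but the daemon is not reading them.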

  • My NMS checks the system every 30 seconds: it pings the host, checks that SSH is up on port 22, and polls the system group from Net-SNMP for CPU/memory/filesystem usage and such. Only SNMP times out, and it stays down for 10 to 30 minutes at a time. Second, these two machines are in the same rack (along with 3 other servers) and are connected through a single switch, so you will agree that the chance of this being a networking issue is slim. All 5 systems (plus 2 in another rack) are running a mixture of CentOS/RHEL 5.7 with the above-mentioned Net-SNMP from RPM. – Lukasz Jun 01 '12 at 16:39

I have found the cause but no solution. It seems MySQL is making the entire system unresponsive. How it manages to affect everything from SNMP through SSH and overall system responsiveness (commands that should be instant take upwards of 30 seconds to respond) is beyond me. This is a dual-CPU machine with 96GB of RAM that is used in 4-hour bursts of extremely heavy data correlation. After we run our program (which does several million inserts into MySQL), the whole system just crawls even though it's near idle. Restarting MySQL clears the issue right away.

  • Look at your [I/O elevator settings](http://serverfault.com/questions/373563/linux-real-world-hardware-raid-controller-tuning-scsi-and-cciss) for the system (should be `noop` or `deadline`) and also check the values of your [virtual memory subsystem](http://www.cyberciti.biz/faq/linux-kernel-tuning-virtual-memory-subsystem/), particularly the vm.dirty_background_ratio value. – ewwhite Jun 09 '12 at 17:56
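A quick way to read those settings, as a sketch that assumes the usual sysfs and procfs paths on a Linux host. `active_scheduler` is a hypothetical helper that extracts the bracketed (currently selected) entry from a scheduler line such as `noop anticipatory deadline [cfq]`.

```shell
# Print the scheduler name shown in [brackets], i.e. the one currently in use.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Show the active elevator for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_scheduler < "$f")"
done

# Show the dirty-page thresholds of the virtual memory subsystem.
for key in dirty_background_ratio dirty_ratio; do
    if [ -r "/proc/sys/vm/$key" ]; then
        printf 'vm.%s = %s\n' "$key" "$(cat "/proc/sys/vm/$key")"
    fi
done
```

On a box with 96GB of RAM, percentage-based dirty thresholds translate into several gigabytes of buffered writes, so these values are worth recording before and after a heavy insert run.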