My check_mk server connects to several RHEL nodes which have check_mk_agent (version 1.2.4p3) installed. A group of these nodes belongs to a Pacemaker cluster.
The check_mk agent is configured with the defaults: an xinetd service is bound to port 6556/TCP:
service check_mk
{
        type           = UNLISTED
        port           = 6556
        socket_type    = stream
        protocol       = tcp
        wait           = no
        user           = root
        server         = /usr/bin/check_mk_agent

        # If you use fully redundant monitoring and poll the client
        # from more then one monitoring servers in parallel you might
        # want to use the agent cache wrapper:
        #server        = /usr/bin/check_mk_caching_agent

        # configure the IP address(es) of your Nagios server here:
        #only_from     = 127.0.0.1 10.0.20.1 10.0.20.2

        # Don't be too verbose. Don't log every check. This might be
        # commented out for debugging. If this option is commented out
        # the default options will be used for this service.
        log_on_success =

        disable        = no
}
One of these cluster nodes has problems when a socket is opened to port 6556/TCP, because the /usr/bin/check_mk_agent script hangs at the cluster detection stage:
crm_mon -1 -r | grep ···
This makes the check_mk server report issues on that node.
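While a connection to 6556 is hung, the stuck process can be inspected from another shell. This is only a rough sketch, assuming a Linux /proc filesystem and that pgrep and strace are available (the pgrep pattern and the use of the newest matching PID are my assumptions):

```shell
# Find the crm_mon that xinetd's agent run spawned and see where it blocks.
pid=$(pgrep -n crm_mon)           # hypothetical: newest matching PID
head -3 "/proc/$pid/status"       # process name and state (S, D, ...)
ls -l "/proc/$pid/fd"             # file descriptors inherited from xinetd
strace -p "$pid"                  # attach and watch the blocking syscall
```

Comparing the fd list of the hung run with a manual run may show what differs between the two environments.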
When I comment out the cluster detection commands in the check_mk_agent script, it works fine:
# Heartbeat monitoring
# Different handling for heartbeat clusters with and without CRM
# for the resource state
###if [ -S /var/run/heartbeat/crm/cib_ro -o -S /var/run/crm/cib_ro ] || pgrep crmd > /dev/null 2>&1; then
### echo '<<<heartbeat_crm>>>'
### crm_mon -1 -r | grep -v ^$ | sed 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g'
###fi
###if type cl_status > /dev/null 2>&1; then
### echo '<<<heartbeat_rscstatus>>>'
### cl_status rscstatus
###
### echo '<<<heartbeat_nodes>>>'
### for NODE in $(cl_status listnodes); do
### if [ $NODE != $(echo $HOSTNAME | tr 'A-Z' 'a-z') ]; then
### STATUS=$(cl_status nodestatus $NODE)
### echo -n "$NODE $STATUS"
### for LINK in $(cl_status listhblinks $NODE 2>/dev/null); do
### echo -n " $LINK $(cl_status hblinkstatus $NODE $LINK)"
### done
### echo
### fi
### done
###fi
This problem does not occur on the remaining cluster nodes.
I'm sure it is not a network problem, because the same behaviour occurs when the connection is opened from inside the faulty node itself:
telnet 127.0.0.1 6556
The strangest thing is that I run crm_mon -1 -r manually many times a day and it never hangs.
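To mimic the environment xinetd provides (no controlling terminal, stdin not a tty), something like the following can be used; the `tty` probe is only a stand-in of my own, and crm_mon -1 -r would be substituted on the affected node:

```shell
# setsid drops the controlling terminal; stdin comes from /dev/null,
# so the child runs detached, much like a command under xinetd.
setsid -w sh -c 'tty' < /dev/null   # prints: not a tty
```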
What can make crm_mon -1 -r hang on just one node when it is executed with no terminal attached?
Thanks in advance.
Update 1
I have created a new xinetd service similar to the check_mk one, changing only the name, the port number and the server. The server script contains only these lines:
#!/bin/bash
unset LANG
export LC_ALL=C
date
/usr/sbin/crm_mon -1 -r -N
#/usr/sbin/crm_resource -L
date
and it hangs too. I have even tried crm_resource -L, whose output is the same, but it hangs as well:
# telnet 127.0.0.1 6557
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Fri Jul 14 08:37:36 CEST 2017
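So the first date prints, and then the cluster query blocks. As a stopgap, the query in the test script could be bounded so a hang cannot block the whole agent run; this is just a sketch of my own using coreutils timeout, not part of the original setup:

```shell
# Kill the cluster query if it blocks for more than 10 seconds.
# coreutils timeout returns the command's own status, or 124 on timeout.
timeout 10 /usr/sbin/crm_mon -1 -r -N
echo "exit status: $?"
```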
Update 2
SELinux is disabled.