My check_mk server connects to several RHEL nodes which have check_mk_agent (version 1.2.4p3) installed. A group of these nodes belongs to a Pacemaker cluster.
The check_mk agent is configured with the defaults: an xinetd service is bound to port 6556/TCP:
service check_mk
{
        type           = UNLISTED
        port           = 6556
        socket_type    = stream
        protocol       = tcp
        wait           = no
        user           = root
        server         = /usr/bin/check_mk_agent

        # If you use fully redundant monitoring and poll the client
        # from more then one monitoring servers in parallel you might
        # want to use the agent cache wrapper:
        #server        = /usr/bin/check_mk_caching_agent

        # configure the IP address(es) of your Nagios server here:
        #only_from     = 127.0.0.1 10.0.20.1 10.0.20.2

        # Don't be too verbose. Don't log every check. This might be
        # commented out for debugging. If this option is commented out
        # the default options will be used for this service.
        log_on_success =

        disable        = no
}
One of these cluster nodes has problems when a socket is opened to port 6556/TCP, because the /usr/bin/check_mk_agent script hangs at the cluster detection stage:
crm_mon -1 -r | grep ···
This makes the check_mk server report issues on that node.
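While a connection to 6556 is hung, the stuck process can be inspected from another shell. This is only a rough sketch, assuming a Linux /proc filesystem and that pgrep and strace are available (the pgrep pattern and the use of the newest matching PID are my assumptions):

```shell
# Find the crm_mon that xinetd's agent run spawned and see where it blocks.
pid=$(pgrep -n crm_mon)           # hypothetical: newest matching PID
head -3 "/proc/$pid/status"       # process name and state (S, D, ...)
ls -l "/proc/$pid/fd"             # file descriptors inherited from xinetd
strace -p "$pid"                  # attach and watch the blocking syscall
```

Comparing the fd list of the hung run with a manual run may show what differs between the two environments.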
When I comment out the cluster detection commands in the check_mk_agent script, it works fine:
# Heartbeat monitoring
# Different handling for heartbeat clusters with and without CRM
# for the resource state
###if [ -S /var/run/heartbeat/crm/cib_ro -o -S /var/run/crm/cib_ro ] || pgrep crmd > /dev/null 2>&1; then
### echo '<<<heartbeat_crm>>>'
### crm_mon -1 -r | grep -v ^$ | sed 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g'
###fi
###if type cl_status > /dev/null 2>&1; then
### echo '<<<heartbeat_rscstatus>>>'
### cl_status rscstatus
###
### echo '<<<heartbeat_nodes>>>'
### for NODE in $(cl_status listnodes); do
### if [ $NODE != $(echo $HOSTNAME | tr 'A-Z' 'a-z') ]; then
### STATUS=$(cl_status nodestatus $NODE)
### echo -n "$NODE $STATUS"
### for LINK in $(cl_status listhblinks $NODE 2>/dev/null); do
### echo -n " $LINK $(cl_status hblinkstatus $NODE $LINK)"
### done
### echo
### fi
### done
###fi
This problem does not occur on the remaining cluster nodes.
I'm sure it is not a network problem, because the same behaviour occurs when the connection is opened from inside the faulty node itself:
telnet 127.0.0.1 6556
The strangest thing is that I run crm_mon -1 -r manually many times a day and it never hangs.
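To mimic the environment xinetd provides (no controlling terminal, stdin not a tty), something like the following can be used; the `tty` probe is only a stand-in of my own, and crm_mon -1 -r would be substituted on the affected node:

```shell
# setsid drops the controlling terminal; stdin comes from /dev/null,
# so the child runs detached, much like a command under xinetd.
setsid -w sh -c 'tty' < /dev/null   # prints: not a tty
```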
What can make crm_mon -1 -r hang on just one node when it is executed with no terminal attached?
Thanks in advance.
Update 1
I have created a new xinetd service similar to the check_mk one, changing only the name, the port number and the server. The server script contains only these lines:
#!/bin/bash
unset LANG
export LC_ALL=C
date
/usr/sbin/crm_mon -1 -r -N
#/usr/sbin/crm_resource -L
date
and it hangs too. I have even tried crm_resource -L, whose output is the same, but it hangs as well:
# telnet 127.0.0.1 6557
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Fri Jul 14 08:37:36 CEST 2017
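So the first date prints, and then the cluster query blocks. As a stopgap, the query in the test script could be bounded so a hang cannot block the whole agent run; this is just a sketch of my own using coreutils timeout, not part of the original setup:

```shell
# Kill the cluster query if it blocks for more than 10 seconds.
# coreutils timeout returns the command's own status, or 124 on timeout.
timeout 10 /usr/sbin/crm_mon -1 -r -N
echo "exit status: $?"
```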
Update 2
SELinux is disabled.