Curl Failure when receiving data from peer using Percona XtraDB Cluster Check

Question

I'm having a similar issue to the one mentioned here: xinetd 'connection reset by peer'

I've setup percona-clustercheck (which comes with Percona's XtraDB Cluster packages) with xinetd and I'm getting an error when trying to curl clustercheck REMOTELY. (Note it works just fine locally.)

Here's what it looks like LOCALLY:

[root@db1 tmp]# for i in {1..1000}; do curl http://db1.ourdomain.local:9200; sleep 2; date;  done Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:16 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:18 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:20 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:22 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:24 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:26 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:28 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:30 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:32 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:30:34 EDT 2013
Percona XtraDB Cluster Node is synced.

and REMOTELY:

[root@db2 ~]# for i in {1..1000}; do curl http://db1.ourdomain.local:9200; sleep 2; date;  done Percona XtraDB Cluster Node is synced.
Fri May  3 07:32:23 EDT 2013
curl: (56) Failure when receiving data from the peer            <----- error
Fri May  3 07:32:25 EDT 2013
curl: (56) Failure when receiving data from the peer            <----- error
Fri May  3 07:32:27 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:32:29 EDT 2013
curl: (56) Failure when receiving data from the peer            <----- error
Fri May  3 07:32:31 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:32:33 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:32:35 EDT 2013
Percona XtraDB Cluster Node is synced.
Fri May  3 07:32:37 EDT 2013

The solution in the prior post was to set the "Content-Length:" but the script I'm using already attempts to set the content-length:

if [[ "${WSREP_STATUS}" == "4" ]] || [[ "${WSREP_STATUS}" == "2" && ${AVAILABLE_WHEN_DONOR} == 1 ]]
then
    # Percona XtraDB Cluster node local state is 'Synced' => return HTTP 200
    # Shell return-code is 0
    echo -en "HTTP/1.1 200 OK\r\n"
    echo -en "Content-Type: text/plain\r\n"
    echo -en "Connection: close\r\n"
    echo -en "Content-Length: 40\r\n"
    echo -en "\r\n"
    echo -en "Percona XtraDB Cluster Node is synced.\r\n"
    exit 0
else
    # Percona XtraDB Cluster node local state is not 'Synced' => return HTTP 503
    # Shell return-code is 1
    echo -en "HTTP/1.1 503 Service Unavailable\r\n"
    echo -en "Content-Type: text/plain\r\n"
    echo -en "Connection: close\r\n"
    echo -en "Content-Length: 44\r\n"
    echo -en "\r\n"
    echo -en "Percona XtraDB Cluster Node is not synced.\r\n"
    exit 1
fi

I tried changing the content-length to zero as was recommended. echo -en "Content-Length: 0\r\n" in both the if and else statements - but that didn't seem to help in my case.

Here's what I see when I run curl in verbose mode:

Fri May  3 08:34:33 EDT 2013
* About to connect() to db1.ourdomain.local port 9200 (#0)
*   Trying 1.2.3.4... connected
* Connected to db1..local (1.2.3.4) port 9200 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: db1.ourdomain.local:9200
> Accept: */*
>
< HTTP/1.1 200 OK
* Closing connection #0
* Failure when receiving data from the peer
curl: (56) Failure when receiving data from the peer

If I use curl locally or if I use telnet remotely it all seems to work properly. Its just curl remotely that's an issue. Unfortunately the hardware load balancer we use requires that I perform an http check (no telnet option).

How can I troubleshoot this further?

Thanks! Brad

EDIT - Adding contents of the xinetd script:

cat /etc/xinetd.d/mysqlchk
# default: on
# description: mysqlchk
service mysqlchk
{
# this is a config for xinetd, place it in /etc/xinetd.d/
    disable = no
    flags           = REUSE
    socket_type     = stream
    port            = 9200
    wait            = no
    user            = nobody
    server          = /usr/bin/clustercheck
    log_type        = FILE /var/log/xinetdlog
    log_on_failure  += USERID
    only_from       = 0.0.0.0/0
    # recommended to put the IPs that need
    # to connect exclusively (security purposes)
    per_source      = UNLIMITED
}

score 1 · Answer 1 · answered May 03 '13 at 13:17

1

What I notice in your debug-session is that your curl closes the connection right after it receives the first line of the response. It doesn't receive the "Content-Length" header, and thus does not use it (that's why it doesn't matter if you set it to 0). With me, it looks like this:

* About to connect() to vm0010 port 9200 (#0)
*   Trying 1.2.3.4... connected
* Connected to vm0010 (1.2.3.4) port 9200 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.21.0 (x86_64-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.15 libssh2/1.2.6
> Host: vm0010:9200
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Connection: close
< Content-Length: 40
< 
Percona XtraDB Cluster Node is synced.
* Closing connection #0

How is your xinetd configured for this script?

answered May 03 '13 at 13:17

Olaf van Zandwijk

86
1

Added the contents of /etc/xinetd.d/mysqlchk above in the description. – Brad May 03 '13 at 13:23
Packet Capture Demonstrating the issue: https://www.dropbox.com/s/u2b9asn1p5vyh0r/data.pcap – Brad May 03 '13 at 13:23
You can see failed requests don't contain an http payload from the packet capture: https://f.cloud.github.com/assets/4331284/458997/316802aa-b3f4-11e2-91db-daec37664ccf.png – Brad May 03 '13 at 13:24
When I look at the 'wrong' TCP stream in your cap file, I see that the TCP RST (reset) is sent directly after the "HTTP/1.1 200 OK" line. So, the script *does* return something, but not everything that it's supposed to return. My first response to this: since there is no way that the script itself exits after the "HTTP/1.1 200 OK" line, something else must end the connection. – Olaf van Zandwijk May 03 '13 at 14:05
If you start xinetd with `/usr/sbin/xinetd -d -dontfork` it starts in the foreground in debugging mode. Does that tell you anything useful? – Olaf van Zandwijk May 03 '13 at 14:11
The original description is getting quite long so I've pasted the output to pastebin - http://pastebin.com/EmgAZLAr To answer your question though - I don't see a difference in the output when its successful vs failed. – Brad May 03 '13 at 14:24
Any other thoughts? – Brad May 03 '13 at 19:06
I tried several things, but I cannot reproduce this or get similar problems. Like I said before, the script by itself will not exit after printing the "HTTP/1.1 200 OK", so something else is very likely to be the problem. Presumably xinetd or some other network issue. Also, because you told that you don't have this problem locally (only over the network), I think that the problem is not in the script itself. One thing you can try to debug is use `tee`: replace all echo's in the script with `echo ... | tee -a debug.log` and see what gets written there. – Olaf van Zandwijk May 04 '13 at 07:54
Bizarrely if I add "| tee -a debug" to the end of all the echo statements I can't reproduce the error any more! If I remove the "tee -a debug" statements the issue resurfaces. Now obviously I don't want to leave | tee -a debug.log statements in the script as I'll end up with one heck of a big debug.log file eventually. Any ideas why I'd be seeing this odd behavior? – Brad May 06 '13 at 21:35
I got rid of all but one | tee -a /tmp/debug.log command and its still working. But why is that required? Whats different when we do a |tee vs no |tee? – Brad May 06 '13 at 23:05
This also works: echo -en "Percona XtraDB Cluster Node is synced.\r\n" | tee -a /dev/null and echo -en "Percona XtraDB Cluster Node is not synced.\r\n" | tee -a /dev/null. I guess that eliminates the growing debug.log file although seems a bit wasteful and I don't understand why it won't work without it. – Brad May 06 '13 at 23:09

score 1 · Accepted Answer · answered May 08 '13 at 16:20

The problem is that this check script answers HAproxy as a single stream without respecting the HTTP protocol. Introducing sleep is something that seems to work for my setup.

then
    # Cluster node state is 'OK' => return HTTP 200
    /bin/echo -en "HTTP/1.1 200 OK\r\n"
sleep 0.1
    /bin/echo -en "Content-Length: 26\r\n"
sleep 0.1
    /bin/echo -en "Content-Type: text/plain\r\n"
sleep 0.1
    /bin/echo -en "\r\n"
sleep 0.1
    /bin/echo -en "Cluster Node is GOOD.\r\n"
sleep 0.1
    /bin/echo -en "\r\n"
    exit 0
else
    # Cluster node local state is 'BAD' => return HTTP 503
    /bin/echo -en "HTTP/1.1 503 Service Unavailable\r\n"
sleep 0.1
    /bin/echo -en "Content-Length: 0\r\n"
sleep 0.1
    /bin/echo -en "Content-Type: text/plain\r\n"
sleep 0.1
    /bin/echo -en "Connection: close\r\n"
sleep 0.1
    /bin/echo -en "\r\n"
sleep 0.1
    /bin/echo -en "Cluster Node state is BAD.\r\n"
sleep 0.1
    /bin/echo -en "\r\n"
sleep 0.1
    exit 1

This seems to have solved my issue permanently. Thank you! – Brad Jun 18 '13 at 15:53 — Brad, Jun 18 '13 at 15:53

Curl Failure when receiving data from peer using Percona XtraDB Cluster Check

2 Answers2