We're running Logstash on a server, pushing logs into an Elasticsearch cluster. In the Logstash logs we see it losing its connection to the Elasticsearch servers with a connection reset error. We see this between all of our Logstash and Elasticsearch instances on all servers. In this example, logstash3 (192.168.1.133) loses its connection to elasticsearch4 (192.168.1.88):
[2018-08-21T20:21:21,414][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Connection reset {:url=>http://elasticsearch4:9200/, :error_message=>"Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Connection reset", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
Looking in syslog we can see that UFW is blocking a packet a few seconds before:
Aug 21 20:21:03 logstash3 kernel: [1760565.386970] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=192.168.1.88 DST=192.168.1.133 LEN=40 TOS=0x0 PREC=0x00 TTL=57 ID=442 DF PROTO=TCP SPT=9200 DPT=59338 WINDOW=0 RES=0x00 RST URGP=0
We have a rule in place in ufw to allow all connections between these servers. Initially we only allowed specific ports, but I opened it up to all ports in case that was the issue. The rule allowing connections is:
logstash3:~$ sudo ufw status | grep '\.88'
Anywhere ALLOW 192.168.1.88
If we restart logstash these dropped connections go away for a time (~1 hour), but then start recurring.
Because it starts happening after a while, I'm inclined to think that a queue is filling up somewhere. The connection doesn't stop dead; we just start losing the occasional packet. I thought it might be related to connection tracking, but we're not seeing any logs to that effect. We've increased conntrack_max previously when we've seen it cause problems, but it isn't helping in this case:
elasticsearch4:~$ cat /proc/sys/net/nf_conntrack_max
262144
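Alongside the configured maximum, it may be worth watching how full the table actually gets. A sketch (paths assume the nf_conntrack module is loaded; conntrack_pct is just a hypothetical helper name):

```shell
# conntrack_pct: hypothetical helper giving the percentage of the
# conntrack table in use, from a count and a maximum.
conntrack_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Read the live count and configured max; fall back to 0 / the
# value seen above if the proc files are absent on this kernel.
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/nf_conntrack_max 2>/dev/null || echo 262144)
echo "conntrack usage: $(conntrack_pct "$count" "$max")%"
```

If the table were genuinely overflowing, dmesg should also show "nf_conntrack: table full, dropping packet" entries.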
No evidence of any generic packet drops:
logstash3:~$ ifconfig
enp0s31f6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.133 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 2a01:4f8:121:53b7::2 prefixlen 64 scopeid 0x0<global>
inet6 fe80::921b:eff:fecc:5095 prefixlen 64 scopeid 0x20<link>
ether 90:1b:0e:cc:50:95 txqueuelen 1000 (Ethernet)
RX packets 7746452399 bytes 8566845158411 (8.5 TB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 5792178105 bytes 5164493104253 (5.1 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 16 memory 0xef000000-ef020000
On the advice of @Lenniey I tried disabling UFW to see if we still got the errors in the logstash logs. I disabled ufw on a number of servers and we still saw Connection reset messages in the logs between servers with ufw disabled, so it appears that the ufw entries in syslog may be a symptom rather than the cause.
I produced a packet capture (sudo tcpdump -i enp0s31f6 -s 65535 -w logstash.pcap 'port 9200') while we were seeing the error:
[2018-08-23T07:20:11,108][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elasticsearch3:9200/][Manticore::SocketException] Connection reset {:url=>http://elasticsearch3:9200/, :error_message=>"Elasticsearch Unreachable: [http://elasticsearch3:9200/][Manticore::SocketException] Connection reset", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
In the packet capture I can see elasticsearch respond with a 400 error:
HTTP/1.0 400 Bad Request
content-type: application/json; charset=UTF-8
content-length: 203
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "text is empty (possibly HTTP/0.9)"
}
],
"type": "illegal_argument_exception",
"reason": "text is empty (possibly HTTP/0.9)"
},
"status": 400
}
I then see 5 TCP RST packets from elasticsearch to logstash. The last RST packet was at 07:20:11.107663, which matches the time of the logstash error, 07:20:11,108.
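For reference, pulling the resets out of the capture can be done with a tshark display filter along these lines (rst_filter is just a hypothetical helper to build the filter string; tshark/Wireshark needs to be installed):

```shell
# Build a tshark display filter matching TCP resets on the ES port.
rst_filter() {
    printf 'tcp.port==%s && tcp.flags.reset==1' "$1"
}

echo "$(rst_filter 9200)"
# Run it against the capture taken earlier:
# tshark -r logstash.pcap -Y "$(rst_filter 9200)"
```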
I'm finding the packet capture a little confusing. I've filtered it down to the IPs of the logstash and ES servers in question, and to the ports where we see the 400 and the RSTs. Prior to the 400 there appears to be a _bulk request in progress, then a 413 response arrives in the middle of the _bulk request, then the _bulk request continues, and then we get the 400:
Logstash -> ES
POST /_bulk HTTP/1.1
Connection: Keep-Alive
Content-Type: application/json
Content-Length: 167481062
Host: elasticsearch3.channelgrabber.com:9200
User-Agent: Manticore 0.6.1
Accept-Encoding: gzip,deflate
{ ... data ... }
ES -> Logstash
HTTP/1.1 413 Request Entity Too Large
content-length: 0
Logstash -> ES
{ ... data continues ... }
ES -> Logstash
HTTP/1.0 400 Bad Request
content-type: application/json; charset=UTF-8
content-length: 203
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"text is empty (possibly HTTP/0.9)"}],"type":"illegal_argument_exception","reason":"text is empty (possibly HTTP/0.9)"},"status":400}
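If I'm reading that Content-Length correctly, the bulk request announces a ~167 MB body, which would exceed Elasticsearch's default http.max_content_length of 100mb and might explain the 413 arriving mid-request. A quick sanity check of the arithmetic (the 100mb limit is an assumption; whether it's been overridden would need confirming in elasticsearch.yml):

```shell
# Compare the announced bulk body size against Elasticsearch's
# default http.max_content_length (100mb), assuming no override.
body=167481062                  # Content-Length from the capture above
limit=$((100 * 1024 * 1024))    # 100mb default, in bytes
if [ "$body" -gt "$limit" ]; then
    echo "bulk request exceeds http.max_content_length"
fi
```

If that is the cause, the Logstash-side knob would presumably be the pipeline batch sizing (e.g. pipeline.batch.size), though I haven't confirmed that yet.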
Does anyone have any suggestions on what else I should be looking at?