We're running Logstash on a server, pushing logs into an Elasticsearch cluster. In the Logstash logs we see it losing its connection to the Elasticsearch servers with a connection reset error. We see this between all of our Logstash and Elasticsearch instances on all servers. In this example, logstash3 (192.168.1.133) loses its connection to elasticsearch4 (192.168.1.88):
[2018-08-21T20:21:21,414][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Connection reset {:url=>http://elasticsearch4:9200/, :error_message=>"Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Connection reset", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
Looking in syslog we can see that UFW is blocking a packet a few seconds before:
Aug 21 20:21:03 logstash3 kernel: [1760565.386970] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=192.168.1.88 DST=192.168.1.133 LEN=40 TOS=0x0 PREC=0x00 TTL=57 ID=442 DF PROTO=TCP SPT=9200 DPT=59338 WINDOW=0 RES=0x00 RST URGP=0
We have a rule in place in ufw to allow all connections between these servers. Initially we only allowed specific ports, but I opened it up to all ports in case that was the issue. The rule allowing connections is:
logstash3:~$ sudo ufw status | grep '\.88'
Anywhere ALLOW 192.168.1.88
If we restart logstash these dropped connections go away for a time (~1 hour), but then start recurring.
Because it starts happening after a while, I'm inclined to think that a queue is filling up somewhere. The connection doesn't stop dead; we just start losing the occasional packet. I thought it might be related to connection tracking, but we're not seeing any logs to that effect. We've increased conntrack_max previously when we've seen it cause problems, but it isn't helping in this case:
elasticsearch4:~$ cat /proc/sys/net/nf_conntrack_max
262144
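Alongside the configured maximum, it may be worth watching how full the table actually gets. A sketch (paths assume the nf_conntrack module is loaded; conntrack_pct is just a hypothetical helper name):

```shell
# conntrack_pct: hypothetical helper giving the percentage of the
# conntrack table in use, from a count and a maximum.
conntrack_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Read the live count and configured max; fall back to 0 / the
# value seen above if the proc files are absent on this kernel.
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/nf_conntrack_max 2>/dev/null || echo 262144)
echo "conntrack usage: $(conntrack_pct "$count" "$max")%"
```

If the table were genuinely overflowing, dmesg should also show "nf_conntrack: table full, dropping packet" entries.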
No evidence of any generic packet drops:
logstash3:~$ ifconfig
enp0s31f6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.133 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 2a01:4f8:121:53b7::2 prefixlen 64 scopeid 0x0<global>
inet6 fe80::921b:eff:fecc:5095 prefixlen 64 scopeid 0x20<link>
ether 90:1b:0e:cc:50:95 txqueuelen 1000 (Ethernet)
RX packets 7746452399 bytes 8566845158411 (8.5 TB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 5792178105 bytes 5164493104253 (5.1 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 16 memory 0xef000000-ef020000
On the advice of @Lenniey I tried disabling UFW to see if we still got the errors in the logstash logs. I disabled ufw on a number of servers and we still saw Connection reset messages in the logs between servers with ufw disabled, so it appears that the ufw entries in syslog may be a symptom rather than the cause.
I produced a packet capture (sudo tcpdump -i enp0s31f6 -s 65535 -w logstash.pcap 'port 9200') while we were seeing the error:
[2018-08-23T07:20:11,108][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elasticsearch3:9200/][Manticore::SocketException] Connection reset {:url=>http://elasticsearch3:9200/, :error_message=>"Elasticsearch Unreachable: [http://elasticsearch3:9200/][Manticore::SocketException] Connection reset", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
In the packet capture I can see elasticsearch respond with a 400 error:
HTTP/1.0 400 Bad Request
content-type: application/json; charset=UTF-8
content-length: 203
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "text is empty (possibly HTTP/0.9)"
}
],
"type": "illegal_argument_exception",
"reason": "text is empty (possibly HTTP/0.9)"
},
"status": 400
}
I then see 5 TCP RST packets from elasticsearch to logstash. The last RST packet was at 07:20:11.107663, which matches the time of the logstash error, 07:20:11,108.
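For reference, pulling the resets out of the capture can be done with a tshark display filter along these lines (rst_filter is just a hypothetical helper to build the filter string; tshark/Wireshark needs to be installed):

```shell
# Build a tshark display filter matching TCP resets on the ES port.
rst_filter() {
    printf 'tcp.port==%s && tcp.flags.reset==1' "$1"
}

echo "$(rst_filter 9200)"
# Run it against the capture taken earlier:
# tshark -r logstash.pcap -Y "$(rst_filter 9200)"
```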
I'm finding the packet capture a little confusing. I've filtered it down to the IPs of the logstash and ES servers in question, and to the ports where we see the 400 and the RSTs. Prior to the 400 there appears to be a _bulk request in progress, then a 413 response arrives in the middle of the _bulk request, then the _bulk request continues, and then we get the 400:
Logstash -> ES
POST /_bulk HTTP/1.1
Connection: Keep-Alive
Content-Type: application/json
Content-Length: 167481062
Host: elasticsearch3.channelgrabber.com:9200
User-Agent: Manticore 0.6.1
Accept-Encoding: gzip,deflate
{ ... data ... }
ES -> Logstash
HTTP/1.1 413 Request Entity Too Large
content-length: 0
Logstash -> ES
{ ... data continues ... }
ES -> Logstash
HTTP/1.0 400 Bad Request
content-type: application/json; charset=UTF-8
content-length: 203
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"text is empty (possibly HTTP/0.9)"}],"type":"illegal_argument_exception","reason":"text is empty (possibly HTTP/0.9)"},"status":400}
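If I'm reading that Content-Length correctly, the bulk request announces a ~167 MB body, which would exceed Elasticsearch's default http.max_content_length of 100mb and might explain the 413 arriving mid-request. A quick sanity check of the arithmetic (the 100mb limit is an assumption; whether it's been overridden would need confirming in elasticsearch.yml):

```shell
# Compare the announced bulk body size against Elasticsearch's
# default http.max_content_length (100mb), assuming no override.
body=167481062                  # Content-Length from the capture above
limit=$((100 * 1024 * 1024))    # 100mb default, in bytes
if [ "$body" -gt "$limit" ]; then
    echo "bulk request exceeds http.max_content_length"
fi
```

If that is the cause, the Logstash-side knob would presumably be the pipeline batch sizing (e.g. pipeline.batch.size), though I haven't confirmed that yet.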
Does anyone have any suggestions on what else I should be looking at?