This is a follow-on question from this previous question, created because I found out more information and it's cleaner to pose this as a new question.
I'm using syslog-ng OSE v3.31.2 to receive RFC3164 syslog messages over UDP port 514 from a bunch of clients, write them to a file, and forward them to telegraf as RFC5424 over non-TLS TCP port 601 for insertion into an InfluxDB database.
My syslog-ng config is:
@version: 3.29
@include "scl.conf"
options {
    flush-lines(1);
};
source s_network {
    udp(ip(0.0.0.0) port(514));
};
destination d_file {
    file("/var/log/messages");
};
destination d_telegraf {
    syslog("telegraf" port(601) transport(tcp));
};
log {
    source(s_network);
    destination(d_telegraf);
    destination(d_file);
};
The relevant part of my telegraf config looks like this:
[global_tags]
[agent]
  interval = "100ms"
  round_interval = true
  metric_buffer_limit = 10000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "100ms"
  flush_jitter = "0s"
  debug = true
  quiet = false
[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "logs_db"
[[inputs.syslog]]
  server = "tcp://telegraf:601"
Essentially syslog-ng is set up to forward syslog entries over a TCP connection to telegraf.
The problem is that I'm seeing syslog-ng suffer frequent TCP disconnections from telegraf. These show up in the syslog-ng log as:
[2021-11-17T02:55:32.662972] EOF occurred while idle; fd='12'
[2021-11-17T02:55:32.663102] Syslog connection closed; fd='12', server='AF_INET(192.168.0.6:601)', time_reopen='60'
[2021-11-17T02:56:32.719139] Syslog connection established; fd='12', server='AF_INET(192.168.0.6:601)', local='AF_INET(0.0.0.0:0)'
This disconnection is usually triggered when I send a log to syslog-ng with:
logger -i -d --server localhost test
But if I just leave it all idle I'll also get:
[2021-11-17T02:57:05.392356] EOF on control channel, closing connection;
In these cases, 192.168.0.6 is the telegraf server.
Although I can set the option time-reopen(1) to speed up the reconnection, I'd prefer to find the root cause and prevent the disconnection in the first place.
Is it possible that there is an incompatibility between syslog-ng and telegraf that is causing this EOF and an unclean disconnection?
All of this is running within a docker-compose stack on a single host.
EDIT: I've started looking into RFC5424 and RFC6587. Using Wireshark to sniff packets out of syslog-ng, destined for telegraf, I've determined that these are using octet-stuffing (aka non-transparent framing), rather than octet-counting, which telegraf expects by default. The payload of each syslog message to telegraf begins with a "<" character rather than an integer.
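To make the difference between the two RFC6587 framings concrete, here is a minimal Python sketch. The function names are my own for illustration and are not taken from syslog-ng or telegraf; the check in `guess_framing` is just the same "does the payload start with `<` or with a digit" test I applied by eye in Wireshark.

```python
def frame_octet_counting(msg: str) -> bytes:
    """RFC6587 octet-counting: '<length> <message>' with no trailer."""
    payload = msg.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload


def frame_non_transparent(msg: str, trailer: bytes = b"\n") -> bytes:
    """RFC6587 non-transparent framing: message followed by a trailer (LF here)."""
    return msg.encode("utf-8") + trailer


def guess_framing(data: bytes) -> str:
    """Rough guess at the framing of a captured TCP payload from its first byte."""
    first = data[:1]
    if first == b"<":
        # The payload starts directly with the PRI field, so no length prefix.
        return "non-transparent"
    if first.isdigit() and first != b"0":
        # MSG-LEN starts with a non-zero digit in octet-counting.
        return "octet-counting"
    return "unknown"
```

Feeding the first bytes of a captured payload to `guess_framing` should report `non-transparent` for what I'm seeing on the wire.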
I hypothesise that telegraf is accepting these messages but getting stuck parsing them, and therefore closing the connection. The first FIN to close the connection comes from telegraf.
Unfortunately, when I set telegraf to accept non-transparent framing, it rejects the entire entry, and I haven't worked out why yet.
I also haven't yet figured out how to configure syslog-ng to output messages with octet-counting framing.
At least the EOF message and the disconnection have stopped happening, but I'm not sure that means much if telegraf is rejecting all messages outright.
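One way to take syslog-ng out of the picture would be to send telegraf a single hand-built RFC5424 message with octet-counting framing and see whether it is accepted. The following is only a sketch under my own assumptions: the helper names are hypothetical, and the host name `telegraf` and port `601` are simply the values from my config above.

```python
import socket


def build_octet_counted_frame(msg: str) -> bytes:
    """Prefix the message with its byte length and a space (RFC6587 octet-counting)."""
    payload = msg.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload


def send_octet_counted(host: str, port: int, msg: str, timeout: float = 5.0) -> bytes:
    """Open a TCP connection, send one octet-counted frame, and close the connection."""
    frame = build_octet_counted_frame(msg)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(frame)
    return frame
```

From a container on the same docker-compose network, a call such as `send_octet_counted("telegraf", 601, "<13>1 2021-11-17T02:55:32Z testhost logger - - - test")` should then show up in the telegraf debug log if the default octet-counting input is working.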