This is a follow-on from a previous question. I've since found out more information, and it's cleaner to pose this as a new question.

I'm using syslog-ng OSE v3.31.2 to receive RFC3164 syslog messages over UDP port 514 from a bunch of clients, write them to a file, and forward them to telegraf over non-TLS RFC5424 TCP on port 601 for insertion into an InfluxDB database.

My syslog-ng config is:

@version: 3.29
@include "scl.conf"

options {
    flush-lines(1);
};
    
source s_network {
    udp(ip(0.0.0.0) port(514));
};

destination d_file {
    file("/var/log/messages");
};
    
destination d_telegraf {
    syslog("telegraf" port(601) transport(tcp));
};
    
log {
    source(s_network);
    destination(d_telegraf);
    destination(d_file);
};

The relevant part of my telegraf config looks like this:

[global_tags]

[agent]
  interval = "100ms"
  round_interval = true
  metric_buffer_limit = 10000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "100ms"
  flush_jitter = "0s"
  debug = true
  quiet = false

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "logs_db"

[[inputs.syslog]]
  server = "tcp://telegraf:601"

Essentially syslog-ng is set up to forward syslog entries over a TCP connection to telegraf.

The problem is that I'm seeing syslog-ng suffer frequent TCP disconnections from telegraf. These show up in the syslog-ng log as:

[2021-11-17T02:55:32.662972] EOF occurred while idle; fd='12'
[2021-11-17T02:55:32.663102] Syslog connection closed; fd='12', server='AF_INET(192.168.0.6:601)', time_reopen='60'
[2021-11-17T02:56:32.719139] Syslog connection established; fd='12', server='AF_INET(192.168.0.6:601)', local='AF_INET(0.0.0.0:0)'

This disconnection is usually triggered when I send a log to syslog-ng with:

logger -i -d --server localhost test

But if I just leave it all idle I'll also get:

[2021-11-17T02:57:05.392356] EOF on control channel, closing connection;

In these cases, 192.168.0.6 is the telegraf server.

Although I can set the option time-reopen(1) to speed up the reconnection, I'd prefer to find the root cause and prevent the disconnection in the first place.

Is it possible that there is an incompatibility between syslog-ng and telegraf that is causing this EOF and an unclean disconnection?

All of this is running within a docker-compose stack on a single host.


EDIT: I've started looking into RFC5424 and RFC6587. Using Wireshark to sniff packets out of syslog-ng, destined for telegraf, I've determined that these are using octet-stuffing (aka non-transparent framing), rather than octet-counting, which telegraf expects by default. The payload of each syslog message to telegraf begins with a "<" character rather than an integer.
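For reference, the two RFC6587 framings differ only in how message boundaries are marked on the TCP stream. The following is a minimal Python sketch of that difference, not code from syslog-ng or telegraf; the helper names (`frame_octet_counting`, `frame_non_transparent`, `guess_framing`, `split_octet_counted`) and the sample message are made up for illustration.

```python
# Sketch of the two RFC 6587 framings for syslog over TCP.
# All function names are illustrative; they are not syslog-ng or telegraf APIs.


def frame_octet_counting(msg: bytes) -> bytes:
    """Octet-counting: MSG-LEN, a space, then the message itself."""
    return str(len(msg)).encode("ascii") + b" " + msg


def frame_non_transparent(msg: bytes, trailer: bytes = b"\n") -> bytes:
    """Non-transparent framing: the message followed by a trailer (LF here)."""
    return msg + trailer


def guess_framing(payload: bytes) -> str:
    """Guess the framing from the first byte at a message boundary."""
    first = payload[:1]
    if first.isdigit():
        return "octet-counting"   # starts with the length prefix
    if first == b"<":
        return "non-transparent"  # starts directly with the <PRI> field
    return "unknown"


def split_octet_counted(stream: bytes) -> list:
    """Split a buffer of complete octet-counted frames into messages."""
    messages = []
    pos = 0
    while pos < len(stream):
        space = stream.index(b" ", pos)      # end of the length prefix
        length = int(stream[pos:space])
        start = space + 1
        messages.append(stream[start:start + length])
        pos = start + length
    return messages
```

Note that `guess_framing` is only meaningful on the first byte of a frame. A capture that is inspected from the `<` of the syslog message itself, rather than from the start of the TCP payload, would look the same for both framings.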

I hypothesise that telegraf is accepting these messages but getting stuck parsing them, and therefore closing the connection. The first FIN to close the connection comes from telegraf.

Unfortunately when I set telegraf to accept non-transparent framing it rejects the entire entry and I haven't worked out why yet.

I also haven't yet figured out how to configure syslog-ng to output messages with octet-counting framing.

At least the EOF message and disconnection have stopped happening, but I'm not sure that means much if telegraf is rejecting all messages outright.

davidA

1 Answer

I've since determined that syslog-ng is sending octet-counting framed messages to telegraf, contrary to what I concluded in the edit to my question.

The cause of this issue is that telegraf closes the TCP connection from syslog-ng after 5 seconds without receiving a message. This is contrary to the documentation provided with the telegraf syslog plugin, which states that this timeout applies only to the time taken to receive a single message, not to the time between messages. It may be an English language / interpretation issue, though. Setting read_timeout to 0 in the telegraf config is sufficient to prevent telegraf from disconnecting.
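The behaviour described above, where a receiver applies its read deadline to the gap between messages and then closes the connection, can be reproduced in a few lines. This is only an illustrative Python sketch, not telegraf's code: it uses a local socket pair instead of a TCP connection, a hypothetical `idle_disconnect_demo` function, and a 0.2 second timeout standing in for the 5 seconds seen here.

```python
import socket


def idle_disconnect_demo(read_timeout: float = 0.2) -> list:
    """Show what a sender sees when the receiver times out an idle connection.

    Illustration only: a local socket pair stands in for the
    syslog-ng -> telegraf TCP connection.
    """
    events = []
    sender, receiver = socket.socketpair()
    try:
        # The receiver applies a deadline to every read, including reads
        # that are simply waiting for the next message to arrive.
        receiver.settimeout(read_timeout)

        # One octet-counted message is delivered and read normally.
        sender.sendall(b"5 hello")
        events.append(("receiver read", receiver.recv(1024)))

        # The sender now goes idle; the receiver's next read hits the deadline
        # and the receiver closes its end of the connection.
        try:
            receiver.recv(1024)
        except socket.timeout:
            events.append(("receiver timed out", None))
            receiver.close()

        # The idle sender observes the close as end-of-file (an empty read).
        sender.settimeout(1.0)
        events.append(("sender read", sender.recv(1024)))
    finally:
        sender.close()
        receiver.close()
    return events
```

The final empty read on the sender side corresponds to the "EOF occurred while idle" line in the syslog-ng log: the connection was closed by the peer while nothing was being sent.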

davidA