
We have a service that uses raw AWS EC2 instances to field requests. Response times are in the 2-3 ms range. That response time is important to the health and success of the service.

Part of the service's job is to send the request details elsewhere so they can be aggregated and presented to customers for historical purposes. We've solved this by logging via Syslog to a specific facility, which Rsyslog is configured to write to /var/log/local5.log; the AWS Kinesis agent then tails that log file and sends the JSON logs to a Kinesis stream in batches. We explored writing to SQS/SNS/Kinesis directly from the app, but each of those added 10-30 ms to response times, which made them a non-starter. We also tried running a background queue on the local system, but that took up resources and caused more problems than it was worth.
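
For context, the app side looks roughly like this (a minimal sketch, not our exact code; the program name and log shape are made up):

```ruby
require 'syslog'
require 'json'

# Minimal sketch of the app-side logging. Assumes rsyslog has a rule like
#   local5.*  /var/log/local5.log
# and the Kinesis agent tails that file.
Syslog.open('request-service', Syslog::LOG_NDELAY, Syslog::LOG_LOCAL5) unless Syslog.opened?

def log_request(details)
  # '%s' keeps any '%' inside the JSON from being treated as a format directive
  Syslog.log(Syslog::LOG_INFO, '%s', JSON.generate(details))
end
```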

We recently began moving this service to containers, which raised the question of how to aggregate those logs. I figured there were a few options:

  1. Run a sidecar process that listens for syslog messages and handles them locally
  2. Run a separate service, scaled independently of the application, that listens for syslog messages
  3. Something I'm not thinking of...

I attempted to use Vector as a Syslog source on an external service (assign it DNS and have the local app log to the remote syslog), but the library the application was using for remote Syslog cuts messages off at some length (~1000 characters / https://github.com/reproio/remote_syslog_sender#message-length), which isn't big enough for some of the logs we'd like to send.

This made me wonder whether the requests we're logging now via Syslog (the Ruby Syslog standard library) are being truncated as well. I'm not sure if that's the case, but I'd like a path that handles every request in full.
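
One quick way I could check (a rough sketch; the 16k test size is arbitrary, and if I remember right rsyslog's default $MaxMessageSize is 8k) is to push an oversized message through the same local path and compare lengths:

```ruby
require 'syslog'

# Rough check: send an oversized message through the same local Syslog path
# and compare its length against what ends up in /var/log/local5.log.
Syslog.open('length-test', Syslog::LOG_NDELAY, Syslog::LOG_LOCAL5)
Syslog.log(Syslog::LOG_INFO, '%s', 'x' * 16_384)
Syslog.close
# Then on the host:  awk '{ print length }' /var/log/local5.log | tail -1
```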

Vector was also challenging to stand up on AWS because, while it can listen for Syslog over UDP, I can't get an AWS NLB to do health checks via UDP, and Vector didn't respond to the health checks over TCP.

Now I'm considering a volume-based solution where we keep logging to a file and continue to tail it. Is this the best approach, or am I missing something?

user607875
  • What order of magnitude of delay (and data loss in case of a server dying) is acceptable to you? (*historical purposes* does not sound like you need the logs within the *hour*) – anx Dec 24 '20 at 15:59
  • Currently, the logs go to Kinesis and then get run through a sequence of Lambda functions. Ultimately, it takes ~30 sec. to make it everywhere after submission to Syslog. I'd like to stay within a minute or so if possible. We can tolerate some loss. Sorry, yeah, "historical" here means people refer to them later, but there are times when they do look for them within a few minutes. – user607875 Dec 24 '20 at 16:41
