
Please excuse the naivety of my question, but this is not a subject I know much about at present.

My company is currently running Kubernetes-managed fluentd processes to push logs to Elasticsearch. These fluentd processes start up, fail immediately after startup, and then start up again, over and over.

The fluentd processes are running inside Docker containers on a CoreOS AWS instance.

When I look at any of the logs of the 15 fluentd nodes that are running, they all show the same thing. Here is a cut-down version of those logs, with some repetitions and the timestamps removed:

Connection opened to Elasticsearch cluster => {:host=>"elasticsearch-logging", :port=>9200, :scheme=>"http"}
process finished code=9 
fluentd main process died unexpectedly. restarting.
starting fluentd-0.12.29
gem 'fluent-mixin-config-placeholders' version '0.4.0'
gem 'fluent-mixin-plaintextformatter' version '0.2.6'
gem 'fluent-plugin-docker_metadata_filter' version '0.1.3'
gem 'fluent-plugin-elasticsearch' version '1.5.0'
gem 'fluent-plugin-kafka' version '0.3.1'
gem 'fluent-plugin-kubernetes_metadata_filter' version '0.24.0'
gem 'fluent-plugin-mongo' version '0.7.15'
gem 'fluent-plugin-rewrite-tag-filter' version '1.5.5'
gem 'fluent-plugin-s3' version '0.7.1'
gem 'fluent-plugin-scribe' version '0.10.14'
gem 'fluent-plugin-td' version '0.10.29'
gem 'fluent-plugin-td-monitoring' version '0.2.2'
gem 'fluent-plugin-webhdfs' version '0.4.2'
gem 'fluentd' version '0.12.29'
adding match pattern="fluent.**" type="null"
adding filter pattern="kubernetes.*" type="parser"
adding filter pattern="kubernetes.*" type="parser"
adding filter pattern="kubernetes.*" type="parser"
adding filter pattern="kubernetes.**" type="kubernetes_metadata"
adding match pattern="**" type="elasticsearch"
adding source type="tail"
adding source type="tail"
adding source type="tail"
...
using configuration file: <ROOT>
   <match fluent.**>
     type null
   </match>
   <source>
     type tail
     path /var/log/containers/*.log
     pos_file /var/log/es-containers.log.pos
     time_format %Y-%m-%dT%H:%M:%S.%NZ
     tag kubernetes.*
     format json
     read_from_head true
   </source>
   <filter kubernetes.*>
     @type parser
     format json
     key_name log
     reserve_data true
     suppress_parse_error_log true
   </filter> 
...
...
   <match **>
     type elasticsearch
     log_level info
     include_tag_key true
     host elasticsearch-logging
     port 9200
     logstash_format true
     buffer_chunk_limit 2M
     buffer_queue_limit 32
     flush_interval 5s
     max_retry_wait 30
     disable_retry_limit 
     num_threads 8
   </match> 
</ROOT>
following tail of /var/log/containers/node-exporter-rqwwn_prometheus_node-exporter-78027c5c818ab42a143fdd684ce2e71bf15cc22e085cfb4f0155854d2248d572.log
following tail of /var/log/containers/fluentd-elasticsearch-0qc6r_kube-system_fluentd-elasticsearch-fccf8db40a19df4a84575c77ac845921386db098d96ef27d1f565da1d928c336.log
following tail of /var/log/containers/node-exporter-rqwwn_prometheus_POD-65ed0741bb78a32e6e129ebc9a96b56284f32d81aba0d66c129df02c9e05fb5b.log
following tail of /var/log/containers/alertmanager-1407110495-s8j6k_prometheus_POD-1807d1ab9c99ce2c4da81fcd5b589e604f4c0dc85cc85a351706b52dc747d21b.log
...
following tail of /var/log/containers/rail-prod-v071-n0zgz_prod_rail-a301220a36cf2a2a537668db44197e2c029f9cc1c60c345218909cd86a84e717.log
Connection opened to Elasticsearch cluster => {:host=>"elasticsearch-logging", :port=>9200, :scheme=>"http"}
process finished code=9
fluentd main process died unexpectedly. restarting.
starting fluentd-0.12.29 
...

I imagine that not enough memory has been configured, or something along those lines, which is causing the services to restart immediately on startup. Does the message "process finished code=9" point to a particular issue?
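
For what it's worth, this is roughly how I expect the last termination reason could be checked once someone with kubectl access to the cluster can run it. The namespace and pod name below are taken from the tailed log file names above and are just examples, so substitute any of the failing fluentd pods:

    # List the fluentd pods along with their restart counts.
    kubectl get pods -n kube-system | grep fluentd-elasticsearch

    # "Last State" in the output should show the termination reason
    # (e.g. OOMKilled) and the exit code for the previous run.
    kubectl describe pod fluentd-elasticsearch-0qc6r -n kube-system

    # Or pull just the last termination state as JSON.
    kubectl get pod fluentd-elasticsearch-0qc6r -n kube-system -o json \
      | jq '.status.containerStatuses[].lastState'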

If someone has seen something like this before, please help me with your comments. Thanks.

N Singh
  • My thought is also that it's likely a memory issue, but try running `kubectl get pod crashloopbackoff -o json | jq ".status.containerStatuses[].lastState"` to get some more info on what is going wrong. – Ian Lewis Jul 10 '17 at 01:56
  • Hi Ian, Thanks for this suggestion. I will definitely try it as soon as I can access the instance(s) in question. Currently I do not have access to the machine itself - only the logs that are scraped from it! Having said that - taking one instance at random - it is a CoreOS machine of size m4.2xlarge so is quite a powerful machine. – Paul Pritchard Jul 11 '17 at 05:56
  • Ian, having run the command you suggested above it tells me (although it was not me that actually got to run the command) that OOMKiller is stopping the fluentd daemons. Further, the pods were allocated 200Mi and used that full amount. These have now been increased to 500Mi and again use the full amount before being terminated by OOMKiller. – Paul Pritchard Jul 12 '17 at 07:53
  • The memory for each pod has now been increased from 200MiB to 1000MiB. This 'seems' to have resolved the restarting issue. However, it seems to have raised other issues now. Ho hum! – Paul Pritchard Jul 19 '17 at 12:06
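
For anyone who finds this later, here is a minimal sketch of the kind of memory increase described in the comments above, assuming the fluentd pods are managed by a DaemonSet named fluentd-elasticsearch in kube-system (the DaemonSet and container names are guesses based on the pod names in the logs, so check them with `kubectl get ds -n kube-system` first):

    # Raise the memory request/limit on the fluentd containers to 1000Mi.
    # Object and container names are assumptions; adjust to match the cluster.
    kubectl set resources daemonset fluentd-elasticsearch -n kube-system \
      --containers=fluentd-elasticsearch \
      --requests=memory=1000Mi --limits=memory=1000Mi

    # Watch the pods get recreated and confirm they stop restarting.
    kubectl get pods -n kube-system -w | grep fluentd-elasticsearch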
