We are running an Elastic Stack (7.8) with ECK in EKS. We noticed that our Filebeat daemonset, which uses the AWS module, was not processing logs from S3 and that our SQS queues were backing up. Looking at the logs on the Filebeat containers, we saw the following error repeating in each of the Filebeat pods:

    ERROR [s3] s3/input.go:206 SQS ReceiveMessageRequest failed:
    EC2RoleRequestError: no EC2 instance role found caused by:
    EC2MetadataError: failed to make Client request

We have found:

  1. Restarting the pods did not resolve the issue.
  2. Removing and reapplying the daemonset did not resolve the issue.
  3. Restarting the nodes did not resolve the issue.
  4. Terminating the nodes and allowing them to be replaced DID resolve the issue; the Filebeat pods began to process data again. (The commands we used for these steps are sketched below.)
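
For reference, the commands we used for each step were roughly the following. The namespace, daemonset name, label selector, and manifest filename are assumptions here (adjust to your own deployment); node restarts and terminations were done from the EC2 console / auto scaling group, not kubectl.

    # 1. Restart the pods by deleting them (the daemonset recreates them)
    #    label selector is an assumption; adjust to your manifests
    kubectl -n kube-system delete pod -l k8s-app=filebeat

    # 2. Remove and reapply the daemonset
    kubectl -n kube-system delete daemonset filebeat
    kubectl apply -f filebeat-kubernetes.yaml

    # 3./4. Drain a node before rebooting or terminating it
    kubectl drain <node-name> --ignore-daemonsets --delete-local-data
    # ...then reboot or terminate the instance from the EC2 console / ASG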

This issue has recurred many times and at different intervals: the Filebeat pods will process data successfully for weeks at a time, or sometimes only a few days, before the error returns.

Although we have a workaround (replacing the nodes), we still wanted to understand why this was happening. Working with AWS, we finally tracked down what we believe is a hint at the root cause.

When the pods start up successfully, we see a message like the following:

{"log":"2021-02-17T17:08:47.378Z\u0009INFO\u0009[add_cloud_metadata]
\u0009add_cloud_metadata/add_cloud_metadata.go:93\u0009add_cloud_metadata: hosting provider
type detected as aws, metadata={\"account\":{\"id\":\"##########\"},\"availability_zone
\":\"#########\",\"image\":{\"id\":\"#############\"},\"instance\":{\"id\":\"i-#########
\"},\"machine\":{\"type\":\"#######\"},\"provider\":\"aws\",\"region\":\"######
\"}\n","stream":"stderr","time":"2021-02-17T17:08:47.379170444Z"}

If we restart a pod and monitor it on a "failed" node, we see this log message:

{"log":"2021-02-17T17:08:47.439Z\u0009INFO\u0009[add_cloud_metadata]
\u0009add_cloud_metadata/add_cloud_metadata.go:89\u0009add_cloud_metadata: hosting provider
 type not detected.\n","stream":"stderr","time":"2021-02-17T17:08:47.439427267Z"}

We have also verified that the pods themselves CAN hit the EC2 metadata endpoint successfully via curl.
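
For the record, that check from inside one of the Filebeat pods looked roughly like this (the instance-id path is just an example; the IMDSv2 token step is only needed if the instances enforce IMDSv2):

    # exec into a Filebeat pod first, e.g. kubectl -n kube-system exec -it <filebeat-pod> -- sh

    # plain IMDSv1-style request
    curl -s http://169.254.169.254/latest/meta-data/instance-id

    # IMDSv2-style request, in case the instances require session tokens
    TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id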

Any information on how to resolve this recurring issue would be appreciated.

Filebeat Config:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: filebeat-config
    data:
      filebeat.yml: |-
        filebeat.config:
          modules:
            path: ${path.config}/modules.d/*.yml
            reload.enabled: true

        processors:
          - add_cloud_metadata: ~
          - add_docker_metadata: ~

        output.elasticsearch:
          workers: 5
          bulk_max_size: 5000
          hosts: ["${ES_HOSTS}"]
          username: "${ES_USER}"
          password: "${ES_PASSWORD}"
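
The AWS module itself is enabled through modules.d (picked up because reload.enabled is true above). The actual module file isn't included here, but a minimal sketch of an S3 fileset backed by an SQS queue looks roughly like this; the fileset name and the queue URL are placeholders, not our real values:

    # modules.d/aws.yml (sketch)
    - module: aws
      s3access:
        enabled: true
        # SQS queue that receives the S3 event notifications the input polls
        var.queue_url: https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>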

1 Answer

It's not really an answer as to why the above was happening, but for anyone else who runs into this issue: we swapped the EC2 nodes from m5.xlarge to r5.xlarge and have not seen the problem since, which makes me believe it was likely a memory issue of some kind. For completeness, we also turned Filebeat logging up to 'debug' as suggested above, but I doubt that made any difference.
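
For reference, turning on debug logging was just an addition to the filebeat.yml in the ConfigMap above, roughly like this (logging.selectors is optional; the selector names here just match the components from the log lines in the question):

    logging.level: debug
    # optionally limit debug output to the relevant components
    logging.selectors: ["s3", "add_cloud_metadata"]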