
I am running a GKE cluster, and sometimes, one of the nodes has issues with specific containers built from php7-alpine.

We run two types of containers: the first type is built from php7-alpine, and the second type is built from the first (php7-alpine -> Base App -> App with extra). Only our Base App pods have these issues.

So far, I've seen the following errors:

  • failed to reserve container name
  • FailedSync: error determining status: rpc error: code = Unknown desc = Error: No such container: XYZ
  • Error: context deadline exceeded context deadline exceeded: CreateContainerError

There is plenty of disk space left on the nodes, and kubectl describe pod doesn't contain any relevant/helpful information.
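
In case it's useful, here is roughly what I've been checking on the node itself so far (a sketch; the node name and zone are placeholders, and it assumes a containerd-based node image with crictl available):

# SSH into the node that hosts the failing pods (placeholder name/zone)
gcloud compute ssh gke-my-cluster-default-pool-abcd --zone europe-west1-b

# Container runtime logs around the time of the failures
sudo journalctl -u containerd --since "1 hour ago" | grep -iE "reserve|deadline"

# List every container the runtime knows about, including dead ones
sudo crictl ps -a | grep my-app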

A few more details:

  • Out of 50 Base App pods, 6 are in error; none of the App with extra pods are failing.
  • All failing pods are always on the same node (see the check after this list).
  • We've recreated/replaced the nodes, but the problem still appears. If we replace the node that has the faulty pods, there's roughly a 50/50 chance that all the pods come up OK on the next node; the problem appears somewhat random.
  • Running GKE v1.17.9-gke.1504
  • We are running on preemptible nodes.
  • The container image is quite big (~3 GB; we're working on reducing that).
  • The issue started roughly a month ago.
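
For reference, this is roughly how I confirm that the failing pods all sit on a single node (the app label matches the deployment further down):

# Show which node each pod landed on; the errored ones always cluster on one node
kubectl get pods -l app=my-app -o wide
kubectl get pods -l app=my-app -o wide | grep -v Running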

I really have no clue what to look for; I've searched extensively for a similar issue. Any help is greatly appreciated!

Update:

Here is the deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: my-app
    appType: web
    env: prod
  name: my-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: my-app
        version: v1.0
    spec:
      containers:
        - image: richarvey/nginx-php-fpm:latest  # We build upon that image to add content and services
          lifecycle:
            preStop:
              exec:
                command:
                  - /entry-point/stop.sh
          name: web
          ports:
            - containerPort: 80
              protocol: TCP
          resources:
            requests:
              cpu: 50m
              memory: 1500Mi
        - image: redis:4.0-alpine
          name: redis
          resources:
            requests:
              cpu: 25m
              memory: 25Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
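
For context, new images are pushed to several of these deployments at once with something along these lines (the label selector and image path are placeholders; web is the container name from the spec above):

# Update the web container on every deployment matching the label in one command
kubectl set image deployments -l appType=web web=gcr.io/my-project/my-app:v1.0.1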
  • Could you share your image or any other image that produces the same behaviour? – Mr.KoopaKiller Oct 07 '20 at 07:59
  • @KoopaKiller Unfortunately, I can't share the image publicly, but from what I understand, the container doesn't even start, so I'm not sure it's relevant (I could be totally wrong here, not a k8s expert!). I've added the deployment above. – username_not_found Oct 07 '20 at 09:02
  • Also, it seems to occur much more often when we update the image on multiple deployments at the same time with `kubectl set image -ltype=abc my/image:latest` – username_not_found Oct 07 '20 at 09:22
  • You said all failing pods are on the same node; how about the node's CPU/memory? What is the size of your cluster? – Mr.KoopaKiller Oct 14 '20 at 08:21
  • Have you tried adding `runtime-request-timeout` on your yaml config? – Alex G Aug 03 '21 at 22:18

1 Answer


The issue was investigated and fixed upstream in containerd:

https://github.com/containerd/containerd/issues/4604

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/512597) – Patrick Mevzek Feb 19 '22 at 10:15
  • The Essential part is that the issue was investigated and fixed. Link is here only for reference. – username_not_found Feb 20 '22 at 12:58
  • "The Essential part is that the issue was investigated and fixed.". No. The essential part would have been exactly what the issue is, and how it was fixed exactly. Both points missing in your answer. – Patrick Mevzek Feb 20 '22 at 20:26