
I've run into a problem doing a rolling-update of our website, which runs in a container in a pod on our cluster, website-cluster. The cluster contains two pods: one runs a container with our production website, and the other runs a container with a staging version of the same site. Here is the YAML for the replication controller for the production pod:

apiVersion: v1
kind: ReplicationController
metadata:
  # These labels describe the replication controller
  labels:
    project: "website-prod"
    tier: "front-end"
    name: "website"
  name: "website"
spec:  # specification of the RC's contents
  replicas: 1
  selector:
    # These labels indicate which pods the replication controller manages
    project: "website-prod"
    tier: "front-end"
    name: "website"
  template:
    metadata:
      labels:
        # These labels belong to the pod, and must match the ones immediately above
        # name: "website"
        project: "website-prod"
        tier: "front-end"
        name: "website"
    spec:
      containers:
      - name: "website"
        image: "us.gcr.io/skywatch-app/website"
        ports:
        - name: "http"
          containerPort: 80
        command: ["nginx", "-g", "daemon off;"]
        livenessProbe:
          httpGet:
            path: "/"
            port: 80
          initialDelaySeconds: 60
          timeoutSeconds: 3

We made a change that added a new page to our website. After deploying it to the production pod, we got intermittent 404s when testing the production site. We use the following commands to update the pod (assuming version 95.0 is currently running):

packer build website.json
gcloud docker push us.gcr.io/skywatch-app/website
gcloud container clusters get-credentials website-cluster --zone us-central1-f
kubectl rolling-update website --update-period=20s --image=us.gcr.io/skywatch-app/website:96.0

Here is the output from these commands:

==> docker: Creating a temporary directory for sharing data...
==> docker: Pulling Docker image: nginx:1.9.7
    docker: 1.9.7: Pulling from library/nginx
    docker: d4bce7fd68df: Already exists
    docker: a3ed95caeb02: Already exists
    docker: a3ed95caeb02: Already exists
    docker: 573113c4751a: Already exists
    docker: 31917632be33: Already exists
    docker: a3ed95caeb02: Already exists
    docker: 1e7c116578c5: Already exists
    docker: 03c02c160fd7: Already exists
    docker: f852bb4464c4: Already exists
    docker: a3ed95caeb02: Already exists
    docker: a3ed95caeb02: Already exists
    docker: a3ed95caeb02: Already exists
    docker: Digest: sha256:3b50ebc3ae6fb29b713a708d4dc5c15f4223bde18ddbf3c8730b228093788a3c
    docker: Status: Image is up to date for nginx:1.9.7
==> docker: Starting docker container...
    docker: Run command: docker run -v /tmp/packer-docker358675979:/packer-files -d -i -t nginx:1.9.7 /bin/bash
    docker: Container ID: 0594bf37edd1311535598971140535166df907b1c19d5f76ddda97c53f884d5b
==> docker: Provisioning with shell script: /tmp/packer-shell010711780
==> docker: Uploading nginx.conf => /etc/nginx/nginx.conf
==> docker: Uploading ../dist/ => /var/www
==> docker: Uploading ../dist => /skywatch/website
==> docker: Uploading /skywatch/ssl/ => /skywatch/ssl
==> docker: Committing the container
    docker: Image ID: sha256:d469880ae311d164da6786ec73afbf9190d2056accedc9d2dc186ef8ca79c4b6
==> docker: Killing the container: 0594bf37edd1311535598971140535166df907b1c19d5f76ddda97c53f884d5b
==> docker: Running post-processor: docker-tag
    docker (docker-tag): Tagging image: sha256:d469880ae311d164da6786ec73afbf9190d2056accedc9d2dc186ef8ca79c4b6
    docker (docker-tag): Repository: us.gcr.io/skywatch-app/website:96.0
Build 'docker' finished.
==> Builds finished. The artifacts of successful builds are:
--> docker: Imported Docker image: sha256:d469880ae311d164da6786ec73afbf9190d2056accedc9d2dc186ef8ca79c4b6
--> docker: Imported Docker image: us.gcr.io/skywatch-app/website:96.0
[2016-05-16 15:09:39,598, INFO] The push refers to a repository [us.gcr.io/skywatch-app/website]
e75005ca29bf: Preparing
5f70bf18a086: Preparing
5f70bf18a086: Preparing
5f70bf18a086: Preparing
0b3fbb980e2d: Preparing
40f240c1cbdb: Preparing
673cf6d9dedb: Preparing
5f70bf18a086: Preparing
ebfc3a74f160: Preparing
031458dc7254: Preparing
5f70bf18a086: Preparing
5f70bf18a086: Preparing
12e469267d21: Preparing
ebfc3a74f160: Waiting
031458dc7254: Waiting
12e469267d21: Waiting
5f70bf18a086: Layer already exists
673cf6d9dedb: Layer already exists
40f240c1cbdb: Layer already exists
0b3fbb980e2d: Layer already exists
ebfc3a74f160: Layer already exists
031458dc7254: Layer already exists
12e469267d21: Layer already exists
e75005ca29bf: Pushed
96.0: digest: sha256:ff865acd292409f3b5bf3c14494a6016a45d5ea831e5260304007a2b83e21189 size: 7328
[2016-05-16 15:09:40,483, INFO] Fetching cluster endpoint and auth data.
kubeconfig entry generated for website-cluster.
[2016-05-16 15:10:18,823, INFO] Created website-8c10af72294bdfc4d2d6a0e680e84f09
Scaling up website-8c10af72294bdfc4d2d6a0e680e84f09 from 0 to 1, scaling down website from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
Scaling website-8c10af72294bdfc4d2d6a0e680e84f09 up to 1
Scaling website down to 0
Update succeeded. Deleting old controller: website
Renaming website-8c10af72294bdfc4d2d6a0e680e84f09 to website
replicationcontroller "website" rolling updated

This all looks good, but we were getting random 404s on the new page after the update completed. When I ran kubectl get pods, I discovered that I had three pods running instead of the expected two:

NAME                                                     READY     STATUS    RESTARTS   AGE
website-8c10af72294bdfc4d2d6a0e680e84f09-iwfjo           1/1       Running   0          1d
website-keys9                                            1/1       Running   0          1d
website-staging-34caf57c958848415375d54214d98b8a-yo4sp   1/1       Running   0          3d

Using the kubectl describe pod command, I determined that pod website-8c10af72294bdfc4d2d6a0e680e84f09-iwfjo is running the new version (96.0) while pod website-keys9 is running the old version (95.0). We are getting the 404s because incoming requests are randomly served by the old version of the website. When I manually delete the pod running the old version, the 404s go away.
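
For reference, the per-pod check was along these lines (a sketch; the pod names come from the listing above, and the grep is just a convenience to pull out the image line):

kubectl describe pod website-8c10af72294bdfc4d2d6a0e680e84f09-iwfjo | grep "Image:"
kubectl describe pod website-keys9 | grep "Image:"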

Would anyone know under what circumstances the rolling-update would not delete the pod running the old version of the website? Is there something I need to change in the YAML or the commands in order to ensure that the pod running the old version is always deleted?

Appreciate any help or advice with this.

  • Suggest using "Deployments" (http://kubernetes.io/docs/user-guide/deployments/) instead of rolling-update. Deployments push the logic of orchestrating the update to the server side, which is generally more robust. Without your kube-controller-manager logs I can't help debug further, though. – beeps May 19 '16 at 19:20

1 Answer


This is Kubernetes bug #27721. But even if it wasn't, you'd still have a moment when your user traffic is being delivered to both the old and new pods. That's fine for most applications, but in your case it's undesirable because it leads to unexpected 404s. I suggest that you create the new pod with a label set that's different from the old one, for example by putting the image version in a label. Then you can update the service to select the new label; this will quickly (not atomically, but quickly) switch all the traffic from the old service backend to the new one.
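
A minimal sketch of that approach, assuming a Service named website fronts these pods and already carries a version label in its selector (e.g. version: "95.0"); the Service name and the label values are illustrative, not taken from the question:

# Pod template (and selector) labels in the new controller, e.g. website-96-0,
# differing from the old pods only by the version label:
labels:
  project: "website-prod"
  tier: "front-end"
  name: "website"
  version: "96.0"

# Once the new pod is Ready, repoint the Service at the new label set:
kubectl patch service website \
  -p '{"spec":{"selector":{"project":"website-prod","tier":"front-end","name":"website","version":"96.0"}}}'

# Then remove the old controller and its pod:
kubectl delete rc website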

But it's probably easier to switch to using Deployments.
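
For completeness, a minimal Deployment sketch built from the manifest in the question (the apiVersion is the one current at the time of writing; newer clusters use apps/v1, which also requires an explicit spec.selector, and the rolling-update parameters here are illustrative):

apiVersion: extensions/v1beta1  # apps/v1 on newer clusters
kind: Deployment
metadata:
  name: "website"
  labels:
    project: "website-prod"
    tier: "front-end"
    name: "website"
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # bring the new pod up first...
      maxUnavailable: 0   # ...and only then take the old one down
  template:
    metadata:
      labels:
        project: "website-prod"
        tier: "front-end"
        name: "website"
    spec:
      containers:
      - name: "website"
        image: "us.gcr.io/skywatch-app/website:96.0"
        ports:
        - name: "http"
          containerPort: 80
        command: ["nginx", "-g", "daemon off;"]
        livenessProbe:
          httpGet:
            path: "/"
            port: 80
          initialDelaySeconds: 60
          timeoutSeconds: 3

Subsequent releases then reduce to something like the following (the 97.0 tag is just a stand-in for the next version):

kubectl set image deployment/website website=us.gcr.io/skywatch-app/website:97.0
kubectl rollout status deployment/website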

aecolley