
I have had a k8s cluster running for a while without issues. A few days ago it stopped coming up: when the kubelet starts, it creates a number of control-plane containers in an apparent infinite loop, all stuck in the Created state, with the following error in the Docker and kubelet logs for every created container:

Handler for POST /v1.40/containers/<ID>/start returned error: unable to find user 0: invalid argument
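
For context, this is roughly how the stuck containers and the error show up on the node (unit names assume the standard kubeadm/systemd setup):

```bash
# The control-plane containers pile up in the Created state
docker ps -a --filter "status=created"

# The same error repeats in the kubelet and docker logs
journalctl -u kubelet --no-pager | grep "unable to find user"
journalctl -u docker  --no-pager | grep "unable to find user"
```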

Background information:

  • the kubelet version was v1.19.1 when I noticed it, the cluster was still running v1.19.0
  • docker is version 19.03.12 on Debian (tested with both Linux 5.7.0-3 and 5.8.0-1 on testing)
  • I initially suspected the upgrade to kubelet v1.19.1 (from v1.19.0) but reversing the upgrade did not address the issue
  • I tried a few `docker run -it <image> sh -i` invocations with various images, including with volume mounts (`-v`) and port forwarding (`-p`) on the command line; these containers all run fine (a sketch of these sanity checks is below, after this list)
  • re-installing docker with apt purge ... && apt install ... did not resolve the problem even when deleting /var/lib/{docker,kubelet} before re-installing
  • neither did kubeadm reset ... && kubeadm init
  • docker uses the overlay2 storage-driver and the systemd cgroup-driver
  • I do not have logs from when the problem started, as I wasn't using this (development) cluster for a few days and the subsequent errors have since filled up the logs
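
Roughly the sanity checks mentioned above (image, host path and ports are just placeholders), plus how the storage and cgroup drivers were confirmed:

```bash
# Plain interactive container
docker run --rm -it busybox sh -i

# With a volume mount and a published port
docker run --rm -it -v /tmp/test:/mnt -p 8080:80 busybox sh -i

# Confirm the storage driver and cgroup driver Docker is using
docker info --format 'storage: {{.Driver}}  cgroup: {{.CgroupDriver}}'
```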

Any hints and suggestions are appreciated. The error message is unfortunately not helpful enough to figure out what or even just where in the dependency chain the problem is.

  • I'd guess someone volume mounted `/etc` then overwrote `/etc/passwd` on the host but without knowing what troubleshooting steps you have already tried, it's impossible to help you – mdaniel Sep 14 '20 at 03:52
  • @mdaniel: Thanks. That was an interesting thought but I just checked and the host's /etc/passwd is quite correct, I also just checked by logging in as root and that works too. The bit that bothers me the most is that `docker run` works fine from what I have tested but k8s including `kubeadm --init` does not. – user1885616 Sep 14 '20 at 05:01
  • Hi, does your cluster have only one node? It looks like Apiserver backing storage failed. Try to provide more detailed information regarding your cluster configuration. Check logs as instructed in [k8s troubleshooting steps](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/) docs page. – Piotr Malec Sep 14 '20 at 14:10
  • @PiotrMalec you are conflating the _docker daemon_ API with the _kubernetes_ API; there is no kubernetes version v1.40 nor does it start with a resource namespace of `containers` – mdaniel Sep 14 '20 at 16:08
  • I found a reference to [a similar sounding issue](https://github.com/portainer/portainer/issues/4319#issuecomment-690204016) and that user was similarly on Debian, so you may want to look at their steps and see if they apply to you – mdaniel Sep 14 '20 at 17:04
  • Thanks @mdaniel ! That certainly sounds like it. I'll do some investigating and decide on an alternative CRI to the packaged one. I'll leave the question open here in case someone can point out the root cause. – user1885616 Sep 15 '20 at 05:26
  • @PiotrMalec: the logs were size-limited and now 'spammed' with the error messages of the kind above, nothing helpful I could find. For anyone else looking into similar issues and trying to debug their problems: this one was a single node, bare metal cluster largely using kubeadm defaults with flannel, host paths for storage PVs. Nothing remarkable or fancy and working perfectly until the unexplained inability to run it at some point in the last few days. – user1885616 Sep 15 '20 at 05:39
  • If `kubeadm reset` and `kubeadm init` didn't work, that means there could still be data from the previous cluster in etcd. Try running `rm -rf /var/lib/etcd/*`, then perform `kubeadm reset` and `kubeadm init` again (a sketch of this sequence follows these comments). – Piotr Malec Sep 15 '20 at 14:27
  • @PiotrMalec: I have tried that too and it didn't work in my case. Replacing Debian's official docker.io package as per the issue mdaniel pointed to did the trick though. I don't have an explanation for this but I have a working cluster again :-) – user1885616 Sep 16 '20 at 12:04
  • I have added community wiki answer for better visibility of the solution to this issue. – Piotr Malec Sep 16 '20 at 12:29
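
For reference, the reset sequence suggested above looks roughly like this on a single-node cluster (destructive; the pod-network CIDR is only an example matching flannel's default):

```bash
# Tear down the existing control plane and clean up kubelet state (destructive)
kubeadm reset -f

# Remove any leftover etcd data from the previous cluster
rm -rf /var/lib/etcd/*

# Re-initialise the control plane (CIDR shown is flannel's default; adjust as needed)
kubeadm init --pod-network-cidr=10.244.0.0/16
```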

1 Answer


As @mdaniel mentioned in the comments, the issue might be connected with the Docker version installed (Debian's `docker.io` package).

This issue was solved by following the steps from [this GitHub comment](https://github.com/portainer/portainer/issues/4319#issuecomment-690204016) by user ncresswell:

> Ok, works fine on Debian 10, even with AppArmor installed and enforce policy.
>
> I installed Docker CE from the Docker official repo: https://docs.docker.com/engine/install/debian/
>
> Please do the same and revert back.. note that we won't be offering any support for Debian 11 until it's released as "stable".
>
> ![image](https://user-images.githubusercontent.com/23178133/92718216-a3be2880-f3b5-11ea-9ef4-e9af69de882b.png)
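
In practice that means removing Debian's `docker.io` package and installing `docker-ce` from Docker's own apt repository. A condensed sketch of those steps (the keyring path and the "buster" codename below are assumptions; follow the linked install docs for the exact current commands):

```bash
# Remove the Debian-packaged Docker
apt-get purge -y docker.io

# Prerequisites and Docker's signing key
apt-get update && apt-get install -y ca-certificates curl gnupg
curl -fsSL https://download.docker.com/linux/debian/gpg \
  | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Add the Docker CE repository (codename assumed to be "buster" for Debian 10) and install
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian buster stable" \
  > /etc/apt/sources.list.d/docker.list
apt-get update && apt-get install -y docker-ce docker-ce-cli containerd.io
```

After switching packages, the asker reported that `kubeadm init` worked again (see the comments above).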


Piotr Malec
    For the record, someone else also experienced this and reported it to the Debian maintainer here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970525 – user1885616 Sep 19 '20 at 12:20