3

Last week, after upgrading our GKE cluster to Kubernetes 1.13.6-gke.13, all of the nodes in the cluster started failing due to high CPU usage. It's the system software on the nodes themselves (systemd, dbus-daemon, kubelet, etc.) that uses up all the CPU, not our pods.

This is what top shows when we SSH into a node:

top - 10:11:27 up 5 days, 22 min,  1 user,  load average: 23.71, 21.90, 20.32
Tasks: 858 total,   3 running, 854 sleeping,   0 stopped,   1 zombie
%Cpu(s): 33.5 us, 30.9 sy,  0.0 ni, 35.5 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  30157.0 total,  14125.6 free,   4771.2 used,  11260.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  24762.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 root      20   0  114636  22564   4928 R  65.2   0.1   5254:53 systemd
    356 message+  20   0  110104  98784   2916 S  33.8   0.3   2705:49 dbus-daemon
   1236 root      20   0 3174876 147328  57464 S  22.0   0.5   1541:29 kubelet
    366 root      20   0   12020   3864   3396 S  21.6   0.0   1705:21 systemd-logind
   1161 root      20   0 5371428 586832  26756 S  18.7   1.9   1848:18 dockerd
    432 root      20   0 5585144  49040  13420 S  11.1   0.2 848:54.06 containerd
  23797 root      20   0  994620   8784   6088 S   3.0   0.0  96:58.79 containerd-shim
  45359 root      20   0  994620   8968   5600 S   3.0   0.0 100:28.46 containerd-shim
  35913 root      20   0 1068352   8192   5600 S   2.3   0.0 104:54.12 containerd-shim
  10806 root      20   0  994620   8908   5596 S   2.0   0.0 102:57.45 containerd-shim
  15378 root      20   0  994620   9084   5600 S   2.0   0.0 102:24.08 containerd-shim
  33141 root      20   0  994620   8856   5848 S   2.0   0.0  95:26.89 containerd-shim
  34299 root      20   0  994620   8824   5848 S   2.0   0.0  90:55.28 containerd-shim
  48411 root      20   0  994620   9488   6216 S   2.0   0.0  95:38.56 containerd-shim
1824641 root      20   0 1068352   6836   5716 S   2.0   0.0  65:45.81 containerd-shim
  10257 root      20   0  994620   9404   5596 S   1.6   0.0 101:10.45 containerd-shim
  15400 root      20   0 1068352   8916   6216 S   1.6   0.0  97:47.99 containerd-shim
  22895 root      20   0 1068352   8408   5472 S   1.6   0.0 102:55.97 containerd-shim
  29969 root      20   0  994620   9124   5600 S   1.6   0.0  98:32.32 containerd-shim
  34720 root      20   0  994620   8444   5472 S   1.6   0.0  97:23.98 containerd-shim
  10073 root      20   0 1068352   9796   6152 S   1.3   0.0 100:54.30 containerd-shim

After this began, pods on these nodes were CPU-starved, performed poorly, and got killed due to failing liveness checks.

To try to resolve the issue, we recreated all the nodes: we created a new pool with equivalent resources and migrated all pods over by scaling the old pool down to 0 nodes. (This was difficult because at least two of the old nodes refused to shut down, even after a long time; in the end we had to shut down the underlying VMs to kill those nodes.) It didn't help.
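Roughly what we did (a sketch; pool names and sizes are illustrative, and exact gcloud flags should be checked against gcloud --help for your SDK version):

gcloud container node-pools create pool-2 --cluster=cluster0 \
    --machine-type=n1-standard-8 --num-nodes=6      # same resources as the old pool

# drain each node in the old pool so its pods reschedule onto the new one
kubectl drain <node-name> --ignore-daemonsets --delete-local-data

# then scale the old pool down to zero
gcloud container clusters resize cluster0 --node-pool=pool-1 --num-nodes=0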

We also tried switching the nodes from Container-Optimized OS to regular Ubuntu. That didn't help either.

Next we created yet another pool, this time with twice as much CPU. That helped but didn't fix the underlying problem: some nodes still showed extremely high CPU usage for systemd, dbus-daemon, kubelet, etc.

Now we are running with tons of extra CPU per node, which masks the problem: there's enough headroom to run the actual pods alongside the misbehaving system services.

But the system as a whole is still wasting a lot of resources (and money), so we really need to figure out what's actually going on. What is systemd doing that's so CPU intensive? What is dbus-daemon up to? We checked various logs (see below) but there are no obvious errors.

How do we find out what's actually wrong here?
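Presumably standard tools like systemd-cgtop (ships with systemd) and pidstat (from the sysstat package) would only confirm what top already shows, but for reference this is the kind of thing that can be run on a node:

sudo systemd-cgtop -c -d 5 -n 3                                  # per-cgroup CPU usage, ordered by CPU, 3 samples 5s apart
sudo pidstat -u -p 1,"$(pidof dbus-daemon | tr ' ' ',')" 5 3     # CPU for systemd (PID 1) and dbus-daemon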


Snippet from journalctl -u kubelet:

Jul 04 05:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:49:34.849808    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 05:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:54:34.850598    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 05:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:59:34.851797    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:04:34.858344    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:09:34.859626    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:14:34.861142    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:19:34.862185    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:24:34.863160    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:29:34.864156    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:34:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:34:34.865041    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:39:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:39:34.866044    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:44:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:44:34.866969    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:49:34.867979    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:54:34.869429    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:59:34.870359    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:04:34.871190    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:09:34.872063    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:14:34.873240    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:19:34.874123    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:24:34.875010    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:29:34.876612    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:34:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:34:34.877420    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:39:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:39:34.878368    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:44:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:44:34.879221    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:49:34.880239    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:54:34.881172    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:59:34.881868    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:04:34.882653    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:09:34.883432    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:14:34.884175    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:19:34.885043    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:24:34.885845    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:29:34.886690    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service

Output from strace -c -p `pidof systemd`:

strace: Process 1 attached
strace: Process 1 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.20    0.109101           5     22827           lstat
 17.45    0.078654           9      8610           sendmsg
 15.51    0.069914           7     10549           read
 10.96    0.049406           3     17310           getpid
  6.45    0.029075           8      3806           openat
  6.08    0.027385           7      3783           readlinkat
  5.76    0.025945           3      7579           fstat
  4.47    0.020167           4      5700           getrandom
  3.62    0.016301           4      3806           close
  3.14    0.014133           7      1892           access
  2.28    0.010278           5      1924        11 stat
  0.03    0.000145           4        33           epoll_wait
  0.02    0.000089           4        22           readlink
  0.01    0.000029           3        11           prctl
  0.01    0.000029           3        11           clock_gettime
  0.01    0.000027           2        11           getgid
  0.01    0.000026           2        11           geteuid
  0.00    0.000020           2        11           getuid
  0.00    0.000020           2        11           getegid
------ ----------- ----------- --------- --------- ----------------
100.00    0.450744                 87907        11 total

What's systemd lstat'ing? Here are some relevant-looking snippets taken at random from the full strace output:

...
stat("/dev/sds", {st_mode=S_IFBLK|0660, st_rdev=makedev(65, 32), ...}) = 0
readlinkat(AT_FDCWD, "/sys/dev/block/65:32", "../../devices/pci0000:00/0000:00"..., 99) = 83
lstat("/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
lstat("/sys/dev", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/dev/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0/block/sds", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
access("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0/block/sds/uevent", F_OK) = 0
openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0/block/sds/uevent", O_RDONLY|O_CLOEXEC) = 16

...

lstat("/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
lstat("/sys/dev", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/dev/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18/0:0:18:0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18/0:0:18:0/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18/0:0:18:0/block/sdr", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
access("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18/0:0:18:0/block/sdr/uevent", F_OK) = 0

Output from dbus-monitor --system --profile | head -n 20:

dbus-monitor: unable to enable new-style monitoring: org.freedesktop.DBus.Error.AccessDenied: "Rejected send message, 1 matched rules; type="method_call", sender=":1.165" (uid=5004 pid=769854 comm="dbus-monitor --system --profile ") interface="org.freedesktop.DBus.Monitoring" member="BecomeMonitor" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (bus)". Falling back to eavesdropping.
#type    timestamp    serial    sender    destination    path    interface    member
#                    in_reply_to
sig    1562263380.765023    2    org.freedesktop.DBus    :1.165    /org/freedesktop/DBus    org.freedesktop.DBus    NameAcquired
sig    1562263380.953812    132870362    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d5d1be6e08bfe7552d6f9ee50a943eca88dd0dd749ec248594aa0be91879a2cdb_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957890    132870363    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d5d1be6e08bfe7552d6f9ee50a943eca88dd0dd749ec248594aa0be91879a2cdb_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957912    132870364    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d539cd8b8367913182e438aa1b3b05714c8f3f131e20bcadabdeb850c375a8125_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957918    132870365    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d539cd8b8367913182e438aa1b3b05714c8f3f131e20bcadabdeb850c375a8125_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957923    132870366    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2dcontainers_2d4c04ee8d1bf693ff2c9300b198b2b47bbf2c240265af5b9edc1f07b6cbd0c1ce_2dmounts_2dshm_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958014    132870367    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2dcontainers_2d4c04ee8d1bf693ff2c9300b198b2b47bbf2c240265af5b9edc1f07b6cbd0c1ce_2dmounts_2dshm_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958020    132870368    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d872dd2c0a63a9f3b5c9c5e4972e06fcf89d28b4c7f59aea7ea4d38f5a6bf0db6_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958025    132870369    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d872dd2c0a63a9f3b5c9c5e4972e06fcf89d28b4c7f59aea7ea4d38f5a6bf0db6_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958029    132870370    :1.157    <none>    /org/freedesktop/systemd1/unit/home_2dkubernetes_2dcontainerized_5fmounter_2drootfs_2dvar_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958111    132870371    :1.157    <none>    /org/freedesktop/systemd1/unit/home_2dkubernetes_2dcontainerized_5fmounter_2drootfs_2dvar_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958117    132870372    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958121    132870373    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958126    132870374    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d667d713fe45ea0af609c85fbfd3aafbca9494574fe10509bda8cd3d13d1e6b66_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958223    132870375    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d667d713fe45ea0af609c85fbfd3aafbca9494574fe10509bda8cd3d13d1e6b66_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958229    132870376    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d0a9f899f23c0965b43cae81dd04f46a78e4b1b063fa5679323146b06932474a9_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958233    132870377    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d0a9f899f23c0965b43cae81dd04f46a78e4b1b063fa5679323146b06932474a9_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958238    132870378    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d875734c124694cd54a7c26516f740cedcb853a18ff5805a89669747c1188b65d_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
  • Both Kubelet and Docker are using `systemd` as [Cgroup Driver](https://kubernetes.io/docs/setup/production-environment/container-runtimes/#cgroup-drivers)? Take a look at [this](https://serverfault.com/q/973131/405239) question to see how to check it. – Eduardo Baitello Jul 06 '19 at 14:33
  • What are the files that systemd is lstatting? It seems to me that, since systemd is widely used and unlikely to be unstable, something else is sending messages that cause systemd to do a lot of work. Maybe something related to logins? – w00t Jul 08 '19 at 05:41
  • @EduardoBaitello that's an interesting post. It looks like both Kubelet and Docker use cgroupfs (a minimal version of the check is sketched after these comments). This isn't something we specifically configure though; it seems to be the default for Google Kubernetes Engine. [1](https://gist.github.com/aljungberg/33c840dadae415796c7d0291e70cae39) – Alexander Ljungberg Jul 08 '19 at 16:17
  • @w00t I added a snippet of strace output related to the lstat calls. You're right that it's probably not systemd itself that has the problem; more likely it's some kind of pathological behaviour caused by something Kubernetes wants to do with our node. systemd-logind does use tons of CPU, but I wonder whether it's a symptom or a cause (I imagine there's some infinite-loop kind of action that involves systemd, and that systemd in turn depends on systemd-logind for whatever this action is). – Alexander Ljungberg Jul 08 '19 at 16:31
  • Looks like it's enumerating devices all the time. Hmm. Maybe that's something it always does; compare the strace with that of systemd on a normal system. But it seems odd. – w00t Jul 08 '19 at 20:03
  • I wanted to get a comparison between abnormal and normal strace for systemd, but it turns out every node in the cluster now is running abnormally so I have no point of comparison. I checked on systemd from an entirely different cluster (running a different app), but systemd there was basically doing nothing. Most of the time it sat at `epoll_wait`, but once in a while it woke up to handle a watchdog message. I ran strace for a full minute and saw maybe just a dozen system calls. – Alexander Ljungberg Jul 10 '19 at 09:32
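(Regarding the cgroup-driver question in the comments above: a minimal check on a node looks roughly like this. The kubelet part assumes the driver is passed as a command-line flag; if nothing is printed, both components are on their cgroupfs defaults.)

docker info --format '{{.CgroupDriver}}'     # Docker's cgroup driver (cgroupfs by default)
ps -o args= -C kubelet | tr ' ' '\n' | grep -- '--cgroup-driver' \
    || echo "no --cgroup-driver flag: kubelet defaults to cgroupfs"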

2 Answers

6

After working with Google we finally resolved this issue. The short version is that we had to create a new node pool running GKE 1.12.8-gke.10 and transfer all our pods using persistent volumes to this pool.

Long Version

Here's something I wish I had thought of right away when debugging this; credit goes to Francisco and Lynn on Google's support team for it.

A key moment in the troubleshooting process was when we segmented our workload. At one point we dedicated a separate node pool to each kind of pod we run. This let us answer the question: is the problem specific to a certain kind of pod? We suspected that mounting activity was related to the issue, so it was of particular interest to see whether pods with persistent volumes correlated with node performance degradation.

This turned out to be true. Only nodes running pods with volumes ran hot.

The root cause

The hypothesis from Google's GKE team is that a change in Docker between versions 17.03 and 18.09 results in massive amounts of systemd activity whenever something is exec'ed in the pods. In particular, one of their engineers identified a change in runc which causes all mount units to be reloaded as part of an exec liveness probe.
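A rough way to see this on an affected node is to count the mount-unit PropertiesChanged signals on the system bus (the same signals visible in the dbus-monitor output in the question) over a short window; run as root to avoid the eavesdropping fallback:

sudo timeout 10 dbus-monitor --system --profile 2>/dev/null | grep -c '_2emount'
# counts systemd mount-unit signals seen on the bus in a 10-second window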

So the recipe for this issue has three ingredients:

  1. Docker 18 (as used by GKE 13)
  2. Many volume mounts per node
  3. The use of exec for liveness probes (or, I suppose, any exec-based probe)

We run about 40 Redis pods per node. As is customary, our Redis pods use an exec-based liveness probe, and each pod mounts a persistent volume for storage. The liveness probe ran every minute, so we had roughly 40 volumes being reloaded 40 times per minute.
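Schematically, the pattern looks like this (a minimal sketch, not our actual manifest; names, image, and probe command are illustrative, and the referenced PVC is assumed to already exist):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: redis-example                  # hypothetical name
spec:
  containers:
  - name: redis
    image: redis:4.0
    volumeMounts:
    - name: data
      mountPath: /data                 # persistent storage for Redis
    livenessProbe:
      exec:                            # the exec probe is what triggers the mount-unit reloads
        command: ["redis-cli", "ping"]
      periodSeconds: 60                # probe once per minute
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: redis-data-example    # hypothetical claim, must already exist
EOF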

Resolving the issue

To resolve the issue we isolated all the Redis load using node taints. We created a new node pool dedicated to Redis, running GKE 1.12.8-gke.10. While GKE does not allow you to downgrade a cluster, it does allow you to create new node pools running older versions. Naturally we disabled auto-upgrade for this new pool.

Then we forced all the Redis pods to migrate over to this new pool.
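On the gcloud side this looked roughly like the following (a sketch; pool, cluster, taint, and label names are illustrative, and flag names should be checked against gcloud container node-pools create --help):

gcloud container node-pools create redis-pool \
    --cluster=cluster0 \
    --node-version=1.12.8-gke.10 \
    --no-enable-autoupgrade \
    --node-taints=dedicated=redis:NoSchedule \
    --machine-type=n1-standard-8 --num-nodes=3

# The Redis pods get a toleration for dedicated=redis:NoSchedule plus a nodeSelector
# on the pool label cloud.google.com/gke-nodepool: redis-pool; deleting them then
# makes them reschedule onto the new pool:
kubectl delete pod -l app=redis        # hypothetical label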

The results were immediate: we can now run as much load with 6 CPU cores as we could with 24 before, and the situation appears to be stable.

  • I can confirm this with 1.13 and 1.14. After adding a small cluster with 1.12 and moving the Redis pods with their persistent volumes over, the CPU usage decreased instantly. – Max Oct 22 '19 at 09:47
  • I filed a support ticket with Google for this; there is an open issue in the Google issue tracker: https://issuetracker.google.com/issues/140126874 – Max Oct 22 '19 at 14:17
  • We appear to have hit the exact same issue on AWS EKS in our environment, so it doesn't appear to be Google Cloud specific, but rather something in Kubernetes. We resolved it by changing Redis healthchecks from being a shell command to a network check which instantly relieved the load. – DCKing Jun 25 '20 at 07:56
  • We ultimately did the same thing. We switched to a network based check and it works well. – Alexander Ljungberg Jul 06 '20 at 17:28
0

You need to consider two things here:

  1. To see whether anything is going wrong in your cluster, you need to check the logs.

There are two types of system components: those that run in a container and those that do not run in a container. For example:

The Kubernetes scheduler and kube-proxy run in a container. The kubelet and container runtime, for example Docker, do not run in containers. On machines with systemd, the kubelet and container runtime write to journald. If systemd is not present, they write to .log files in the /var/log directory. System components inside containers always write to the /var/log directory, bypassing the default logging mechanism.

Use, for example, journalctl -u kubelet to analyse the kubelet logs. You can also check pod logs.
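For example (unit, namespace, and pod names will differ per cluster):

journalctl -u kubelet --since "1 hour ago"      # kubelet (systemd service on the node)
journalctl -u docker  --since "1 hour ago"      # container runtime
kubectl logs -n kube-system <kube-proxy-pod>    # containerised system components
kubectl logs <pod-name> -c <container-name>     # application pod logs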

  2. If there are no errors but there is still a problem with resources, you may need to use autoscaling:
    • The Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on CPU/memory usage.
    • Vertical Pods Autoscaler (VPA) allocates more (or less) cpu or memory to existing pods.
    • Cluster Autoscaler (CA) scales your cluster nodes based on pending pods.

Use one, or a combination, of the above to scale your cluster dynamically and without manual tuning.
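For example (a sketch with hypothetical deployment, cluster, and pool names):

# HPA: scale a deployment on CPU utilisation
kubectl autoscale deployment my-app --cpu-percent=80 --min=2 --max=10

# Cluster Autoscaler: enable node autoscaling on an existing GKE node pool
gcloud container clusters update my-cluster --enable-autoscaling \
    --node-pool=my-pool --min-nodes=1 --max-nodes=10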

Please let me know if that helped.

  • I added `journalctl -u kubelet` output to the post but there's nothing very interesting there. As for scaling, our CPU requests are around 2.2 per node, while each node has 7.91 CPU allocatable. So there's plenty of available resources in theory, and we already tried both adding more nodes and bigger nodes. – Alexander Ljungberg Jul 04 '19 at 08:40
  • Without knowledge of your applications and other cluster details I am unable to give you a straight answer. Please try [this](https://medium.com/appvia/prevent-resource-starvation-of-critical-system-and-kubernetes-services-83cfeaa6ac96). If that doesn't help, you should look into your applications and/or traffic. – Wytrzymały Wiktor Jul 04 '19 at 09:42
  • I think one factor suggesting it's a low-level issue rather than something specific to a pod is that the nodes freeze while we're draining a pool, i.e. Kubernetes is unable to shut nodes down. These are nodes that aren't running any pods at all, since Kubernetes has already migrated all the pods off in preparation. – Alexander Ljungberg Jul 05 '19 at 11:33