
The command being run inside the containers is:

echo never | tee /sys/kernel/mm/transparent_hugepage/enabled

Both containers run as privileged, but inside the Kubernetes Docker container the command fails with the error:

tee: /sys/kernel/mm/transparent_hugepage/enabled: Read-only file system

whereas under plain docker run -it --privileged alpine /bin/sh the same command works fine.

I have used docker inspect on both the k8s and non-k8s containers to verify their privileged status, and I don't see anything else listed that should cause this problem. I've run diff between both outputs and then used docker run with modifications to try to reproduce the problem in plain Docker, but failed (it keeps working). Any idea why the Kubernetes Docker container fails while the plain Docker container succeeds?
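
For reference, this is roughly the comparison I ran (the container IDs and output file names below are placeholders):

# inspect both containers and diff their configurations
docker inspect <k8s-container-id> > k8s.json
docker inspect <plain-container-id> > plain.json
diff k8s.json plain.json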

This is reproducible with the pod definition here:

apiVersion: v1
kind: Pod
metadata:
  name: sys-fs-edit
spec:
  containers:
  - image: alpine
    command:
    - /bin/sh
    args:
      - -c
      - echo never | tee /sys/kernel/mm/transparent_hugepage/enabled && sysctl -w net.core.somaxconn=8192 vm.overcommit_memory=1 && sleep 9999999d
    imagePullPolicy: Always
    name: sysctl-buddy
    securityContext:
      privileged: true

Workaround

While I still don't know the cause of the discrepancy, the problem can be mitigated by mounting the host's /sys into the container read-write via a hostPath volume.

apiVersion: v1
kind: Pod
metadata:
  name: sys-fs-edit
spec:
  containers:
  - image: alpine
    command:
    - /bin/sh
    args:
      - -c
      - echo never | tee /sys/kernel/mm/transparent_hugepage/enabled && sysctl -w net.core.somaxconn=8192 vm.overcommit_memory=1 && sleep 9999999d
    imagePullPolicy: Always
    name: sysctl-buddy
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /sys
      name: sys
      readOnly: false
  volumes:
  - hostPath:
      path: /sys
    name: sys
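
With that hostPath mount in place, the write goes through. A quick way to double-check from outside the pod (pod and container names as defined above):

kubectl exec sys-fs-edit -c sysctl-buddy -- cat /sys/kernel/mm/transparent_hugepage/enabled
# expected to print something like: always madvise [never]
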
chrishiestand

2 Answers


On Kubernetes this works a bit differently. Setting privileged: true in a container's securityContext is not enough to be able to modify arbitrary sysctls from that container.

Take a look at the section of the official Kubernetes docs that describes Using sysctls in a Kubernetes Cluster. As you can read there:

Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod

  • must not have any influence on any other pod on the node
  • must not allow to harm the node's health
  • must not allow to gain CPU or memory resources outside of the resource limits of a pod.

By far, most of the namespaced sysctls are not necessarily considered safe. The following sysctls are supported in the safe set:

  • kernel.shm_rmid_forced,
  • net.ipv4.ip_local_port_range,
  • net.ipv4.tcp_syncookies,
  • net.ipv4.ping_group_range (since Kubernetes 1.18).

So in short, there are safe and unsafe sysctls. Most of them are considered unsafe, even many of those that are namespaced. Unsafe sysctls additionally need to be enabled by the cluster admin on a node-by-node basis:

All safe sysctls are enabled by default.

All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.

With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet; for example:

kubelet --allowed-unsafe-sysctls \
  'kernel.msg*,net.core.somaxconn' ...

So you cannot simply set arbitrary sysctls, even from a privileged container running in your Kubernetes cluster.
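
For completeness, once an unsafe but namespaced sysctl such as net.core.somaxconn has been allowed on the node's kubelet as shown above, a pod can request it declaratively via its securityContext instead of writing to it from a shell. A minimal sketch (pod name is illustrative); note that this mechanism only covers namespaced sysctls, so it won't help with the non-namespaced vm.overcommit_memory or with the transparent_hugepage file under /sys:

apiVersion: v1
kind: Pod
metadata:
  name: somaxconn-example
spec:
  securityContext:
    sysctls:
    - name: net.core.somaxconn   # unsafe sysctl, requires --allowed-unsafe-sysctls on the node's kubelet
      value: "8192"
  containers:
  - image: alpine
    name: sysctl-buddy
    command: ["sleep", "9999999"]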

mario

The sysctl you're trying to set applies to the entire host, not to a single container. It is not possible to set it within an unprivileged container, which is why you can't do it within Kubernetes, but can do so in a privileged Docker container.

If you need this setting to run particular containers, you should set it on the hosts of all nodes in the cluster, not in container or pod definitions.
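
For instance, a rough sketch of what applying it on each node might look like (run as root on the host itself; how you persist these across reboots, e.g. via /etc/sysctl.d/ or a boot-time unit, depends on the distro):

# on the node, not inside a pod or container
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# the sysctls from the question; these apply host-wide
sysctl -w net.core.somaxconn=8192
sysctl -w vm.overcommit_memory=1
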

Michael Hampton
  • Notice the "securityContext" in the pod definition - the kubernetes container runs as privileged too. – chrishiestand Aug 25 '16 at 02:12
  • Hm, didn't see that the first time. Even so, this is still something you need to set on each node in the cluster, not in a pod. – Michael Hampton Aug 25 '16 at 02:13
  • I'd hope that to not be the case. And really what I'm looking for is why the command runs fine in a privileged plain docker container and not in a privileged kubernetes docker container. – chrishiestand Aug 25 '16 at 02:15
  • You don't seem to understand. This sysctl applies to the entire container host, not to individual containers. Even if you manage to apply it within a container, you'll affect other things running on that node. If you really need it, it really needs to be set from the host at boot time, or it may negatively impact other running containers. – Michael Hampton Aug 25 '16 at 02:17
  • I understand that fine. Why shouldn't I be able to modify cluster hosts through a privileged container? You can already modify host OSes through various container options like rw mounts or host networking and sysctl changes. – chrishiestand Aug 25 '16 at 02:26