I will attach the minimized test case below. However, it is a simple Dockerfile that has these lines:
VOLUME ["/sys/fs/cgroup"]
CMD ["/lib/systemd/systemd"]
It is Debian:buster-slim based image, and runs systemd inside the container. Effectively, I used to run the container like this:
$ docker run --name any --tmpfs /run \
--tmpfs /run/lock --tmpfs /tmp \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro -it image_name
It used to work fine before I upgraded a bunch of host Linux packages. The host kernel/systemd now seems to default cgroup v2. Before, it was cgroup. It stopped working. However, if I give the kernel option so that the host uses cgroup, then it works again.
Without giving the kernel option, the fix was to add --cgroupns=host
to docker run
besides mounting /sys/fs/cgroup
as read-write (:rw
in place of :ro
).
I'd like to avoid forcing the users to give the kernel option. Although I am far from an expert, forcing the host namespace for a docker container does not sound right to me.
I am trying to understand why this is happening, and figure out what should be done. My goal is to run systemd inside a docker, where the host follows cgroup v2.
Here's the error I am seeing:
$ docker run --name any --tmpfs /run --tmpfs /run/lock --tmpfs /tmp \
-v /sys/fs/cgroup:/sys/fs/cgroup:rw -it image_name
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.
Welcome to Debian GNU/Linux 10 (buster)!
Set hostname to <5e089ab33b12>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
It does not look right but especially this line seems suspicous:
Failed to create /init.scope control group: Read-only file system
It seems like there should have been something before /init.scope
. That was why I reviewed the docker run
options, and tried the --cgroupsns
option. If I add the --cgroupns=host
, it works. If I mount /sys/fs/cgroup
as read-only, then it fails with a different error, and the corresponding line looks like this:
Failed to create /system.slice/docker-0be34b8ec5806b0760093e39dea35f4305262d276ecc5047a5f0ff43871ed6d0.scope/init.scope control group: Read-only file system
To me, it is like the docker daemon/engine fails to configure XXX.slice or something like that for the container. I assume that docker may be to some extend responsible for giving the namespace but something is not going well. However, I can't be so sure at all. What would be the issue/fix?
The Dockerfile I used for this experiment is as follows:
FROM debian:buster-slim
ENV container docker
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive
USER root
WORKDIR /root
RUN set -x
RUN apt-get update -y \
&& apt-get install --no-install-recommends -y systemd \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
&& rm -f /var/run/nologin
RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
/etc/systemd/system/*.wants/* \
/lib/systemd/system/local-fs.target.wants/* \
/lib/systemd/system/sockets.target.wants/*udev* \
/lib/systemd/system/sockets.target.wants/*initctl* \
/lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
/lib/systemd/system/systemd-update-utmp*
VOLUME [ "/sys/fs/cgroup" ]
CMD ["/lib/systemd/systemd"]
I am using Debian. The docker version is 20.10.3 or so. Google search told me that docker supports cgroup v2 as of 20.10 but I don't actually understand what that "support" means.