edit: The issue was my umask being set to 027 rather than the default of 022. See below for details.
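To make the difference between the two umask values concrete, here is a minimal sketch that creates a directory under each one and prints the resulting mode. The temporary directory and variable names are only for illustration. It assumes GNU coreutils (stat -c), as on Debian. The link to unprivileged containers is my reading rather than something established above: with 027, "other" loses all access, so a container root mapped to a high host UID may be unable to traverse directories that host root created.

```shell
# Scratch directory for the comparison (illustrative only).
tmp=$(mktemp -d)

# Subshells keep each umask change local to a single mkdir.
(umask 022; mkdir "$tmp/default")
(umask 027; mkdir "$tmp/strict")

# %a prints the permission bits in octal (GNU stat).
mode_default=$(stat -c '%a' "$tmp/default")
mode_strict=$(stat -c '%a' "$tmp/strict")

echo "umask 022 -> $mode_default"
echo "umask 027 -> $mode_strict"

rm -rf "$tmp"
```

Under 022 the directory stays world-traversable; under 027 every permission bit for "other" is removed.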
I'm experiencing a bewildering set of issues with LXC that, once triggered, affect the whole system.
When starting or stopping LXC containers, the start or stop occasionally hangs indefinitely. When this happens on startup, the container's init process is running but unkillable, even with kill -9. The container never comes online, and the only way to end the process is a system reboot.
Thing is, the system won't reboot any more either. Around the same time I noticed that update-initramfs also hangs indefinitely. After finding this:
https://unix.stackexchange.com/questions/428001/update-initramfs-hangs-on-debian-stretch
I have concluded that sync (both the utility and the system call) is indeed hanging, which causes LXC to stop working, update-initramfs to hang, and system shutdown to hang (as a sync should be done before unmounting filesystems). Once the issue occurs, calling sync (the utility) from the command line consistently hangs indefinitely. I have tried running it under strace, but the trace ends on entry into the kernel call and I can't debug further. I've monitored the caches using this but it just hovers in the <100kB range.
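For anyone trying to reproduce the observations above, this is a rough sketch of two checks. It is not necessarily the tool referred to above. The function names show_writeback and list_uninterruptible are made up for this example. The first reads the Dirty and Writeback counters from /proc/meminfo. The second lists processes in uninterruptible sleep (state D), which is the usual reason a process ignores kill -9, together with the kernel function they are waiting in. It assumes procps ps and awk are installed.

```shell
# Amount of data waiting to be written back, as the kernel reports it.
show_writeback() {
    grep -E '^(Dirty|Writeback):' /proc/meminfo
}

# Header line plus every process whose state starts with D
# (uninterruptible sleep); wchan is the kernel wait channel.
list_uninterruptible() {
    ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
}

show_writeback
list_uninterruptible
```

If a hung sync or the container's init shows up in the second list, the wchan column gives a hint about where in the kernel it is stuck.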
Since sync deals with filesystems, I expect something is wrong with the way LXC handles some filesystem. I have another identical server that does not use LXC; after comparing the output of mount, I unmounted the filesystems not present on that one, to no avail: sync continues to hang.
Now, on a clean boot, and without touching LXC, sync always works, and continues to work. For this reason, and because I'm not seeing other problems, I am positive there are no actual I/O issues. Also, when a container does start successfully, it doesn't seem to have any problems.
I have scoured the internet far and wide regarding this issue, with no success.
LXC 2.0.7-2+deb9u2 on Debian 9 (stable) with kernel 4.19.0-0.bpo.4-amd64 (although it happened with other recent kernels too), with 2 SSDs in RAID1 for / and 3 HDDs in RAID5 (mdadm) for /home. Guests are Debian 9 (stretch) or 10 (buster), running as unprivileged containers. I seem to have narrowed it down to this: the issue did not occur for privileged containers.
Example guest container config:
# Template used to create this container: /usr/share/lxc/templates/lxc-download
# Distribution configuration
lxc.include = /usr/share/lxc/config/debian.common.conf
lxc.include = /usr/share/lxc/config/debian.userns.conf
lxc.arch = linux64
# Container specific configuration
lxc.id_map = u 0 200000 100000
lxc.id_map = g 0 200000 100000
# Network configuration
#lxc.network.type = empty
lxc.network.type = veth
lxc.network.link = lxcbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:e9:4a:e7
lxc.rootfs = /var/lib/lxc/somename/rootfs
lxc.rootfs.backend = dir
lxc.utsname = somename
# Mounts
lxc.mount.entry = /var/lib/lxc/temp mnt/temp none bind 0 0
and subuid/gid mappings:
# cat /etc/s*id
root:100000:1000000000
root:100000:1000000000
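As a quick sanity check of the mapping above, the arithmetic can be sketched in shell. The numbers are copied from the lxc.id_map lines and from /etc/subuid. The helper name map_id and the variable names are hypothetical and only serve this example.

```shell
# Values copied from the container config and /etc/subuid above.
map_start=200000       # host ID that container ID 0 maps to
map_range=100000       # number of IDs mapped into the container
sub_start=100000       # first subordinate ID delegated to root
sub_count=1000000000   # number of subordinate IDs delegated to root

# Translate a container UID/GID into the host ID it is mapped to.
map_id() {
    if [ "$1" -lt 0 ] || [ "$1" -ge "$map_range" ]; then
        echo "unmapped"
        return 1
    fi
    echo $(( map_start + $1 ))
}

map_id 0       # container root on the host
map_id 1000    # a regular container user on the host

# The mapped window must lie inside the delegated subordinate range.
if [ "$map_start" -ge "$sub_start" ] &&
   [ $(( map_start + map_range )) -le $(( sub_start + sub_count )) ]; then
    echo "mapping fits inside the delegated range"
fi
```

Container root is therefore host UID 200000, an ordinary unprivileged user on the host that falls under "other" for any file owned by host root.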
Example container creation, startup, and failing stop:
# lxc-create -n test -t download
...
Distribution: debian
Release: stretch
Architecture: amd64
Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs
---
You just created a Debian stretch amd64 (20190522_05:24) container.
# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6
test STOPPED 0 - - -
# lxc-start -n test
# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6
test RUNNING 0 - - -
# lxc-attach -n test
root@test:/# ls -alh /
total 68K
drwxr-xr-x 21 root root 4.0K May 22 05:26 .
drwxr-xr-x 21 root root 4.0K May 22 05:26 ..
drwxr-xr-x 2 root root 4.0K May 22 05:26 bin
drwxr-xr-x 2 root root 4.0K Mar 28 09:12 boot
drwxr-xr-x 4 root root 400 May 22 09:26 dev
drwxr-xr-x 42 root root 4.0K May 22 09:24 etc
drwxr-xr-x 2 root root 4.0K Mar 28 09:12 home
drwxr-xr-x 9 root root 4.0K May 22 05:25 lib
drwxr-xr-x 2 root root 4.0K May 22 05:25 lib64
drwxr-xr-x 2 root root 4.0K May 22 05:25 media
drwxr-xr-x 2 root root 4.0K May 22 05:25 mnt
drwxr-xr-x 2 root root 4.0K May 22 05:25 opt
dr-xr-xr-x 225 nobody nogroup 0 May 22 09:26 proc
drwx------ 2 root root 4.0K May 22 05:25 root
drwxr-xr-x 3 root root 60 May 22 09:26 run
drwxr-xr-x 2 root root 4.0K May 22 05:26 sbin
drwxr-xr-x 2 root root 4.0K May 22 05:25 srv
dr-xr-xr-x 13 nobody nogroup 0 May 19 17:07 sys
drwxrwxrwt 2 root root 4.0K May 22 05:25 tmp
drwxr-xr-x 10 root root 4.0K May 22 05:25 usr
drwxr-xr-x 11 root root 4.0K May 22 05:25 var
root@test:/# exit
# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6
debian_buster STOPPED 0 - - -
rtorrent STOPPED 0 - - -
test RUNNING 0 - - -
# lxc-stop -n test
^C
# lxc-stop -n test
... continues to hang ...
# ^C
# sync
^C^C^Z^X^C^Z^X^C^Z^C^Z^X^C
... won't die.