
I have a serious problem. For the last two days, some of my Proxmox-based LXC containers have stopped responding unless I reboot the node.

This always happens at the same time of night (I suspect something runs on one of the containers that causes heavy load).

The problem is that top/atop/htop show nothing unusual. The Proxmox node still accepts SSH connections, but 2 of my 5 nodes are not really responsive (I can log in via SSH, but I cannot enter any commands).
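When top shows no load but the shell hangs, the culprit is often processes stuck in uninterruptible sleep (state D), which block inside the kernel and consume no CPU. A minimal sketch of how to spot them (the column widths are just formatting choices):

```shell
# List the header plus any processes in uninterruptible sleep (D).
# wchan shows the kernel function the process is blocked in.
ps -eo pid,state,wchan:30,comm | awk 'NR == 1 || $2 == "D"'
```

On a healthy node this prints only the header line; during the hang, processes blocked in lxcfs or filesystem code would be expected to show up here.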

I also have to do a hard reboot, because a normal reboot does not work (the LXC containers still have not stopped after 40 minutes).
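Before resorting to a hard reboot, it can be worth asking the kernel to log the stack traces of all blocked tasks; this often reveals what the stuck containers are waiting on. A hedged sketch (requires root on the node, and assumes the magic-sysrq interface is enabled):

```shell
# Dump stack traces of all blocked (D-state) tasks into the kernel log,
# then read the tail of the log. Harmless no-op if sysrq is unavailable.
if [ -w /proc/sysrq-trigger ]; then
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 50 || true   # look for "blocked for more than 120 seconds"
fi
```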

This is my PVE version:

pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-15 (running version: 4.1-15/8cd55b52)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-2.6.32-43-pve: 2.6.32-166
pve-kernel-4.2.8-1-pve: 4.2.8-39
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-33
qemu-server: 4.0-62
pve-firmware: 1.1-7
libpve-common-perl: 4.0-49
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-42
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-46
pve-firewall: 2.0-18
pve-ha-manager: 1.0-24
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1

Unfortunately, the logs do not show anything unusual.

Syslog:

Mar 15 04:32:31 server pvedaemon[4061]: worker exit
Mar 15 04:32:31 server pvedaemon[1192]: worker 4061 finished
Mar 15 04:32:31 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:32:31 server pvedaemon[1192]: worker 24675 started
Mar 15 04:33:05 server pvedaemon[6601]: worker exit
Mar 15 04:33:05 server pvedaemon[1192]: worker 6601 finished
Mar 15 04:33:05 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:33:05 server pvedaemon[1192]: worker 25112 started
Mar 15 04:34:57 server systemd-timesyncd[559]: interval/delta/delay/jitter/drift 2048s/+0.000s/0.021s/0.001s/+1ppm
Mar 15 04:36:08 server pveproxy[17238]: worker exit
Mar 15 04:36:08 server pveproxy[1212]: worker 17238 finished
Mar 15 04:36:08 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:36:08 server pveproxy[1212]: worker 28231 started
Mar 15 04:39:48 server pvedaemon[572]: worker exit
Mar 15 04:39:48 server pvedaemon[1192]: worker 572 finished
Mar 15 04:39:48 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:39:48 server pvedaemon[1192]: worker 31498 started
Mar 15 04:40:40 server pveproxy[31690]: worker exit
Mar 15 04:40:40 server pveproxy[1212]: worker 31690 finished
Mar 15 04:40:40 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:40:40 server pveproxy[1212]: worker 32442 started
Mar 15 04:45:02 server pvedaemon[25112]: <root@pam> successful auth for user 'root@pam'
Mar 15 04:46:27 server pveproxy[28231]: worker exit
Mar 15 04:46:27 server pveproxy[1212]: worker 28231 finished
Mar 15 04:46:27 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:46:27 server pveproxy[1212]: worker 5082 started
Mar 15 04:48:45 server pveproxy[17122]: worker exit
Mar 15 04:48:45 server pveproxy[1212]: worker 17122 finished
Mar 15 04:48:45 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:48:45 server pveproxy[1212]: worker 6924 started
Mar 15 04:51:28 server pvedaemon[25112]: worker exit
Mar 15 04:51:28 server pvedaemon[1192]: worker 25112 finished
Mar 15 04:51:28 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:51:28 server pvedaemon[1192]: worker 9770 started
Mar 15 04:51:38 server pveproxy[32442]: worker exit
Mar 15 04:51:38 server pveproxy[1212]: worker 32442 finished
Mar 15 04:51:38 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:51:38 server pveproxy[1212]: worker 9911 started
Mar 15 04:52:45 server pvedaemon[31498]: worker exit
Mar 15 04:52:45 server pvedaemon[1192]: worker 31498 finished
Mar 15 04:52:45 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:52:45 server pvedaemon[1192]: worker 10794 started
Mar 15 04:55:46 server pvedaemon[24675]: worker exit
Mar 15 04:55:46 server pvedaemon[1192]: worker 24675 finished
Mar 15 04:55:46 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:55:46 server pvedaemon[1192]: worker 13187 started
Mar 15 04:57:32 server rrdcached[972]: flushing old values
Mar 15 04:57:32 server rrdcached[972]: rotating journals
Mar 15 04:57:32 server rrdcached[972]: started new journal /var/lib/rrdcached/journal/rrd.journal.1458014252.151024
Mar 15 04:57:32 server rrdcached[972]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1458007052.150971
Mar 15 04:57:40 server puppet-agent[14639]: Finished catalog run in 0.53 seconds
MyFault

2 Answers


lxcfs 2.0.0-pve1 had a bug that let containers hang in the kernel.

I resolved the issue by updating to lxcfs 2.0.0-pve2. Have a look here:

https://forum.proxmox.com/threads/proxmox-4-0-lxc-containers-network-unstable.26353/
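A quick way to check whether a node is still on the buggy build is a version-aware comparison; the sketch below hardcodes the version from the question as a placeholder, but on the node itself it would come from `pveversion -v` or `dpkg-query -W lxcfs`:

```shell
# Compare the installed lxcfs version against the fixed build using
# a version-aware sort (sorts 2.0.0-pve1 before 2.0.0-pve2).
installed="2.0.0-pve1"   # placeholder: the buggy version from the question
fixed="2.0.0-pve2"
lowest=$(printf '%s\n' "$installed" "$fixed" | sort -V | head -n1)
if [ "$lowest" = "$installed" ] && [ "$installed" != "$fixed" ]; then
    echo "lxcfs needs upgrading"
else
    echo "lxcfs is patched"
fi
```

With the placeholder version above, this prints "lxcfs needs upgrading"; the fix itself would then be a normal package upgrade on the node.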

BVBMedia
0

We run the same kernel as you, and our LXC containers also hang completely. The KVM machines on the same host stay up. What could cause this, and how can we get the LXC containers responding again without rebooting the host?

Even the following command on the host hangs and never proceeds:

pct enter ID

BVB Media