Firstly, i've recently taken on the management of a proxmox cluster which I have had no experience managing previously (i'm completely new to cluster management, but not too bad at linux).
pve-manager/5.1-46/ae8241d4 (running kernel: 4.13.13-6-pve)
I have 2 xen nodes which run a number of containers and VMs within them. Yesterday, a container on Xen2, which runs a mysql database, stopped responding. I was able to log in to the container via ssh and attempted to restart mysql only to receive an error along the lines that it was unable to connect to the mysql.sock. So I decided to simply shutdown the container and start it back up. I chose 'shutdown' in proxmox UI for the container, which it then shutdown. Then I clicked 'start', in which proxmox logs recorded:
CT 110 - Start ERROR: command 'systemctl start pve-container@110' failed: exit code 1
So, I've tried running the 'system start ...' via ssh. It takes a while, and then I get the following:
Job for pve-container@110.service failed because a timeout was exceeded.
See "systemctl status pve-container@110.service" and "journalctl -xe" for details.
Here is the output of 'systemctl status ...':
● pve-container@110.service - PVE LXC Container: 110
Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2018-06-07 08:35:22 BST; 43s ago
Docs: man:lxc-start
man:lxc
man:pct
Process: 1603366 ExecStart=/usr/bin/lxc-start -n 110 (code=killed, signal=TERM)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/system-pve\x2dcontainer.slice/pve-container@110.service
└─1532500 [lxc monitor] /var/lib/lxc 110
Jun 07 08:33:52 xen2 systemd[1]: Starting PVE LXC Container: 110...
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Failed with result 'timeout'.
and 'journalctl -xe':
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
-- Subject: Unit pve-container@110.service has failed
-- Defined-By: systemd
--
-- Unit pve-container@110.service has failed.
--
-- The result is failed.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Failed with result 'timeout'.
Shortly after attempting to restart the container the first time, the entire xen2 node started displaying grey questions marks along side all it's VM/containers and they lost their labels (see screenshot):
Despite this, all the other VMs/Containers within xen2 are still functioning fine. So, I've then decided to run the following commands to see what would happen:
service pvedaemon restart (nothing changed) service pveproxy restart (nothing changed) service pvestatd restart (The VMs started showing names within proxmox UI (but not containers), but this only lasted 10-15 minutes)
I'm hesitant to upgrade or restart the entire xen node due to the unknown side of configuration and what potential pitfalls may lie ahead and that its business critical to have at least something running. Furthermore, i've ran through /var/log/syslog and didn't see anything that indicated why the container crashed.
Ideally, I want to achieve: Determine why the database container crashed (110) Successfully start up the database container again Determine why the xen2 node isn't reporting data to the UI about it's VM/Containers Fix the reporting data in the UI for the node Again, please appreciate i'm new to proxmox, but I do know my away around linux.
Thank you for any tips/knowledge on troubleshooting this problem. If there is any other info you'd like me to share, please let me know.
Cheers, David