I use MAAS 2.4.2 to administer a cluster of computers at work. Yesterday I had to reboot an already-deployed node to fix a driver issue, and it never came back up. I was able to IPMI into the machine and found that it was stuck in the boot process. Rebooting again did not help. Thinking the problem was related to my driver changes, I tried to enter MAAS rescue mode, but the ephemeral rescue image got stuck in the same place during startup. On a whim I tried releasing and re-deploying the machine, with the same result: the ephemeral image used for deployment starts booting fine and then gets stuck. Here is what the computer looks like when trying to deploy:
To me it feels like the node is having trouble talking back to the MAAS server and gets stuck there. I don't know whether that should cause booting to hang indefinitely, but if you look at the screenshot, a lot of steps don't have an [ OK ] next to them, and they tend to be network-related.
Interestingly, since my deployment failed, the OS that was installed on the computer's SSD before this whole mess started is still there. If I skip the PXE boot at startup and boot straight off the SSD, I can access Ubuntu's rescue mode. There I can run /etc/init.d/networking start manually, and I do get a functional network connection, including DNS. The point is that networking seems healthy on the node, so if my theory about the node not being able to talk to MAAS is correct, it means MAAS itself is being uncommunicative.
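One way to test that theory from the installed OS would be a small TCP reachability check against the MAAS server. This is only a rough sketch: the hostname is a placeholder, and port 5240 is an assumption about where the MAAS region API listens, so both would need adjusting to the actual setup. The helper names `can_connect` and `check_ports` are made up for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and opens the TCP connection;
        # DNS failures, refusals, and timeouts all surface as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True/False depending on whether it accepted a connection."""
    return {port: can_connect(host, port, timeout) for port in ports}


# Example call (placeholder hostname, assumed port):
# print(check_ports("maas.example.com", [5240]))
```

A False result would only show that a TCP connection could not be opened from the node. It would not by itself explain why the ephemeral image hangs, and PXE/TFTP traffic uses UDP, which this check does not cover.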
I'm pretty stuck here, so any ideas would be amazing.
Thanks!