Situation:
On an integrated All-In-One ESXi/ZFS-Storage server, where the storage VM uses bare metal disks and exports the filesystems via NFS (or iSCSI) back to ESXi, which uses it as pool storage for the other VMs, there exists a problem when time comes to update the storage VM, because numerous running VMs depend on it and will time out with NFS.AllPathsDown
or similar causes, which equals pulling the drive from a normal server without shutting it down.
Of course it is possible to shut down all VMs, but this becomes very time-consuming and also tedious (or has to be scripted). Moving the VMs to another host may be possible, but takes even longer and may not be possible in smaller setups, where a single machine is plenty. Suspending the VMs could work, but is also quite slow (sometimes slower than shutdown).
Possible solutions...
- A simple yet efficient solution seems to be to stop the VM processes via the CLI with
kill -STOP [pid]
after finding it withps -c | grep -v grep | grep [vmname]
, do the upgrade/restart of the storage VM, then continue the execution of the VM processes by usingkill -CONT [pid]
. - A similar solution might be the combination of fast reboot (available on Solaris/illumos via
reboot -f
or on Linux viakexec-reboot
) which takes seconds instead of minutes, and the NFS timeout in ESXi (on loss of NFS connection all I/O is suspended for I think 120 seconds, until it is assumed the storage is down permanently). If the reboot time is inside the ESXi NFS window, it should in theory be comparable to a disk that does not respond for a minute because of error correction, but then resumes normal operation.
... and problems?
Now, my questions are:
- Which method is preferable, or are they equally good/bad?
- What are unintended side effects in special cases like databases, Active Directory controllers, machines with users running jobs etc.?
- Where should one be careful? A comment on the linked blog mentions timekeeping problems may arise when the CPU is frozen, for example.
Edit: To clarify on the scope of this question
After receiving the first two answers, I think I have worded my question not clear enough or left out too much information for sake of brevity. I am aware of the following:
- It is not supported by VMware or anyone else, dont do this!: I did not mention this because the first link already tells it and also I would not have asked if this machine was managed by VMware support. It is a purely technical question, support stuff is out of scope here.
- If designing a new system today, some things could be done in other ways: Correct, but as the system has been running stable for some years, I prefer not to throw the baby out with the bathwater and start completely new, introducing new problems.
- Buy hardware X and you won't have this problem! True, I could buy 2 or 3 additional servers with similar cost and have a full HA setup. I know how this is done, it is not that hard. But this is not the situation here. If this was a viable solution in my case, I would not have asked the question in the first place.
- Just accept the delay of shut down and reboot: I know that this is a possibility, as it is what I'm doing currently. I have asked the question to either find better alternatives within the current setup, or to learn of substantiated technical reasons some of the methods outlined will have problems - "it is unpredictable" without any explanation why is not a substantiated answer in my book.
Therefore, to rephrase the questions:
- Which of those two methods is technically preferable and why, assuming the setup is fixed and the goal is to reduce downtime without introducing any negative side effects to data integrity?
- What are unintended side effects in special cases like
- active/idling/quiescent databases with users and/or applications accessing them
- Active Directory controllers on this machine and/or on other machines (on the same domain)
- general purpose machines idling or with users running jobs or running automated maintenance jobs like backups
- appliances like network monitoring or routers
- network time with or without using NTP on this server or on another or on multiple servers
- In which special cases is it advisable to not do this, because the downsides are greater than the advantage? Where should one be careful? A comment on the linked blog mentions timekeeping problems may arise when the CPU is frozen, for example, but does not provide any reasoning, proof or test results.
- What are the factual, technical differences between those two solutions and
- Stalled execution of VM processes because of CPU overload on the host
- Increased Wait time on disk I/O because of faulty disks or controllers, assuming it is below the NFS threshold?