
Situation:

On an integrated all-in-one ESXi/ZFS storage server, the storage VM uses bare-metal disks and exports its filesystems via NFS (or iSCSI) back to ESXi, which in turn uses them as the datastore for the other VMs. This creates a problem when the storage VM itself needs to be updated: the numerous running VMs depend on it and will time out with NFS.AllPathsDown or similar errors, which is the equivalent of pulling the drive from a physical server without shutting it down.

Of course it is possible to shut down all VMs first, but this is time-consuming and tedious (or has to be scripted). Migrating the VMs to another host might work, but takes even longer and may not be an option in smaller setups where a single machine is plenty. Suspending the VMs could work, but is also quite slow (sometimes slower than a full shutdown).

Possible solutions...

  1. A simple yet efficient solution seems to be to pause the VM processes from the ESXi CLI with kill -STOP [pid] after finding the PID with ps -c | grep -v grep | grep [vmname], perform the upgrade/restart of the storage VM, and then resume execution of the VM processes with kill -CONT [pid] (a scripted sketch follows after this list).
  2. A similar solution might be to combine a fast reboot of the storage VM (available on Solaris/illumos via reboot -f, or on Linux via kexec), which takes seconds instead of minutes, with the NFS timeout behaviour of ESXi (on loss of the NFS connection, all I/O is suspended for, I think, about 120 seconds before the storage is assumed to be down permanently). If the reboot completes inside that ESXi NFS window, it should in theory be comparable to a disk that does not respond for a minute because of error correction but then resumes normal operation (see the second sketch below).
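Based on the commands quoted in option 1, a scripted version on the ESXi shell might look like the sketch below. It is only a sketch under assumptions: the guest names and the /tmp path are placeholders, the PID column of ps -c may vary between ESXi builds, ps -c may list several worlds per VM, and the whole approach is unsupported by VMware.

```sh
# Sketch of option 1: freeze the dependent guests, service the storage VM, thaw them.
# Run from the ESXi shell. Guest names and paths are placeholders; do NOT include
# the storage VM itself in the list.

GUESTS="vm1 vm2 vm3"
PIDFILE=/tmp/paused_vm_pids

# Record the PIDs first, then freeze them. Verify manually that the first column
# of `ps -c` is the PID on your build, and which worlds match your VM name,
# before trusting the awk below.
: > "$PIDFILE"
for vm in $GUESTS; do
  ps -c | grep -v grep | grep "$vm" | awk '{print $1}' >> "$PIDFILE"
done
for pid in $(cat "$PIDFILE"); do kill -STOP "$pid"; done

# ... upgrade and reboot the storage VM, wait until its NFS export responds again ...

# Thaw the guests from the same recorded PIDs.
for pid in $(cat "$PIDFILE"); do kill -CONT "$pid"; done
```

Recording the PIDs before freezing avoids having to grep again while the guests are stopped.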
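For option 2, whether the fast reboot fits inside the ESXi tolerance window can be estimated from the host's NFS heartbeat and APD settings. The sketch below only reads existing ESXi advanced options and notes the fast-reboot commands for the storage VM; the kernel/initrd paths for kexec are examples, and the exact defaults vary between ESXi versions.

```sh
# Sketch for option 2: check how long ESXi tolerates an unresponsive NFS server
# before the datastore is declared down, then compare that budget with the
# storage VM's measured fast-reboot time.

esxcli system settings advanced list -o /NFS/HeartbeatFrequency
esxcli system settings advanced list -o /NFS/HeartbeatTimeout
esxcli system settings advanced list -o /NFS/HeartbeatMaxFailures
esxcli system settings advanced list -o /Misc/APDTimeout

# Rough budget: roughly HeartbeatFrequency * HeartbeatMaxFailures seconds of missed
# heartbeats before the datastore is marked unavailable, plus the APD timeout before
# guest I/O starts failing.

# On an illumos storage VM, the fast reboot that skips POST:
#   reboot -f
# On a Linux storage VM, kexec gives a similar effect (paths are examples):
#   kexec -l /boot/vmlinuz-$(uname -r) --initrd=/boot/initrd.img-$(uname -r) --reuse-cmdline
#   kexec -e
```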

... and problems?

Now, my questions are:

  1. Which method is preferable, or are they equally good/bad?
  2. What are unintended side effects in special cases like databases, Active Directory controllers, machines with users running jobs etc.?
  3. Where should one be careful? A comment on the linked blog mentions timekeeping problems may arise when the CPU is frozen, for example.

Edit: To clarify on the scope of this question

After receiving the first two answers, I think I have not worded my question clearly enough, or left out too much information for the sake of brevity. I am aware of the following:

  • "It is not supported by VMware or anyone else, don't do this!": I did not mention this because the first link already says so, and because I would not have asked if this machine were managed by VMware support. This is a purely technical question; support matters are out of scope here.
  • "If designing a new system today, some things could be done in other ways": Correct, but as the system has been running stably for some years, I prefer not to throw the baby out with the bathwater and start over completely, introducing new problems.
  • "Buy hardware X and you won't have this problem!": True, I could buy 2 or 3 additional servers at similar cost and have a full HA setup. I know how this is done, it is not that hard. But this is not the situation here. If that were a viable solution in my case, I would not have asked the question in the first place.
  • "Just accept the delay of shutdown and reboot": I know that this is a possibility, as it is what I'm doing currently. I have asked the question to either find better alternatives within the current setup, or to learn of substantiated technical reasons why some of the methods outlined will have problems - "it is unpredictable" without any explanation why is not a substantiated answer in my book.

Therefore, to rephrase the questions:

  1. Which of those two methods is technically preferable and why, assuming the setup is fixed and the goal is to reduce downtime without introducing any negative side effects to data integrity?
  2. What are unintended side effects in special cases like
    • active/idling/quiescent databases with users and/or applications accessing them
    • Active Directory controllers on this machine and/or on other machines (on the same domain)
    • general purpose machines idling or with users running jobs or running automated maintenance jobs like backups
    • appliances like network monitoring or routers
    • network time with or without using NTP on this server or on another or on multiple servers
  3. In which special cases is it advisable to not do this, because the downsides are greater than the advantage? Where should one be careful? A comment on the linked blog mentions timekeeping problems may arise when the CPU is frozen, for example, but does not provide any reasoning, proof or test results.
  4. What are the factual, technical differences between those two solutions and
    1. Stalled execution of VM processes because of CPU overload on the host
    2. Increased wait times on disk I/O because of faulty disks or controllers, assuming they stay below the NFS threshold?
user121391
    This sounds like a circular dependency that ought not to exist. My suggestion would be to redesign your storage. – joeqwerty Sep 02 '16 at 13:52
  • @joeqwerty No, a circular dependency would be an AD server that needs storage while the storage only works with running AD. My case is a normal dependency chain - A (VMs) depends on B (storage), B (storage) depends on C (hypervisor). Completely removing the dependency altogether would either remove valuable features or require more machines and cost. – user121391 Sep 05 '16 at 07:01

3 Answers


Good question...

But why do you need to reboot the NFS server, anyway?

All-in-one designs aren't reasonable anymore. As a science experiment or small home-lab situation, sure. But like any solution, expect to build in downtime and maintenance windows when necessary.

So...

  • Set your VM startup and shutdown order (good thing to have in place).

[screenshot: vSphere VM startup/shutdown order settings]

  • You can select multiple VMs to shut down or pause at the same time. (I used to suspend VMs when I did this; a scripted version is sketched after this list.)

[screenshot: selecting multiple VMs for shutdown/suspend in the vSphere client]

  • Do whatever you need to the NFS VM.
  • Eat the downtime.
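If you prefer to script the downtime rather than click through the client, something like the following could work from the ESXi shell. This is a sketch: the VM IDs are examples taken from vim-cmd vmsvc/getallvms output, and a clean guest shutdown requires VMware Tools in the guests.

```sh
# List VM IDs and names, shut the dependent guests down cleanly,
# do the storage maintenance, then power them back on.
vim-cmd vmsvc/getallvms

# IDs 2 3 4 are examples; exclude the storage VM from this list.
for id in 2 3 4; do
  vim-cmd vmsvc/power.shutdown $id      # clean guest-OS shutdown via VMware Tools
done

# ... wait for the guests to power off, then patch/reboot the storage VM ...

for id in 2 3 4; do
  vim-cmd vmsvc/power.on $id
done
```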

If you can't have this type of downtime, you should not be running an all-in-one storage and VM setup, or you should consider traditional SAN storage (or a low-cost version) and multiple VM hosts.

ewwhite
  • OK folks, enough with the carrying on in comments. OP, you have your answer. In regards to ewwhite's answer: it's completely appropriate. Maybe it's not what you're looking for, but in many cases "Well don't do that." **is** an appropriate answer. You asked a question on a site for professionals, so you can expect to get an answer in kind. Ewwhite is a very knowledgeable storage/VMware architect. You would do well by not dismissing any advice he might give you. – EEAA Sep 29 '16 at 13:50
  • @EEAA: Then why did he not support his claims in any way? Link to VMware technical spec (I could not find anything appropriate), link to discussions, even a "at one customer I have seen something like that and it exploded because of X" would help. If he is so knowledgeable like you say, why does he not share his knowledge in a similar way as Adam did? It baffles me how you are so fixated on the surroundings instead of the core questions. All the professionals I have interacted with were professional about the topics at hand, regardless of their own beliefs, so I also expected that here. – user121391 Sep 29 '16 at 13:58
  • @user121391 Look at the github tutorial he linked to, and then perhaps re-evaluate whether or not you think he's capable of lending useful advice on this topic. – EEAA Sep 29 '16 at 13:59

My suggestion would be to avoid this problem altogether. You mentioned that increased costs and a complete re-architecting are show stoppers, but what you could consider in this situation is to have two storage VMs on the host in a two-node failover cluster. This would allow you to patch either one of them (but not both at the same time) without affecting the availability of NFS or iSCSI served by the cluster. It still isn't a supported solution, but it does at least allow some flexibility in maintenance at the cost of increased resource overhead (mainly however much memory you give to the second storage VM) for storage.

If changing the architecture is completely unacceptable, then the safest option would be to shut down the VMs.

The next-best solution would be to enable hibernation in your VMs. Hibernation would ensure that all filesystems are quiesced, helping avoid possible corruption.

Next, you could take a snapshot of the VM with memory state, forcibly terminate the VM's process, then revert to the snapshot when done. This incurs a small window of possibly lost data, but I'm sure you would only try this during a maintenance window, where a small amount of lost data is acceptable, so this should be fairly inconsequential. This solution is as quick as taking a snapshot and ensures the VMs don't complain about lost disks, but it does incur potential data loss.
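On the ESXi shell this could look roughly like the sketch below. The VM ID, snapshot name, snapshot ID and World ID are all examples; the approach is just as unsupported as the others, and anything the guest writes after the snapshot is lost on revert.

```sh
# Sketch: memory snapshot, hard-kill the VM process, revert after storage maintenance.
VMID=2   # example ID from `vim-cmd vmsvc/getallvms`

# Snapshot including memory state (args: name, description, includeMemory=1, quiesced=0)
vim-cmd vmsvc/snapshot.create $VMID "pre-storage-maint" "before storage VM reboot" 1 0

# Hard-stop the VM process so it never notices the missing datastore.
esxcli vm process list                                  # find the VM's World ID
esxcli vm process kill --type=hard --world-id=123456    # 123456 is an example World ID

# ... patch and reboot the storage VM ...

# List the snapshot tree to find the snapshot ID, then revert to it.
vim-cmd vmsvc/snapshot.get $VMID
vim-cmd vmsvc/snapshot.revert $VMID 1 0     # snapshot ID 1 and suppressPowerOn=0 are examples
```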

Lastly, if you want to pause the processes (and have tested that it actually works), then I would strongly suggest that you first sync all disks in the guests (on Linux with /bin/sync; on Windows with the Sysinternals Sync utility: http://technet.microsoft.com/en-us/sysinternals/bb897438.aspx), and perform your maintenance quickly so the guests' clocks don't fall too far behind.

As for potential side effects, any AD-connected machine must (by default) be within 5 minutes of the DC's time. Therefore, after any solution other than a normal shutdown, where the VM is not continuously available, I would suggest that you force the resumed guest to update its clock. On database servers, don't do these things while the server is busy, as that increases the chances of filesystem corruption.
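For reference, forcing a resync after a guest resumes could be as simple as the following (commands run inside the guests, not on the host; the NTP server name is a placeholder):

```sh
# Windows guest that is an AD member (syncs from the domain hierarchy):
#   w32tm /resync

# Linux guest with classic ntp tooling; ntp.example.com is a placeholder:
ntpdate -u ntp.example.com

# Or on a systemd-timesyncd system, simply kick the sync service:
systemctl restart systemd-timesyncd
```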

The main risk in all of the options beyond a normal shutdown or highly-available storage is corruption. There will potentially be some I/O sitting in a buffer that gets dropped, which the application may mistakenly believe completed successfully. Worse yet, I/Os may have been re-ordered by a lower layer into a more optimal write pattern, so data may have been partially written out of order. Perhaps a row count was incremented before the DB row's data was written, or a checksum updated before the checksummed data was physically changed. This can be mitigated by only allowing synchronous writes to your storage, but at the cost of performance.
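On the ZFS side, enforcing that behaviour amounts to something like the following (the pool/dataset name is a placeholder; an SLOG device offsets most of the performance cost):

```sh
# Force every write to the dataset backing the NFS export to be committed
# synchronously before it is acknowledged to ESXi.
zfs set sync=always tank/vmstore
zfs get sync tank/vmstore
```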

Adam Rushad
  • Thank you for your detailed answer. You have added some new points I did not consider, like the two virtual nodes or the clock drift/forced update. This would be pretty easy to automate with manual ntpdate/Windows scheduled tasks after startup, I presume. Your last paragraph addresses the core of my problem - avoiding corruption at all costs. My whole setup already uses synchronous writes everywhere (ESXi as well as ZFS), using an SLOG device to speed things up, so I assumed I am fine (transactional databases would think the data has not been written and re-do the transaction log on startup). Is this correct? – user121391 Sep 12 '16 at 07:10
  • If you are enforcing synchronous writes, then you should be mostly safe, although your storage will be slower than it could be. With ZFS, I think you should even be safe against the RAID5 write hole (assuming you are not RAID1/0). – Adam Rushad Sep 21 '16 at 22:53
  • Regarding your DC advice with time sync, I've read similar at https://www.sole.dk/how-to-configure-your-virtual-domain-controllers-and-avoid-simple-mistakes-with-resulting-big-problems (he also mentions that out-of-sync DCs no longer replicate with each other). But my plan is to stay within 60 seconds, time out all devices at the same time and for the same duration (except the NTP server), and use NTP to re-sync. – user121391 Sep 22 '16 at 06:47
  • Regarding the sync writes: that was the original idea. Performance is sufficient in my case, the server is not fully utilized anyway. I've also read that the VM disk timeout can/should be increased here (http://nex7.blogspot.de/2013/03/vm-disk-timeouts.html), so I assume my plan would fall inside this time window. Now the question is: is NFS stateless, so that a restarted process can take over flawlessly? (I am assuming yes from the info in the link about SAN failover, but I think I should test it nevertheless.) – user121391 Sep 22 '16 at 06:53
  • NFSv3, which is all that VMware uses up until vSphere 6, is stateless. NFSv4, however, is not. If you place the NFS state directory on the storage passed between your storage VMs, your state should fail over, too. However, if you share the storage between two storage VMs, you can also switch entirely to iSCSI, allowing both machines to be used for writes with complete safety. In this example, you actually wouldn't need to enforce sync writes, and would be using VMware's VMFS. With DRBD, you could expand availability across multiple physical machines, if you so desired. – Adam Rushad Sep 22 '16 at 09:34
  • Thank you for the clarifications. When setting up the server, I chose NFSv3 because of its alleged simplicity, so it seems this now pays off. In the future I might think about iSCSI for more flexibility, but then I could just serve additional disks directly to the VMs (assuming they support iSCSI). But wouldn't using iSCSI still need synced writes (write-back cache or whatever it is called), because even multiple storage servers could go down at the same time? – user121391 Sep 22 '16 at 09:47
  • Synced writes on disk would be write-through or no cache; a write-back cache writes to the cache first, then to disk. As for needing sync writes with iSCSI: both storage VMs can serve the same disks, or DRBD can replicate synchronously with both writeable. At that point, anything the iSCSI client (VM or host) thinks it has written has actually been written. Your storage resiliency is effectively the same as not using storage VMs, and rebooting a single storage VM wouldn't really impact that. Sync is always more reliable, though. Note that my answer assumes you export disks rather than files with iSCSI, and no software RAID. – Adam Rushad Sep 23 '16 at 00:35
  • Thank you. I've accepted your answer, as it is the only one that applies to my question and you also contributed several good new points like the sync tool and the time sync problem. If you'd like to edit the answer with more in-depth knowledge, I would appreciate it and award you the bounty (if no other better answers surface in the meantime). – user121391 Sep 23 '16 at 08:26
  1. Which method is preferable, or are they equally good/bad?

Neither.

This is the cost of a terrible design. I wouldn't make this situation worse by doing anything other than shutting down your VMs, working on the storage VM, then restarting the other VMs. I'd also get someone to redesign your setup using a supported/supportable architecture.

  2. What are unintended side effects in special cases like databases, Active Directory controllers, machines with users running jobs etc.?

It's inherently unpredictable; what happens this time may not happen if you did it again. It's unsupportable.

  3. Where should one be careful? A comment on the linked blog mentions timekeeping problems may arise when the CPU is frozen, for example.

It's difficult to answer this constructively.

Chopper3
  • I am aware that it is a problem in the design, but I have not found an alternative that gives me the same benefits at the same cost. For example, using 2 ESXi hosts and 2 SANs as well as vSphere licences removes the problem via vMotion, but it costs almost 4 times as much initially and also yearly. On the other hand, a simple ESXi setup has no data integrity via ZFS. If you have a solution for this problem, I would be interested to hear it. – user121391 Sep 02 '16 at 09:57
  • Also, could you please explain what you mean by unpredictable? Shouldn't the first version be similar to Suspend (just not to disk, but to RAM), and the second one to lagging storage? Especially the second one should only halt all operations, as of course the NFS and backing ZFS datastores always use sync writes. – user121391 Sep 02 '16 at 10:03
  • Consider building or buying physical ZFS storage arrays with the desired HA features. For example, see what ewwhite linked to above. Skip the vSphere and the second node if you don't need a clustered hypervisor. – John Mahowald Sep 03 '16 at 17:39
  • @JohnMahowald Please see my updated question for clarification. – user121391 Sep 06 '16 at 06:43
  • While your answer rings true, it also sounds combative and is not constructive. The correct answer is to point to the flaw in the design as it sits. Any changes he makes will require new expenditures. The best path forward is what he needs. The accepted answer, from Adam Rushad, points out the way forward. Specifically, the problem is that his design includes an avoidable single point of failure in the storage VM. – Jeter-work Sep 28 '16 at 17:55
  • @Xalorous - firstly, the question was far simpler when I answered; it has since been added to greatly, undermining my admittedly simplistic response - secondly, what he is doing is stupid, this site is for professional sysadmins building production-quality systems - not for hacks, not for home use - and this solution IS unprofessional and I will continue, as I have for the best part of a decade on this site, a site I helped set up, to recommend against such folly with as much clarity as I like. – Chopper3 Sep 28 '16 at 18:20
  • @Chopper3 go for it, but pointing at something and saying, "It's wrong, it's wrong" and offering nothing is not constructive criticism. Your answer adds zero value to this question as a point of reference. OP needs help seeing the flaw, needs to know where the design deviated from best practice. Especially since sysadmins today often inherit systems full of flaws built in by architects and engineers, and change management prevents a sysadmin from 'fixing it'. Now he/she has to justify the change. A good answer (in this case) gives a sysadmin the ammo needed to scope and justify a change. – Jeter-work Sep 29 '16 at 12:25
  • @Chopper3 As I said, I am perfectly fine with "negative" outcomes. You can tell me that the design is bad and I will listen to you, but only if you also tell me *WHY*. As a positive example, Adam's comments about the differences of state in NFS 3 and 4 are very good: now I know that changing the versions from 3 to 4 might introduce problems (a fact that I was not aware of). If you can provide similar arguments for the other questions, I would welcome your edit. – user121391 Sep 29 '16 at 13:22
  • @Xalorous If I had needed "ammo", I would have asked for it, but I did not. The question was honestly just about the technical side, because I could only find sparse information about it online and thought this would not only add to my knowledge, but to others' as well. Design and architecture always have drawbacks, compromises and ugly sides to them, at least in practice - it depends on how the goals are met, and those goals are different for different situations/people. So, please focus on the technical side, so everyone can draw his/her own conclusions, okay? – user121391 Sep 29 '16 at 13:25