I am seeking suggestions and comments on how to recover from a 100% full vSAN (other than the obvious reset-to-factory option). I have an 8-node ESXi cluster that runs entirely on vSAN-backed storage. Due to circumstances with a vendor that I would prefer not to go into, the total disk capacity was undersized for the storage requirements. The end result was the vSAN hitting the 100% utilized wall hard and handling it about as well as an egg hitting a tile floor. Because the hosts themselves also boot from and live on the vSAN, they locked up when this condition occurred, and several of them crashed, dramatically cutting the available disk capacity on an already full vSAN. I have been able to regain access to some of the hosts, but with the vSAN thrashing the disks in a vain attempt to rebuild the array, it is dreadfully slow to respond. vCenter is unavailable, so I can only manage individual hosts using SSH and the vSphere thick client. This removes most of my controls over the vSAN object, so my options for recovery are severely limited.
A few points:
- I am well aware that filling any SAN technology to 100% capacity is a recipe for disaster, so let's skip those obvious and unhelpful observations.
- I understand and accept that data loss is pretty much inevitable here, but I would like to save as much as I can while deleting whatever is necessary to return the cluster to a functional state.
- The manufacturer has already advised that the cluster has to be reset to factory, but I've seen many cases where the community can provide better answers.
- As the cluster is non-functional, I am willing to take risks and try radical ideas that would normally be out of the question.