
I have an ESXi box with HP LeftHand storage exposed via iSCSI.

I have a virtual machine with a 1TB disk, of which 800GB is consumed. The disk is thick provisioned on the LeftHand storage.

A snapshot was open on the VM (so that Veeam Backup and Recovery could do its thing) for around 6 hours, during which a delta disk of around 5GB accumulated.

The snapshot removal has now been running for over 5 hours and still isn't complete. The storage array is reporting virtually no IOPS on that array (around 600, which is background noise), almost no throughput (around 8MB/sec, again background noise), and an average queue depth of 9.

In other words, the snapshot consolidation process doesn't seem to be IO bound, and I can't see anything that would cause the removal to be so damn slow. It is working, judging by the delta files.

Anything else that I should look at as to why this (relatively small) snapshot is so slow to be removed?


As per the VMware documentation, I'm watching ls -lh | grep -E "delta|flat|sesparse" right now, and I can see two delta files changing:

-rw-------    1 root     root      194.0M Jun 15 01:28 EXAMPLE-000001-delta.vmdk
-rw-------    1 root     root      274.0M Jun 15 01:27 EXAMPLE-000002-delta.vmdk

I'm deducing that one snapshot file is being consolidated while the other collects new writes during the consolidation. Then that new delta is consolidated in turn, and another is created to cover that pass.
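Rather than re-running ls by hand, a quick polling loop (just a sketch; the path is whatever your VM's folder on the datastore happens to be) makes the shrink-and-regrow pattern easier to see:

```shell
# Log delta-file sizes with timestamps so the consolidate/re-delta
# iterations show up as a size history. Run from the VM's folder on
# the datastore, e.g. /vmfs/volumes/<datastore>/<vm>.
for i in 1 2 3; do      # or 'while true' to watch until it finishes
    date
    ls -lh | grep -E "delta|flat|sesparse"
    sleep 5             # poll interval; stretch this out in practice
done
```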

The file sizes are dropping with each iteration (well, most iterations), so I assume that eventually this consolidation procedure will complete (maybe I'll need to take the VM off the network for 30 minutes to let this finish without generating any changes).

It's taking around 2 minutes per hundred megs of delta to consolidate. This has certainly never happened before. Snapshot removal under a normal Veeam backup takes around 40 minutes (so certainly not fast, but not this slow).
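As a rough sanity check on those numbers (my arithmetic, not anything from VMware): at 100MB per 2 minutes, a flat 5GB delta should commit in under two hours, so the 6-hour runtime suggests the repeated consolidate-then-re-delta passes are where the time actually goes:

```shell
# Back-of-the-envelope: observed commit rate vs. the 5GB delta.
# 100 MB per 2 minutes = 50 MB/min; 5 GB = 5120 MB.
awk 'BEGIN {
    rate = 100 / 2            # MB per minute
    size = 5 * 1024           # delta size in MB
    printf "~%.0f minutes at %d MB/min\n", size / rate, rate
}'
```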


After 6 hours and 2 minutes, the snapshot is finally removed. However, I'd still like to know how you would normally troubleshoot this sort of issue (outside of storage performance).

Mark Henderson
  • I can't help notice that 8Mbit/second is pretty close to 10Mbit/sec networking minus some overhead. Any chance this is a network related problem on the iSCSI link - dodgy patch lead just starting to fail? Is it a single link, a single host, is the host otherwise performing OK for sustained reads/writes? Can you check the switch port for errors? – TessellatingHeckler Jun 17 '15 at 06:34
  • @TessellatingHeckler I just did some tests and I can still get around 1.5Gbit/sec sequential from the array, which is what I would expect to get from it under normal circumstances. Last night the snapshot removal took *three minutes* which is by far the fastest I've *ever* seen it (normally it's about 10x that long, but there was a big football game on here last night so I suspect that nobody was using the systems after hours when the backups run, hence the tiny delta and small commit time). So it *can* do it quickly, just that one time it didn't. – Mark Henderson Jun 17 '15 at 21:22
  • Hmm. Do you have VMware Storage IO Control running, and is the datastore shared with other VMs? Any chance it was hitting some throttling/soft limit there, without stressing the host or SAN hardware? – TessellatingHeckler Jun 17 '15 at 23:50
  • ESXi and vCenter Version? – Nils Jun 19 '15 at 05:21
  • @Nils 5.5 for both – Mark Henderson Jun 19 '15 at 06:07
  • The 10Mbit hint is still nagging me. Is your storage connected to such a slow network, and is that network connected to the vCenter or ESXi host too? – Nils Jun 20 '15 at 20:54
  • @Nils it's gigabit everywhere. I haven't seen that behaviour since, and I've done quite a few snapshot removals since then. – Mark Henderson Jun 21 '15 at 08:26
  • So it's one of those problems that goes away when you take a look at it... – Nils Jun 23 '15 at 05:27
  • Can this depend on what has happened on the disk? By the nature of incremental deltas, I would expect that for instance changing 1 character in a million files would be exponentially slower than appending a 1 MB file. – Nemo Apr 21 '20 at 08:04

1 Answer


It is my understanding that ESXi snapshot removal can (and usually does) take a long time. Before a snapshot can be removed, its changes need to be written into the next snapshot in the chain. I was taught to always delete snapshots from oldest to most recent to help this process run as quickly and efficiently as possible.

Naturally, the more changes between snapshots the longer the merge will take.

Andrew Meyer
  • Right, except 6 hours to remove a 5GB snapshot is absurd. As I mentioned, it normally takes around 40 minutes, and even that feels too slow. This was the only snapshot on that VM, and snapshot removal has changed in later versions of ESXi such that the removal order doesn't matter much any more. – Mark Henderson Jun 15 '15 at 22:16
  • I've seen this slow snapshot behaviour before with little I/O on storage, but never traced it to a cause. I always just assumed the hypervisor was chewing on the deltas in-memory. (The machines in question were using direct-attached storage, or I might have looked at SAN issues too; I've always chalked it up to either big deltas or unoptimized code in VMware's snapshot subsystem.) – voretaq7 Jun 25 '15 at 17:17