
We recently had a Fibre Channel storage issue (it looks like it was a single bad cable) that affected all 360 VMs across 2 clusters attached to the same storage virtualisation device - an IBM SVC 2145. The VMs were so slow to respond that they were effectively unusable, and many were logging symmpi errors in the Windows event logs.

VMware responded with the obvious - "storage issue" - but our storage team is adamant there was no problem with their equipment or zoning. I need to know how a single faulty cable could effectively bring down all VMs in 2 separate clusters.

Has anyone had a similar problem, or able to shed any light?

PS: all hosts are running vSphere Update 1, with patches to December 2009.

Edit: Physical servers attached to the same SVD were apparently unaffected.

  • Have you looked in the ESX VMkernel logs to see if there are any interesting SCSI messages during the event? What multipathing policy are you using on the ESX hosts for the datastores from that array? Is that policy the one recommended by VMware/IBM? – Helvick Feb 17 '10 at 19:27
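For anyone digging through the VMkernel logs as Helvick suggests, a quick filter for SCSI-related entries can be sketched like this. The log excerpt, timestamps, and device names below are illustrative only, not from the poster's environment (on ESX classic hosts the real entries live in /var/log/vmkernel):

```python
import re

# Illustrative lines in the general style of ESX 4 vmkernel log entries.
sample_log = """\
Feb 17 19:05:01 vmkernel: cpu2: SCSI: vmhba1:0:3:0 status H:0x5 D:0x0 P:0x0
Feb 17 19:05:02 vmkernel: cpu0: VMotion: migration begins
Feb 17 19:05:03 vmkernel: cpu1: SCSI: path vmhba1:0:3 dead
"""

def scsi_events(log_text):
    """Return only the lines that mention SCSI activity."""
    return [line for line in log_text.splitlines()
            if re.search(r"\bSCSI\b", line)]

for line in scsi_events(sample_log):
    print(line)
```

Aborts, path-dead messages, or host-status codes clustered around the time of the event would point at the fabric rather than the array itself.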

2 Answers


I don't believe that a cable fault could cause your corruption - FC frames are checksummed precisely to catch such problems - in fact, FC is one of the most resilient transmission protocols for its speed.
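To illustrate the point: FC frames carry a 32-bit CRC, so a marginal cable produces detected (and retried or dropped) frames rather than silently corrupted data. A toy sketch using Python's `zlib.crc32` (the payload here is made up; this is just demonstrating that a single flipped bit changes the checksum):

```python
import zlib

payload = b"data block written to the SAN"
crc = zlib.crc32(payload)  # checksum computed by the sender

# Flip a single bit, as a marginal cable might:
corrupted = bytearray(payload)
corrupted[3] ^= 0x01

# The receiver's checksum no longer matches, so the error is caught.
assert zlib.crc32(bytes(corrupted)) != crc
```

That detection is exactly why a bad cable tends to show up as retries and latency (i.e. the slowness described) rather than data corruption.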

Chopper3

You could be oversubscribing the links. Take a look at the traffic on the FC fabric - is it saturated? If so, a single link going down could mean high latency for disk I/O. vKernel has some good software for locating bottlenecks within a VMware cluster; it could shed some light. Hope this helps.
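A quick back-of-envelope check for this theory: compare the aggregate host-side HBA bandwidth against the links feeding the storage tier. All the numbers below are hypothetical placeholders - substitute your own fabric's figures:

```python
# Hypothetical fabric numbers -- substitute your own environment's values.
hosts = 16
hba_ports_per_host = 2
link_gbps = 4            # 4 Gb FC per port
storage_links = 4        # links from the fabric into the storage tier

host_side = hosts * hba_ports_per_host * link_gbps      # 128 Gb potential
storage_side = storage_links * link_gbps                # 16 Gb available

ratio = host_side / storage_side
print(f"Oversubscription ratio: {ratio:.1f}:1")
```

The higher that ratio, the more losing a single link can push the remaining paths past saturation, which would show up exactly as cluster-wide I/O latency.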

Alan