
At my company we were experiencing some weird spikes in IO latency on one of our ESXi instances.

We've spent 24 hours trying to figure out what's wrong, with no clue so far.

After giving up, we put all the disks into a different server (HP DL380 G7) with much less RAM and only one 6-core (HT) CPU (down from 12 cores on the DL580), which ran fine for about 2 hours.

I don't know the exact specs of the DL380, but both servers have a Smart Array P410i with BBWC (the DL580 has 1GB).

Is it possible that one (or all) of the disks is failing without actually being reported as failed?

  • A single VM experiences IO latency, or all of them? – SpacemanSpiff Jun 28 '12 at 03:39
  • How many disks, which disks, which RAID level, controller firmware, etc.? Do you have disk-intensive apps? Do your applications/OS have enough RAM for disk cache? – GioMac Jun 28 '12 at 07:07
  • Could you provide some quantitative measures here? Which is it: 10ms, 20ms, 100ms, 1s? What are the percentiles? – pfo Jun 28 '12 at 07:39
  • "A single VM experiences IO latency, or all of them?" The whole host is freezing, even the SSH command line. "Could you provide some quantitative measures here?" More like 10s or more. – teddydestodes Jun 28 '12 at 23:12

2 Answers


What steps did you take to troubleshoot during the 24 hours on the DL580 system?

Both of those systems feature the same Smart Array P410 RAID controller. Was the cache balance configured the same on the DL580 G7 and the DL380 G7?

For something like VMware ESXi local storage use, I'd set the controller cache to a 25%:75% read:write ratio.

Now the details... Make sure you look at the following:

  • Which build of VMware ESXi were you using? Was it the most recent version?
  • Try installing the HP health agents if possible. This would report the array and controller health to VMware.
  • Installing the HP utilities will let you query the health and manage the RAID controller from VMware.
  • What is the RAID array configuration? How many disks? Your tags say RAID 6. RAID 6 is a poor choice for mixed VM workloads, so that could be a consideration.
  • How were you measuring the latency spikes? From within a VM? At the datastore level? With esxtop? Depending on your measurement method, this could be a VM-level issue.
  • Make sure the firmware on the server and the associated RAID controllers is up to date. That really can make a difference on HP gear. Since you're working with VMware, I'd just download the current HP Firmware DVD and let it boot (with the disks inserted). That will bring things up to date and reduce the chance of a firmware bug being the cause.
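However the latency is measured, it helps to report percentiles rather than a single average, since a handful of multi-second stalls can hide behind an otherwise healthy mean. The following is only a minimal sketch in Python: the function names and the sample values are made up for illustration, and it assumes you have already exported per-interval latency readings in milliseconds from whatever tool you use.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (0 < pct <= 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(samples_ms):
    """Return median, tail percentiles and maximum of latency samples (ms)."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


if __name__ == "__main__":
    # Hypothetical readings: mostly healthy, with two stalls of 10s or more.
    samples_ms = [4, 5, 6, 5, 7, 4, 6, 5, 12000, 5,
                  6, 4, 5, 7, 6, 5, 4, 10500, 5, 6]
    print(summarize_latency(samples_ms))
```

With readings like these, the median stays in the single-digit millisecond range while the tail percentiles expose the stalls, which is the kind of pattern a single marginal disk could plausibly produce.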
ewwhite
  • Thanks for the help. We have updated everything, but there were only smaller, non-critical updates. Anyway: one of the disks finally failed and everything is fine now. – teddydestodes Jun 28 '12 at 20:51
  • On a side note: this server (DL580 G7) is on its second system board and 4th SPI board, one of the CPUs failed and now the second drive, and it got a new power backplane and a new SAS backplane, all in 1.5 years of usage. – teddydestodes Jun 28 '12 at 23:22

In case someone is experiencing the same problem: it was indeed one of the disks failing. When we came back earlier today, we found one of the four disks showing an amber LED.

Once the RAID controller had marked it as failed, everything went back to normal, and after we switched back to the original server the latency was below 10ms again.

The DL380 G7, though, won't recognize its capacitor and won't activate its cache, but that is another story.

  • You can set the controller to enable the write cache regardless of the battery or capacitor status. In the BIOS utility, select *Cache Settings* and then *Enable Write-Cache Battery Override*. – ewwhite Jun 28 '12 at 21:24
  • ... I don't think that is such a bright idea in case of a power loss ;) – teddydestodes Jun 28 '12 at 23:08