
I am having a serious problem with a server running ESXi 6.0. Everything worked great last week. Now, out of nowhere, the whole thing is basically useless. I am getting datastore latency of up to 51 seconds! Nothing has changed between now and last week other than installing some software on a VM.

[Image: Datastore Lag]

The server is an HP ProLiant DL360 G7 with 2x hexa-core Xeon X5650 2.67GHz CPUs, 144 GB RAM, and 8x 300GB HP 10k SAS hard drives in RAID 10.

I have 6 VMs on the machine, most with thin-provisioned VMDKs. Out of 1.6 TB, I have 600GB free.

2 of the VMs seem to run fine; the others run like total crap.

I have tried rebooting the server and assigning more resources to the slow VMs (even though they have plenty), but nothing is working.

Even with every VM powered off, I have tried to move the VMs off the server to a storage device on the network, and I am getting spikes in data transfer. It will move at 20-30 MB/s for about 20 seconds, then drop to near 0 for a few minutes, then go back up, in a constant pattern that suggests a bottleneck somewhere.

When I try to move data between virtual drives in a powered-on VM, the same thing happens. Right now I am trying to transfer a file and it is going at about 200kb/s. The slow VMs take over 20 minutes to boot and are too slow to use.

[Image: Disk transfer rate]

I am at a total loss. Any help in resolving this would be much appreciated.

Kyle Vaughan
  • Is this a standalone host or part of a cluster managed by vCenter? Can you post the version and build number of your ESXi host? Was this installed with the HP-specific version of ESXi? – ewwhite Aug 29 '16 at 00:37
  • It is a standalone host. It was not installed with the HP version of ESXi. It is ESXi 6.0.0 3620759 – Kyle Vaughan Aug 29 '16 at 00:41

1 Answer


I would suggest that your issue is related to the health of your RAID controller's cache and battery/flash module. If the RAID write cache has been disabled due to failure of the RAID battery, for instance, your write performance on the array will degrade severely.

There are a couple of ways to check this. Can you specify if this is a standalone host or part of a cluster managed by vCenter?


Edit:

This host does not appear to have the HP-specific version of ESXi installed.

Without this or the HP add-ons for ESXi, there's no monitoring of the host hardware or any of the utilities necessary to check system status.

Normally, you can see status graphically like this:

[Two screenshots of the hardware status view; images not available]

I suspect you have a storage battery failure, considering the G7 line was introduced in 2011 and the batteries tend to last 3-5 years in production. If this was a used server, this is the likely cause. You should add the HP add-ons from here, here and here.

At the command line, running the following will show your battery's status (other handy commands):

/opt/hp/hpssacli/bin/hpssacli ctrl all show config detail | grep -i battery

Output:

[root@c2-esx1:~] /opt/hp/hpssacli/bin/hpssacli ctrl all show config detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
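
Note that filtering only for "battery" can hide a cache module fault while the battery itself reports OK. As a minimal sketch, a wider filter that also surfaces cache-related lines might look like the following. The function name `cache_health_lines` is hypothetical (it is not part of the HP tools), and the exact field names in the output depend on the controller and utility version:

```shell
#!/bin/sh
# Hypothetical helper (not part of the HP tools): keep only the lines of
# controller detail output that mention the cache or the battery.
# It reads stdin, so it can also be tried on saved output.
cache_health_lines() {
    grep -i -E 'cache|battery'
}

# Query the live controller only if the utility is installed
# (path as used above).
HPSSACLI=/opt/hp/hpssacli/bin/hpssacli
if [ -x "$HPSSACLI" ]; then
    "$HPSSACLI" ctrl all show config detail | cache_health_lines
fi
```

A "Cache Status Details" line reporting an error would point at the cache module rather than the battery.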

If the part is bad, you can force the controller to ignore the battery status using the following (there's risk if you don't have stable power for your equipment):

/opt/hp/hpssacli/bin/hpssacli ctrl slot=0 modify nbwc=enable

This will at least restore performance while you arrange for parts repair/replacement.

ewwhite
  • This is a standalone host. It is not managed by vCenter Server. It's just ESXi on a server and I connect to it with the vSphere client (not the web client). – Kyle Vaughan Aug 29 '16 at 00:39
  • Please add the build number and let us know if you installed the HP-specific version of ESXi. – ewwhite Aug 29 '16 at 00:40
  • It is not the HP specific version. It is ESXi 6.0.0 3620759 – Kyle Vaughan Aug 29 '16 at 00:42
  • @KyleVaughan The HP version definitely needs to be installed for actual monitoring and to have tools to work on the server. But you probably have a RAID battery failure. – ewwhite Aug 29 '16 at 01:15
  • I tried to install the HP version but it wouldn't boot, so I went with the standard version. I am currently in the process of installing the HP array controller drivers and the utilities (rebooting now) and will report back with my findings. Hopefully the server boots up, as I have no physical access to it (it's on the other side of the country). – Kyle Vaughan Aug 29 '16 at 01:21
  • According to the CLI, the battery is good. Any other ideas? – Kyle Vaughan Aug 29 '16 at 01:32
  • Please http://pastebin.com the output of `/opt/hp/hpssacli/bin/hpssacli ctrl all show config detail` – ewwhite Aug 29 '16 at 01:33
  • When I run `ctrl slot=3 modify dwc=enable forced`, it says that the operation is not supported with the current configuration and gives Reason = cache disabled. Any ideas? – Kyle Vaughan Aug 29 '16 at 01:44
  • That's not something you should enable. What is the full status output? – ewwhite Aug 29 '16 at 01:46
  • http://pastebin.com/iP9N1EXp – Kyle Vaughan Aug 29 '16 at 01:46
  • @KyleVaughan - See **_"Cache Status Details: A cache error was detected. Run a diagnostic report for more information."_** - The ECC RAM on the controller detected this error. The module may have a problem. Try reseating it. – ewwhite Aug 29 '16 at 01:48
  • Hmm, well, that's going to be impossible. The server is about 5,000 km away. – Kyle Vaughan Aug 29 '16 at 01:51
  • Well, that's what's wrong. – ewwhite Aug 29 '16 at 02:01
  • Is there any way to disable the controller cache and use the drive cache? – Kyle Vaughan Aug 29 '16 at 02:13
  • They're not the same thing... This is up to you to handle, though. A reseat or hard power-off/on may help. – ewwhite Aug 29 '16 at 02:17
  • How did you determine "The ECC RAM on the controller detected this error. The module may have a problem. " ? – Kyle Vaughan Aug 29 '16 at 02:35
  • The error is in your pastebin output. – ewwhite Aug 29 '16 at 02:35
  • So would this cause inconsistent behavior? For instance, right now I am copying files within a VM and everything seems just fine. It's seeding away at 80 mb/s and there is no datastore latency. Yet 30 minutes ago I was getting 480 kb/s with 57 second latency. – Kyle Vaughan Aug 29 '16 at 03:57
  • You clearly have an error and a hardware problem. The server is no longer healthy. Repairing this will restore your performance. – ewwhite Aug 29 '16 at 03:58
  • Thank you all. I had someone on site re-seat the cache board and now the cache reads as OK. My datastore latency has gone from 51 seconds down to 6 seconds peak max. Network throughput is now 80+ MB/s. I am curious, though, why re-seating the card would help. You would think that if it was making enough of a connection to be detected, then it would just work. – Kyle Vaughan Aug 30 '16 at 02:05
  • It probably vibrated loose or was not securely installed. – ewwhite Aug 30 '16 at 05:39