
If I'm running VMware 5.x on FC LUNs presented from a NetApp running ONTAP 8.x in 7-mode, what steps should I take to ensure that the VMs can withstand the worst-case cluster node failover? HBA settings? ESX settings? Modifications to the VMs' settings? I know this is all handled by the host kit from NetApp, but we want to know what it's doing and/or do it ourselves.

A cf takeover on NetApp is usually quite fast, but can in some cases run up to 3 minutes.

Basil
  • What multipath rules do you have in VMware? – Zapto Aug 15 '13 at 13:34
  • TimeOutValue within Windows can be increased, but there is a bit of controversy about this. MS says not to increase it beyond 20 secs but ... – tony roth Aug 15 '13 at 13:44
  • This is a fresh install, I can set any multipath rules I want. My goal is to make it so that a planned or unplanned cluster takeover event in Netapp doesn't offline or stun any VMs. – Basil Aug 15 '13 at 13:56
  • Are you concerned about path failover? – ewwhite Aug 18 '13 at 02:02
  • Path failover is seamless: if you take down a fabric or HBA, VMware will grab the LUN on its other path, no problem. I'm more worried about the long cf takeover process where one NetApp node assumes the identity of the other. It's fast, usually, but can be up to 180 seconds. – Basil Aug 18 '13 at 03:10

1 Answer

Mainly you want to make sure your VMs will be OK with no disk I/O for a short period. Installing the VMware Tools will take care of this with 4.x and newer. You also want to set the right timeout-related settings on your ESX hosts; these are detailed on the NetApp site in TR-3749.
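As a rough sketch of the kind of host-side settings TR-3749 covers (the specific values depend on your ESX/ONTAP versions, so the numbers below are illustrative examples, not authoritative recommendations):

```shell
# Illustrative ESXi 5.x advanced settings of the sort TR-3749 discusses
# for block (FC/iSCSI) storage. Verify the recommended values for your
# exact ESX/ONTAP combination before applying.
esxcli system settings advanced set -o /Disk/QFullSampleSize -i 32
esxcli system settings advanced set -o /Disk/QFullThreshold -i 8

# Verify the current value:
esxcli system settings advanced list -o /Disk/QFullSampleSize
```

The NetApp Virtual Storage Console plugin can apply these host settings for you, which is the usual way to stay in line with whatever the current best-practice values are.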

Really it does not matter which protocol you are using for cluster failovers, since the guests will see the same thing: their disks will be offline/read-only for a brief period. This script can help you set the NetApp-recommended settings on the VMs, but I am pretty sure the VMware Tools install does it for you.
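For reference, the guest-side change boils down to raising the SCSI disk timeout so the guest waits out a takeover instead of erroring. A hedged sketch, one command per guest OS (190 seconds is the value NetApp's host utilities have commonly used; treat it as an example and check your version of the docs):

```shell
# Windows guest (elevated prompt): raise the disk I/O timeout in the
# registry. This is the TimeOutValue mentioned in the comments above.
reg add HKLM\SYSTEM\CurrentControlSet\Services\Disk /v TimeOutValue /t REG_DWORD /d 190 /f

# Linux guest: per-device equivalent via sysfs (sda shown as an example;
# not persistent across reboots unless set from a udev rule or rc script).
echo 190 > /sys/block/sda/device/timeout
```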

On the ESX/ESXi side it depends on what versions of ESX/ONTAP you are running, but it's not something intended for you to edit. You can force MRU/RR etc., but that only controls how it discovers the paths to its LUNs (unless you are doing something odd in your switching, it should be pretty much set and forget). You can't actually speed up the failover itself; it's a function of how long it takes for the controller to boot, sync, and then replay the data in NVRAM. VMware's and NetApp's PSP/SATP work together to manage this. For more info you can use esxcli to look at (and edit a limited amount of) these settings. Try esxcli storage nmp and esxcli storage san fc to dig in, but personally I wouldn't mess with this stuff unless I had a specific need; even if you manage to squeeze some performance improvements out, you may end up breaking compatibility with future releases.
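A few read-only commands to see what the host has decided on its own (assuming ESXi 5.x; `esxcli storage san fc` is only present in 5.1 and later):

```shell
# List the installed SATPs and their default PSPs:
esxcli storage nmp satp list

# Show which SATP/PSP each device was claimed with, per LUN:
esxcli storage nmp device list

# FC adapter view, ESXi 5.1+:
esxcli storage san fc list
```

Looking before touching anything is the point here; if the NetApp LUNs already show a sensible SATP with ALUA and MRU/RR, there is nothing to fix.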

For the HBAs it depends; the main thing to look at there is queue depth, but that depends on how your workload is configured. On the filer side you shouldn't need to change the queue settings; it's supposed to adjust itself based on workload.
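If you do decide queue depth needs changing, it's set as a driver module parameter on the host. A hedged example for a QLogic FC HBA on ESXi 5.x (module name `qla2xxx` and parameter `ql2xmaxqdepth` vary by vendor and driver version, and the value 64 is just a placeholder):

```shell
# Example only: raise the queue depth on a QLogic FC HBA.
# Emulex and other vendors use different module/parameter names.
esxcli system module parameters set -m qla2xxx -p ql2xmaxqdepth=64

# Confirm the setting, then reboot the host for it to take effect:
esxcli system module parameters list -m qla2xxx
```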

If you're using FC I highly recommend at least trying NFS as well, it can make your life so much easier.

TheFiddlerWins
  • since he's starting greenfield and has netapps then NFS is the way.. – tony roth Aug 15 '13 at 14:35
  • I will clarify the question. I am going to install the datastores on FC LUNs provided by Netapp. – Basil Aug 15 '13 at 15:18
  • really wouldn't do this unless you already have an extensive san infrastructure, if you do then you probably already have 8g fiber and you may not have 10ge yet which would obviously change things. – tony roth Aug 15 '13 at 16:03
  • I had read that document already, but it doesn't have a section about setting the timeout values, nor a section about cluster failover. – Basil Aug 15 '13 at 16:09
  • We do have an extensive SAN infrastructure, and we're standardized on FC for VMware. We don't have 10GbE. I'm really most interested in what we can do to ensure a cluster failover on FC doesn't stun VMs. – Basil Aug 15 '13 at 16:11
  • nuff said, right. – tony roth Aug 15 '13 at 16:14
  • By default it will use VMW_PSP_MRU to decide what path to use (MPIO on the NetApp side). This is how it determines what path to use to access the LUNs, but if the controller is failing over (or giving back), as you said, it can be offline for quite a while. NetApp recommends bumping up the Disk Timeout value in Windows (I believe the VMware Tools does this for you on install). This does not speed up the failover; it just makes the guest OS more patient before giving you a BSOD. – TheFiddlerWins Aug 19 '13 at 17:05
  • This is what I was looking for. I will need to find a SATP for Netapp and have the VMWare admins change the guest timeout in the registry (or ensure that it's set by the VMWare tools). – Basil Aug 20 '13 at 20:32