
I'm experiencing fsync latencies of around five seconds on NFS datastores in ESXi, triggered by certain VMs. I suspect this might be caused by VMs using NCQ/TCQ, as this does not happen with virtual IDE drives.

This can be reproduced using fsync-tester (by Ted Ts'o) and ioping. For example, using a Grml live system with an 8 GB disk:

Linux 2.6.33-grml64:
root@dynip211 /mnt/sda # ./fsync-tester
fsync time: 5.0391
fsync time: 5.0438
fsync time: 5.0300
fsync time: 0.0231
fsync time: 0.0243
fsync time: 5.0382
fsync time: 5.0400
[... goes on like this ...]

That is 5 seconds, not milliseconds. It even creates I/O latencies on a different VM running on the same host and datastore:

root@grml /mnt/sda/ioping-0.5 # ./ioping -i 0.3 -p 20 .
4096 bytes from . (reiserfs /dev/sda): request=1 time=7.2 ms
4096 bytes from . (reiserfs /dev/sda): request=2 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=3 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=4 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=5 time=4809.0 ms
4096 bytes from . (reiserfs /dev/sda): request=6 time=1.0 ms
4096 bytes from . (reiserfs /dev/sda): request=7 time=1.2 ms
4096 bytes from . (reiserfs /dev/sda): request=8 time=1.1 ms
4096 bytes from . (reiserfs /dev/sda): request=9 time=1.3 ms
4096 bytes from . (reiserfs /dev/sda): request=10 time=1.2 ms
4096 bytes from . (reiserfs /dev/sda): request=11 time=1.0 ms
4096 bytes from . (reiserfs /dev/sda): request=12 time=4950.0 ms

When I move the first VM to local storage it looks perfectly normal:

root@dynip211 /mnt/sda # ./fsync-tester
fsync time: 0.0191
fsync time: 0.0201
fsync time: 0.0203
fsync time: 0.0206
fsync time: 0.0192
fsync time: 0.0231
fsync time: 0.0201
[... tried that for one hour: no spike ...]
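
For reference, fsync-tester essentially writes 1 MB at offset 0 and then times the fsync() that follows, in a loop (see the strace output further down). A minimal sketch of the idea, not Ted Ts'o's original source; the test file name and the one-second sleep between iterations are my assumptions:

/* Minimal sketch of what fsync-tester essentially does (not the original
   source): write 1 MB at offset 0, then time the fsync() call. */
#define _XOPEN_SOURCE 500   /* for pwrite() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    struct timeval start, end;
    char *buf = malloc(1048576);
    if (buf == NULL) { perror("malloc"); return 1; }
    memset(buf, 'a', 1048576);

    /* The file name is arbitrary for this sketch. */
    int fd = open("fsync-tester.tst", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        if (pwrite(fd, buf, 1048576, 0) != 1048576) { perror("pwrite"); return 1; }

        gettimeofday(&start, NULL);
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        gettimeofday(&end, NULL);

        printf("fsync time: %.4f\n",
               (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1000000.0);
        sleep(1);
    }
}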

Things I've tried that made no difference:

  • Tested several ESXi Builds: 381591, 348481, 260247
  • Tested on different hardware, different Intel and AMD boxes
  • Tested with different NFS servers, all show the same behavior:
    • OpenIndiana b147 (ZFS sync always or disabled: no difference)
    • OpenIndiana b148 (ZFS sync always or disabled: no difference)
    • Linux 2.6.32 (sync or async: no difference)
    • It makes no difference if the NFS server is on the same machine (as a virtual storage appliance) or on a different host

Guest OSes tested that show the problem:

  • Windows 7 64-bit (using CrystalDiskMark; latency spikes happen mostly during the preparing phase)
  • Linux 2.6.32 (fsync-tester + ioping)
  • Linux 2.6.38 (fsync-tester + ioping)

I could not reproduce this problem on Linux 2.6.18 VMs.

Another workaround is to use virtual IDE disks (instead of SCSI/SAS), but that limits performance and the number of drives per VM.

Update 2011-06-30:

The latency spikes seem to happen more often if the application writes multiple small blocks before fsync. For example, fsync-tester does this (strace output):

pwrite(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 1048576, 0) = 1048576
fsync(3)                                = 0

ioping does this while preparing the file:

[lots of pwrites]
pwrite(3, "********************************"..., 4096, 1036288) = 4096
pwrite(3, "********************************"..., 4096, 1040384) = 4096
pwrite(3, "********************************"..., 4096, 1044480) = 4096
fsync(3)                                = 0

The setup phase of ioping almost always hangs, while fsync-tester sometimes works fine. Is someone capable of updating fsync-tester to write multiple small blocks? My C skills suck ;)
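
For illustration, a variant that writes the 1 MB as 4 KB chunks before each fsync (mirroring the ioping setup strace above) could look roughly like this. This is an untested sketch, not a patch to the real fsync-tester, and the file name is made up:

/* Untested sketch: write 1 MB as 256 x 4 KB pwrites (like ioping's setup
   phase above), then time the fsync(). */
#define _XOPEN_SOURCE 500   /* for pwrite() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BLOCK   4096
#define NBLOCKS 256              /* 256 * 4 KB = 1 MB */

int main(void)
{
    char buf[BLOCK];
    struct timeval start, end;
    int fd, i;

    memset(buf, '*', sizeof(buf));

    /* The file name is arbitrary for this sketch. */
    fd = open("fsync-small-blocks.tst", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        for (i = 0; i < NBLOCKS; i++) {
            if (pwrite(fd, buf, BLOCK, (off_t)i * BLOCK) != BLOCK) {
                perror("pwrite");
                return 1;
            }
        }

        gettimeofday(&start, NULL);
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        gettimeofday(&end, NULL);

        printf("fsync time: %.4f\n",
               (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1000000.0);
        sleep(1);
    }
}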

Update 2011-07-02:

This problem does not occur with iSCSI. I tried this with the OpenIndiana COMSTAR iSCSI server. But iSCSI does not give you easy access to the VMDK files, which I need for moving them between hosts with snapshots and rsync.

Update 2011-07-06:

This is part of a Wireshark capture, taken by a third VM on the same vSwitch. This all happens on the same host; no physical network is involved.

I started ioping around time 20. No packets were sent until the five-second delay was over:

No.  Time        Source                Destination           Protocol Info
1082 16.164096   192.168.250.10        192.168.250.20        NFS      V3 WRITE Call (Reply In 1085), FH:0x3eb56466 Offset:0 Len:84 FILE_SYNC
1083 16.164112   192.168.250.10        192.168.250.20        NFS      V3 WRITE Call (Reply In 1086), FH:0x3eb56f66 Offset:0 Len:84 FILE_SYNC
1084 16.166060   192.168.250.20        192.168.250.10        TCP      nfs > iclcnet-locate [ACK] Seq=445 Ack=1057 Win=32806 Len=0 TSV=432016 TSER=769110
1085 16.167678   192.168.250.20        192.168.250.10        NFS      V3 WRITE Reply (Call In 1082) Len:84 FILE_SYNC
1086 16.168280   192.168.250.20        192.168.250.10        NFS      V3 WRITE Reply (Call In 1083) Len:84 FILE_SYNC
1087 16.168417   192.168.250.10        192.168.250.20        TCP      iclcnet-locate > nfs [ACK] Seq=1057 Ack=773 Win=4163 Len=0 TSV=769110 TSER=432016
1088 23.163028   192.168.250.10        192.168.250.20        NFS      V3 GETATTR Call (Reply In 1089), FH:0x0bb04963
1089 23.164541   192.168.250.20        192.168.250.10        NFS      V3 GETATTR Reply (Call In 1088)  Directory mode:0777 uid:0 gid:0
1090 23.274252   192.168.250.10        192.168.250.20        TCP      iclcnet-locate > nfs [ACK] Seq=1185 Ack=889 Win=4163 Len=0 TSV=769821 TSER=432716
1091 24.924188   192.168.250.10        192.168.250.20        RPC      Continuation
1092 24.924210   192.168.250.10        192.168.250.20        RPC      Continuation
1093 24.924216   192.168.250.10        192.168.250.20        RPC      Continuation
1094 24.924225   192.168.250.10        192.168.250.20        RPC      Continuation
1095 24.924555   192.168.250.20        192.168.250.10        TCP      nfs > iclcnet_svinfo [ACK] Seq=6893 Ack=1118613 Win=32625 Len=0 TSV=432892 TSER=769986
1096 24.924626   192.168.250.10        192.168.250.20        RPC      Continuation
1097 24.924635   192.168.250.10        192.168.250.20        RPC      Continuation
1098 24.924643   192.168.250.10        192.168.250.20        RPC      Continuation
1099 24.924649   192.168.250.10        192.168.250.20        RPC      Continuation
1100 24.924653   192.168.250.10        192.168.250.20        RPC      Continuation

2nd Update 2011-07-06:

There seems to be some influence from TCP window sizes. I was not able to reproduce this problem using FreeNAS (based on FreeBSD) as an NFS server. The Wireshark captures showed TCP window updates to 29127 bytes at regular intervals. I did not see them with OpenIndiana, which uses larger window sizes by default.

I can no longer reproduce this problem if I set the following options in OpenIndiana and restart the NFS server:

ndd -set /dev/tcp tcp_recv_hiwat 8192 # default is 128000
ndd -set /dev/tcp tcp_max_buf 1048575 # default is 1048576

But this kills performance: writing from /dev/zero to a file with dd_rescue drops from 170 MB/s to 80 MB/s.

Update 2011-07-07:

I've uploaded this tcpdump capture (it can be analyzed with Wireshark). In this case 192.168.250.2 is the NFS server (OpenIndiana b148) and 192.168.250.10 is the ESXi host.

Things I've tested during this capture:

Started "ioping -w 5 -i 0.2 ." at time 30, 5 second hang in setup, completed at time 40.

Started "ioping -w 5 -i 0.2 ." at time 60, 5 second hang in setup, completed at time 70.

Started "fsync-tester" at time 90, with the following output, stopped at time 120:

fsync time: 0.0248
fsync time: 5.0197
fsync time: 5.0287
fsync time: 5.0242
fsync time: 5.0225
fsync time: 0.0209

2nd Update 2011-07-07:

Tested another NFS server VM, this time NexentaStor 3.0.5 Community Edition: it shows the same problems.

Update 2011-07-31:

I can also reproduce this problem on the new ESXi build 4.1.0.433742.

exo_cw
  • I have to say it's been a while since a brand-new user has come to the board with such a well documented and thought-out question - seriously, hats off to you. It's genuinely interesting too, I've not come across fsync-tester before either so thank you. That said I'm not sure I've got anything to add, you've tried so many of the things I would have already - I'd say speak to VMWare themselves to be honest, they're very good at taking this kind of 'long-tail'/'not an actual service outage' stuff seriously. Anyway just wanted to say well done on what you've done so far :) – Chopper3 Jun 29 '11 at 08:39
  • Unfortunately the VMware website won't let me contact them: "You currently have no active support entitlements" – exo_cw Jun 29 '11 at 09:59
  • ah, yes, that can be a problem of course... – Chopper3 Jun 29 '11 at 10:17
  • And is this tested with the VM in question being the only thing running on the host? – SuperBOB Jun 29 '11 at 15:58
  • I'll try this in my environment and see what's up... – ewwhite Jun 29 '11 at 17:25
  • @SuperBOB: I just tried it again: I can reproduce this with only one VM running on the host. – exo_cw Jun 29 '11 at 17:54
  • Are the drive controllers on the vmware hardware compatibility list? – SuperBOB Jun 30 '11 at 10:40
  • @SuperBOB: I can reproduce this on pure NFS-storage, no local drives or raid controllers involved. As far as I know, there are only complete NAS systems on the VMware HCL for NFS. I'm primarily using the OpenIndiana NFS server (because of ZFS), but this also happens with Linux NFS servers. – exo_cw Jun 30 '11 at 11:06
  • Are you using a ZIL device on any of the ZFS-based systems? – ewwhite Jun 30 '11 at 14:18
  • @ewwhite: Yes I tried, it made no difference. The problem occurs even if the ZIL is disabled (which is usually the fastest setting). I don't think the storage itself is to blame, I can access it just fine from other hosts/VMs at the same time the ESXi NFS-mount is blocking. – exo_cw Jun 30 '11 at 18:07
  • There are lots of VMWare experts on Twitter, and they are not terribly difficult to find. Might be a way around getting some support. – Kirk Jun 30 '11 at 19:22
  • I wonder, if you use the same NAS to present this same storage as iSCSI does the problem persist? I know this may not be your end goal, but its worth a shot. – SpacemanSpiff Jul 02 '11 at 02:28
  • @SpacemanSpiff: I tried it, iSCSI works fine. But I'd like to use NFS for easier access to the VMDK files. – exo_cw Jul 02 '11 at 07:21
  • If I understand this correctly it sounds like NFS is the culprit more than VMware. Have you taken a sniffer capture of what's happening on the network while the NFS mount is waiting? The fact that it's almost exactly 5 seconds in each case says "timeout->retry" to me. I once encountered a similar connection-dropping issue with NFS. It was due to the way NFS does fragmentation combined with a flawed router ACL implementation. – Ryan Jul 05 '11 at 20:09
  • @Ryan - it could just as easily be VMware's NFS client, but certainly it's isolated to that file protocol! – SpacemanSpiff Jul 06 '11 at 03:38
  • Although we ultimately rejected using NFS for our VMFS datastores in favour of FCoE we did test this on a number of rigs (NetApp, HP etc.) pretty extensively and didn't see the variability seen here. – Chopper3 Jul 06 '11 at 09:34
  • @Ryan: I did a Wireshark capture, see the update. You are right, this looks like some kind of timeout. When I start ioping (which writes 1M in small blocks + fsync) it just sits around for five seconds doing nothing, then I see the first NFS traffic. – exo_cw Jul 06 '11 at 09:34
  • @Chopper3: I've already thought about buying some small VMware-certified NFS-NAS appliance to test this. Any suggestions? Netapp is out of budget, unfortunately. The VMware storage HCL does not list any devices if I select both NFS and ESXi, only for ESX. – exo_cw Jul 06 '11 at 09:46
  • I've got quite a lot of spare netapp boxes, what sort of storage volumes are we talking here? I'm in the UK so not far from you. – Chopper3 Jul 06 '11 at 09:57
  • 3
  • The 5 second timeout with NFS sounded familiar. In Linux NFS there is a 0.7 second timeout for RPC that doubles after each failure and hits a major timeout after 3 failures (default settings): 0.7 + 1.4 + 2.8 = 4.9 seconds. There are a wide variety of RPC authentication issues that could cause this. – Mark Jul 06 '11 at 13:58
  • @Chopper3: My thought was to buy something for testing only. My goal is to use OpenIndiana as a VSA for ZFS + 10GigE performance (by vmxnet3). External NAS storage is either expensive or limited to 1Gbit, therefore I'd like to avoid it. – exo_cw Jul 06 '11 at 15:33
  • @Mark: Anything I should try regarding RPC auth? I've checked NFS versions (server: v3) and DNS-settings (server can resolve client), but nothing specific to RPC. – exo_cw Jul 06 '11 at 15:36
  • Could I get a longer capture? With a couple instances of the timeouts please. It appears to be a 7 second delay in the capture provided? If possible set the decode to treat it all as TCP. Or send me the pcap file and I'll carve it up myself. Additionally what does nfsstat tell us? – Ryan Jul 06 '11 at 20:41
  • 2
  • @Ryan: I've uploaded the capture file. I've also uploaded the [nfsstat output](https://rapidshare.com/files/3717943853/nfsstat.txt). – exo_cw Jul 07 '11 at 14:42

7 Answers


This issue seems to be fixed in ESXi 5. I've tested build 469512 with success.

exo_cw

Thanks, nfsstat looks good. I've reviewed the capture. Haven't found anything conclusive, but did find something interesting. I filtered on tcp.time_delta > 5. What I found in every delay instance was the exact start of an RPC call. Not all new RPC calls were slow, but all slowdowns occurred at the exact start of an RPC call. Also, from the capture it appears that 192.168.250.10 contains all the delay. 192.168.250.2 responds immediately to all requests.

Findings:

  • Delays always occur in the first packet of an RPC call
  • NFS Command types were not correlated to delay instances
  • Fragmentation = delays only first packet

A large Write Call can break up into 300 individual TCP packets, and only the first is delayed; the rest all fly through. The delay never occurs in the middle. I'm not sure how window size could affect the beginning of the connection so drastically.

Next steps: I'd start tweaking the NFS options like NFSSVC_MAXBLKSIZE downward rather than the TCP window. Also, I noticed that 2.6.18 works while 2.6.38 doesn't. I know that support was added for the VMXnet3 driver during that timeframe. What NIC drivers are you using on the hosts? TCP offloading yes/no? Around the 95-second mark there are more than 500 TCP packets for a single NFS Write call. Whatever is in charge of TCP and breaking up the large PDU could be what's blocking.

Ryan
  • I tried setting nfs:nfs3_max_transfer_size, nfs:nfs3_max_transfer_size_cots and nfs:nfs3_bsize all down to 8192: No difference, same problems. The linux guests just use their SCSI/SAS-disks, no NFS - the ESXi is the NFS-client, therefore no network driver issue on the linux guest. On the NFS server side I've tried both virtual e1000 and vmxnet3: Made no difference. As far as I know ESXi only uses TCP offloading for iSCSI. – exo_cw Jul 09 '11 at 08:47
  • The biggest question I have is why adjusting the TCP window would make a difference... My gut tells me it's something to do with fragmenting those big PDUs over TCP. Something in the networking stack that's choking on it. Just can't think of anything that would fit the behavior we are seeing. If window size was an issue we should see latency constraining bandwidth in the middle of a large transfer, not the beginning, but it's always the first packet of the RPC call... tough one. – Ryan Jul 12 '11 at 23:51

I have what looks like the same issue using ESXi 4.1 U1 and CentOS VMs. The hosts are Dell R610s; storage is an EMC Isilon cluster.

Were you by any chance using VLANs? I found that using a VLAN on the VMkernel port for storage caused the 4000-5000 ms 'hangs' for all storage traffic on the VM host. However, if I move the VMkernel port off the VLAN so it receives untagged packets, I don't see the issue.

The simple setup below will cause the problem on my network:

1) Install ESXi 4.1 U1 on a server or workstation (both exhibited the issue when I tried).
2) Add a VMkernel port on a VLAN.
3) Add an NFS datastore (mine is on the same VLAN, i.e. the Isilon receives tagged packets).
4) Install 2 CentOS 5.5 VMs, one with ioping.
5) Boot the VMs into single-user mode (i.e. no network, minimum services).
6) Run ioping on one machine so it's writing to its virtual disk.
7) Run dd or somesuch on the other machine to write 100 MB of data to /tmp or similar.

More often than not I see both VMs freezing for 4-5 seconds.

I'd be really interested to see if anyone else has seen something similar.

Nick
  • Welcome to Server Fault! This is an old question. If its answers don't help you directly then you should ask a new question by clicking the [Ask Question](http://serverfault.com/questions/ask) button. – user9517 Dec 17 '11 at 12:53
  • Yes, of course I'm using tagged VLANs. As I'm using them everywhere I didn't even think of them as a potential source of this problem. I am going to try to reproduce this on an untagged port. – exo_cw Dec 17 '11 at 13:02
  • I can reproduce this problem on an untagged port as well, no VLANs involved on that host at all. – exo_cw Dec 29 '11 at 10:03
  • I was just trying again and I see the problem on the untagged port too; it's a little less frequent, maybe that's why I missed it. Sorry for the bum steer. I can't see the problem on Win7 64-bit using Iometer, plus it seems I can browse the C: drive while the other Linux VMs are hung. I'm going to try with CrystalDiskMark. – Nick Dec 29 '11 at 11:59
  • Actually I'd be interested to see your results with Iometer on Win7 x64. It measures latency, but the highest overall figure I got was 300 ms using the 4k read test, not 4000+ ms. – Nick Dec 29 '11 at 12:12

We had exactly the same problem two weeks ago: ESX 4.1 U1 and a NetApp FAS3170 + NFS datastores. RHEL5 VMs hang for 2 or 4 seconds and we saw very high spikes in the Virtual Center performance console.

I asked the network guy to check the configuration, and the problem was on the Cisco switch. We have two Ethernet links that were configured as an EtherChannel on the NetApp side but not on the Cisco side. He created a static EtherChannel on the Cisco and now it works fine. To identify this kind of problem, shut down all the ports except one between the filer and the switch. Leave just one port alive and see how things go.

The second thing we did was to remove flow control on the switch and the filer, because we suspected it of sending pause frames.

Laurent

How does your DNS look? Is your /etc/resolv.conf correct? The default timeout is 5 seconds.

From man resolv.conf

timeout:n
                 sets the amount of time the  resolver  will  wait  for  a
                 response  from  a  remote name server before retrying the
                 query via a different name server.  Measured in  seconds,
                 the default is RES_TIMEOUT (currently 5, see <resolv.h>).

Try adding options timeout:3 to your /etc/resolv.conf and then run your fsync tests again.

Joseph Kern
  • I tried adding this on the NFS server (OpenIndiana in this case) and on the ESXi host. Unfortunately this does not make a difference. I can resolve the server & guest IP just fine. – exo_cw Jul 06 '11 at 15:45
  • looks like you filtered out all traffic not related to the nfs stream, we might need to see more! – tony roth Jul 06 '11 at 15:56
  • @tony roth: Actually that is the whole traffic at that time. I tested that on a separate vSwitch with just the host and the NFS-server on it. – exo_cw Jul 06 '11 at 16:12
  • Can you dump DNS with wireshark? – Joseph Kern Jul 06 '11 at 16:48
  • @Joseph Kern: I've just analyzed the capture files again: There was no DNS traffic at all during my captures. The NFS datastore is mapped by IP on the ESXi host. DNS works fine on the ESXi and the NFS server, I tested forward & reverse lookup of all involved IPs. Right now I have no reason to believe that DNS is the cause of this. – exo_cw Jul 06 '11 at 18:47
  • was wondering if it was even using dns, so that answers that! – tony roth Jul 06 '11 at 19:33

Grasping at straws here, but what NICs are you using in these servers? The Stack Overflow sysadmins have had weird networking issues with Broadcom NICs that went away when they switched to Intel NICs: http://blog.serverfault.com/post/broadcom-die-mutha/

zippy
  • The last tests were done on a vSwitch only, no physical network involved (e1000 and vmxnet3: made no difference). But I've also tested this on Intel 82574L, Intel 82576 and Intel 82567LF-3, all showing the problem. I did not find any hardware yet where I can't reproduce this. – exo_cw Jul 09 '11 at 08:52

Here is another guess... Is IPv6 enabled on the ESXi host? If yes, try turning it off. In my experience, if your entire network is not properly configured for IPv6 (i.e. RADV, DHCP6, DNS, reverse DNS) it may be a problem for some services. Also, make sure it is off on the NFS server.

dtoubelis
  • IPv6 was already disabled on the ESXi host. I've disabled IPv6 on the NFS server (ifconfig -a6 is empty now), but it doesn't make a difference: It shows the same problems. – exo_cw Jul 12 '11 at 07:04