
We have a new Synology RS3412RPxs that offers iSCSI targets to three Windows 2008 R2 boxes and NFS to one OpenBSD 5.0 box.

Logging into the RS3412 over ssh and reading/writing both small files and 6 GB files with dd at various block sizes shows great disk I/O performance.
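
For the curious, the local tests were along these lines (the volume path is an example, not necessarily our layout):

# Sequential write on the box itself; conv=fdatasync flushes to disk
# so the number reflects the array rather than the page cache.
dd if=/dev/zero of=/volume1/ddtest bs=1M count=6144 conv=fdatasync
# Read the same file back.
dd if=/volume1/ddtest of=/dev/null bs=1M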

Using dd or iometer on the iSCSI/NFS clients, we reach up to 20 Mbps (that's not a typo: twenty megabits per second). We were kinda hoping to make better use of the multiple Gbit NICs in the Synology.
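
On the OpenBSD client, the equivalent test over the NFS mount looked like this (the mount point is a placeholder):

# Sequential write across the wire; /mnt/nas is an assumed mount point.
dd if=/dev/zero of=/mnt/nas/ddtest bs=1m count=1024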

I've verified that the switch and NIC ports are set to gigabit, not auto-negotiate. We've tried with and without jumbo frames, with no difference. I've verified with ping that the MTU is currently 9000. Two firmware upgrades have been deployed.
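
The MTU check was done with don't-fragment pings sized just under the 9000-byte limit (addresses are placeholders):

# From a Windows initiator: 8972 = 9000 - 20 (IP header) - 8 (ICMP header).
ping -f -l 8972 192.168.1.10
# From the Synology (Linux): -M do forbids fragmentation.
ping -M do -s 8972 192.168.1.20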

I am going to try a direct link between the iSCSI target and initiator to rule out switch problems, but what are my other options?

If I break out Wireshark/tcpdump, what do I look for?
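
My plan so far is to capture on the storage-facing interface and go from there (interface name assumed; 3260 and 2049 are the standard iSCSI and NFS ports):

# Full-size packets, written to a file for offline analysis in Wireshark.
tcpdump -i eth0 -s 0 -w iscsi.pcap port 3260
# Or, for the NFS client:
tcpdump -i eth0 -s 0 -w nfs.pcap port 2049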

Alex Holst

2 Answers


As seems to be the common theme here, take another look at the flow control settings on the switch(es). If the switch(es) keep Ethernet counter statistics, check them for a large number of Ethernet PAUSE frames. If you see them, that's probably your problem. In general, disabling QoS on the switch(es) resolves this problem.
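
Since the Synology runs Linux, you can also check pause-frame negotiation and counters from its end over ssh (interface name is an assumption, and not every driver exposes pause counters):

# Current flow control (pause) settings for the port.
ethtool -a eth0
# Pause counters, where the driver provides them.
ethtool -S eth0 | grep -i pause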

joeqwerty
  • I took another look. Flow control was disabled and PAUSE counters were zero on all interfaces. Enabling flow control made the PAUSE counters shoot up by 25% of the packet count. We've identified some hardware that doesn't show the same weak performance, so now we're looking to update NIC drivers and replace certain NICs with more capable ones. QoS was already disabled on the switch. Thanks for your input. – Alex Holst Apr 25 '12 at 13:35
  • Glad to help... – joeqwerty Apr 25 '12 at 16:45

Flows like that suggest to me that the various TCP flow-control mechanisms aren't working right. I've seen problems with Linux kernels talking to post-Vista Windows versions that produce throughputs like that. They tend to show up pretty clearly in Wireshark once you take a look.

The absolute worst possibility is that TCP delayed ACK is completely broken, in which case you'll see a traffic pattern that looks like:

packet
packet
[ack]
packet
packet
[ack]

I've solved that one by applying NIC driver updates to the Windows servers. The smart NICs that ship with some (Broadcom) servers can sometimes fail in interesting ways, and this is one of them.
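
Before and after the driver update, it's worth dumping the TCP offload state on the Windows boxes; disabling chimney offload is a common trial fix (stock 2008 R2 commands, nothing NAS-specific):

rem Show global TCP settings, including chimney offload and autotuning.
netsh int tcp show global
rem Trial change; revert with chimney=automatic if it makes no difference.
netsh int tcp set global chimney=disabled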

A normal traffic pattern is a long run of packets followed by a single ACK.

The other thing to look for is long delays. Suspicious values are 0.2 seconds and 1.0 seconds; they suggest that one side isn't getting what it expects and is waiting for a timeout to expire before replying. Combine the above bad packet pattern with a 200ms delay on the ACK and you get throughput of a whopping 1 MB/s.
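
Wireshark's expert analysis flags most of this for you. With a reasonably recent tshark, the same checks can be scripted (the capture filename is an example):

# Anything Wireshark considers suspect: retransmissions, dup ACKs, zero-window.
tshark -r iscsi.pcap -Y "tcp.analysis.flags"
# ACKs arriving close to the classic 200ms delayed-ACK timer.
tshark -r iscsi.pcap -Y "tcp.analysis.ack_rtt > 0.19"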

Those are the easy-to-notice bad traffic patterns.

I haven't worked with that kind of NAS device, so I don't know how tweakable it is to fix whatever you find.

sysadmin1138
  • Also check out these: http://support.microsoft.com/kb/982383 http://support.microsoft.com/kb/2522766 http://support.microsoft.com/kb/2460971 http://support.microsoft.com/kb/251196 – SpacemanSpiff Apr 20 '12 at 20:22