
Problem

A Linux iSCSI initiator is seeing high service times when writing to a NetApp FAS target that exposes a number of LUNs. Where should I look for the cause, and how can I resolve this?

Reproduction

I am using iostat and sar from the sysstat package to calculate "await" - the average time a request spends waiting:

dd if=/dev/urandom of=test bs=8K count=1000000 & iostat -xdm 5 sdy dm-26
[...]
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdy               0.00  1115.60    0.10   47.50     0.00     7.94   341.68     1.63   34.05   2.00   9.52
dm-26             0.00     0.00    0.10 2032.90     0.00     7.94     8.00   328.10  161.39   0.05   9.52

In this scenario, I would expect "await" to be an order of magnitude lower than the value iostat is measuring. The 10 seconds' worth of network traffic transmitted during the sample period above are available on CloudShark. dm-26 is the device-mapper device hosting the file system (a single-disk NSS volume) on top of the "physical" device sdy.
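To see where the time on sdy is actually spent, the block layer itself can be traced with blktrace/btt, which breaks await down into queueing time and device service time. A sketch, assuming the blktrace package is installed and /dev/sdy is the device under test:

```shell
# Capture 10 seconds of block-layer events on the iSCSI disk
blktrace -d /dev/sdy -w 10 -o sdy_trace

# Merge the per-CPU traces into one binary file, then summarize
# per-request phases. A large D2C (dispatch-to-completion) with a small
# Q2D points at the target or the network; a large Q2D points at
# queueing on the initiator side.
blkparse -i sdy_trace -d sdy_trace.bin > /dev/null
btt -i sdy_trace.bin | grep -E 'D2C|Q2C|Q2D'
```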

Environment

Initiator and Target are placed in the same subnet. The initiator host has the IP of 192.168.20.72, the target is 192.168.20.33, traffic is switched 1GbE, Jumbo Frames are enabled (and confirmed to be in use via a network trace), iSCSI Immediate Data is enabled (and in use as per the mentioned trace), digests are not enabled.
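The jumbo-frame claim can be double-checked end to end from the initiator; a sketch, assuming eth1 is the iSCSI-facing interface (substitute yours):

```shell
# Confirm the local MTU is set to 9000
ip link show eth1 | grep -o 'mtu [0-9]*'

# Confirm a full-size frame traverses the path to the target without
# fragmentation: 8972 = 9000 MTU - 20 (IP header) - 8 (ICMP header)
ping -M do -s 8972 -c 3 192.168.20.33
```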

iSCSI session information:

iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-873.suse
Target: iqn.1992-08.com.netapp:sn.151745715
        Current Portal: 192.168.20.33:3260,2003
        Persistent Portal: 192.168.20.33:3260,2003
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: tcp
                Iface Initiatorname: iqn.2015-06.de.example.dom:01:gw-cl-07
                Iface IPaddress: 192.168.20.72
                Iface HWaddress: <empty>
                Iface Netdev: <empty>
                SID: 1
                iSCSI Connection State: LOGGED IN
                iSCSI Session State: LOGGED_IN
                Internal iscsid Session State: NO CHANGE
                *********
                Timeouts:
                *********
                Recovery Timeout: 120
                Target Reset Timeout: 30
                LUN Reset Timeout: 30
                Abort Timeout: 15
                *****
                CHAP:
                *****
                username: <empty>
                password: ********
                username_in: <empty>
                password_in: ********
                ************************
                Negotiated iSCSI params:
                ************************
                HeaderDigest: None
                DataDigest: None
                MaxRecvDataSegmentLength: 262144
                MaxXmitDataSegmentLength: 65536
                FirstBurstLength: 65536
                MaxBurstLength: 65536
                ImmediateData: Yes
                InitialR2T: No
                MaxOutstandingR2T: 1
                ************************
                Attached SCSI devices:
                ************************
                Host Number: 3  State: running
                scsi3 Channel 00 Id 0 Lun: 0
                        Attached scsi disk sdb          State: running
                scsi3 Channel 00 Id 0 Lun: 1
                        Attached scsi disk sdc          State: running
                scsi3 Channel 00 Id 0 Lun: 10
                        Attached scsi disk sdl          State: running
                scsi3 Channel 00 Id 0 Lun: 11
                        Attached scsi disk sdm          State: running
                scsi3 Channel 00 Id 0 Lun: 12
                        Attached scsi disk sdn          State: running
                scsi3 Channel 00 Id 0 Lun: 13
                        Attached scsi disk sdo          State: running
                scsi3 Channel 00 Id 0 Lun: 14
                        Attached scsi disk sdp          State: running
                scsi3 Channel 00 Id 0 Lun: 15
                        Attached scsi disk sdq          State: running
                scsi3 Channel 00 Id 0 Lun: 16
                        Attached scsi disk sdr          State: running
                scsi3 Channel 00 Id 0 Lun: 17
                        Attached scsi disk sds          State: running
                scsi3 Channel 00 Id 0 Lun: 18
                        Attached scsi disk sdt          State: running
                scsi3 Channel 00 Id 0 Lun: 19
                        Attached scsi disk sdu          State: running
                scsi3 Channel 00 Id 0 Lun: 2
                        Attached scsi disk sdd          State: running
                scsi3 Channel 00 Id 0 Lun: 20
                        Attached scsi disk sdv          State: running
                scsi3 Channel 00 Id 0 Lun: 21
                        Attached scsi disk sdw          State: running
                scsi3 Channel 00 Id 0 Lun: 22
                        Attached scsi disk sdx          State: running
                scsi3 Channel 00 Id 0 Lun: 23
                        Attached scsi disk sdy          State: running
                scsi3 Channel 00 Id 0 Lun: 24
                        Attached scsi disk sdz          State: running
                scsi3 Channel 00 Id 0 Lun: 25
                        Attached scsi disk sdaa         State: running
                scsi3 Channel 00 Id 0 Lun: 26
                        Attached scsi disk sdab         State: running
                scsi3 Channel 00 Id 0 Lun: 27
                        Attached scsi disk sdac         State: running
                scsi3 Channel 00 Id 0 Lun: 28
                        Attached scsi disk sdad         State: running
                scsi3 Channel 00 Id 0 Lun: 3
                        Attached scsi disk sde          State: running
                scsi3 Channel 00 Id 0 Lun: 4
                        Attached scsi disk sdf          State: running
                scsi3 Channel 00 Id 0 Lun: 5
                        Attached scsi disk sdg          State: running
                scsi3 Channel 00 Id 0 Lun: 6
                        Attached scsi disk sdh          State: running
                scsi3 Channel 00 Id 0 Lun: 7
                        Attached scsi disk sdi          State: running
                scsi3 Channel 00 Id 0 Lun: 8
                        Attached scsi disk sdj          State: running
                scsi3 Channel 00 Id 0 Lun: 9
                        Attached scsi disk sdk          State: running

Other observations

For one reason or another, the dm device mapping to the "physical" LUN shows a considerable increase in await whenever write requests are merged in the request queue. But my question is really about the await on the underlying device: the NetApp FAS should simply put all write requests into its NVRAM and acknowledge them almost instantaneously, even for synchronous writes, so I should never see await values above 5 ms as long as the network link is not saturated (it isn't) and the NVRAM is not under back pressure (it isn't - the FAS is currently not handling any other load at all).
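One cheap experiment to test the merge correlation is to disable request merging on the underlying device and re-run the reproduction; a sketch, assuming the sysfs queue attributes of this 3.0 kernel:

```shell
# 0 = merging enabled, 1 = only simple one-hit merges, 2 = no merging
cat /sys/block/sdy/queue/nomerges

# Disable merging entirely for the duration of the test
echo 2 > /sys/block/sdy/queue/nomerges

# ... re-run the dd/iostat reproduction, then restore the default
echo 0 > /sys/block/sdy/queue/nomerges
```

If await on sdy stays high with merging off, the merges are a symptom rather than the cause.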

"await" times are considerably lower for read operations, even for random reads. The visualized 10-second sysstat data from an interval in which iozone was running its random read/write test (a single-threaded automatic-type run with O_DSYNC enabled, 8K block size) shows the effect (see the sysstat data visualization): the first half of the graph is random reads, running at ~2-4 kIOPS and ~3 ms latency. In the second half the workload switches to random writes, await rises to >10 ms, and IOPS drop to ~100 (the load is latency-bound and single-threaded, so IOPS are inversely proportional to latency).
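The iozone phase should be reproducible with fio, which also reports per-I/O completion-latency percentiles directly instead of requiring sysstat post-processing. A sketch, assuming fio is available and /mnt/nss is where the dm-26 file system is mounted (the path is an assumption; fio's --sync=1 opens the file O_SYNC, which is close to but not identical to iozone's O_DSYNC):

```shell
# Single-threaded synchronous 8K random writes, mirroring the
# latency-bound iozone phase; compare fio's reported clat with
# iostat's await. At ~10 ms per synchronous write, a single thread
# tops out at roughly 1000 / 10 = 100 IOPS, matching the observation.
fio --name=syncwrite --directory=/mnt/nss --rw=randwrite --bs=8k \
    --size=1g --ioengine=sync --sync=1 --numjobs=1 \
    --runtime=60 --time_based
```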

For some reason, when analyzing the network traffic capture above, Wireshark's "Service Response Time" statistics feature for SCSI fails to recognize most of the write calls: it finds only 19 requests and reports a 3 ms average service response time, where I would expect around 500 calls and an Avg SRT similar to the 34 ms await.
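The same statistic can be pulled from the capture on the command line with tshark; a sketch, where the tap name and command-set argument (0 = SBC, the disk command set) are from memory - `tshark -z help` lists the taps actually available in your build:

```shell
# SCSI service-response-time statistics for the disk (SBC) command set.
# At the ~47.5 w/s iostat reports on sdy, a 10-second capture should
# contain roughly 475 write completions, not 19.
tshark -q -r capture.pcap -z scsi,srt,0
```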

The Linux in use is SUSE SLES 11 SP3, kernel 3.0.101-0.47.55-default.

the-wabbit
  • I just noticed that by cutting the packet lengths to 512 bytes before uploading them to CloudShark, I have crippled the iSCSI write request headers. I am going to re-upload the file with a higher maximum packet length setting as soon as I can get at the original again. – the-wabbit Aug 11 '15 at 20:03

2 Answers


Too long for a comment:

I'm not a Linux guru, but on Windows I disable TCP Large Send Offload on the NIC, as it can introduce latency. It sends fewer, larger packets, which is not recommended for latency-critical I/O.

An official explanation (from the AIX documentation, but the mechanism is the same):

The TCP Large Send Offload option allows the AIX TCP layer to build a TCP message up to 64 KB long and send it in one call down the stack through IP and the Ethernet device driver. The adapter then re-segments the message into multiple TCP frames to transmit on the wire. The TCP packets sent on the wire are either 1500-byte frames for a Maximum Transmission Unit (MTU) of 1500 or up to 9000-byte frames for a MTU of 9000 (jumbo frames).

yagmoth555
  • I have tried disabling TSO and GSO using `ethtool -K eth1 tso off; ethtool -K eth1 gso off`, but there was no substantial change to the numbers here. It might help the troubleshooting process, though - my capture had TCP "packets" of >40K in length, making it more difficult to read. – the-wabbit Aug 14 '15 at 14:00

I will edit this answer based on further information. First, we need to determine whether the high wait times are being observed by the NetApp as well, or just by the host. If the host sees high service times while the NAS claims low service times, the problem lies somewhere between the NAS ports and the server's SCSI stack.

What version of Data ONTAP are you running - 7-mode or clustered Data ONTAP (cDOT)? What are the LUN OS type and the igroup OS type settings? With this information, I can provide commands you can use on the NetApp to check the storage-observed latency.
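Assuming 7-mode (the commands differ under clustered Data ONTAP), the target-side latency can be sampled on the filer console with something like the following; the LUN path is a placeholder:

```
# Per-LUN extended statistics at a 1-second interval
lun stats -o -i 1 /vol/vol1/lun23

# Overall system view, including consistency-point behavior; a CP type
# column showing back-to-back CPs ("B") would explain slow write acks
sysstat -x 1
```

If these show sub-millisecond write latency while the host sees 34 ms, the delay is between the filer's ports and the initiator's SCSI stack.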

Basil