Problem
A Linux iSCSI initiator is seeing high service times when writing to a NetApp FAS target that exposes a number of LUNs. Where should I look for the cause, and how can I resolve this?
Reproduction
I am using iostat and sar from the sysstat package to calculate "await" - the average time a request spends queued and being serviced:
dd if=/dev/urandom of=test bs=8K count=1000000 & iostat -xdm 5 sdy dm-26
[...]
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdy 0.00 1115.60 0.10 47.50 0.00 7.94 341.68 1.63 34.05 2.00 9.52
dm-26 0.00 0.00 0.10 2032.90 0.00 7.94 8.00 328.10 161.39 0.05 9.52
In this scenario, the expected value for "await" would be an order of magnitude lower than the one measured by iostat. Ten seconds' worth of network traffic captured during the sample period above are available on CloudShark. dm-26 is the device mapper device hosting the file system (a single-disk NSS volume) that refers to the "physical" device sdy.
Environment
Initiator and target are on the same subnet. The initiator host has the IP 192.168.20.72, the target is at 192.168.20.33. Traffic passes over switched 1GbE; Jumbo Frames are enabled (and confirmed to be in use via a network trace), iSCSI Immediate Data is enabled (and in use as per the mentioned trace), and digests are not enabled.
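For reference, the jumbo-frame check and the traffic capture were along these lines (the interface name eth1 and the capture file name are my assumptions here; the tools themselves are standard):

# Confirm jumbo frames on the initiator-side interface (expect mtu 9000).
ip link show eth1 | grep -o 'mtu [0-9]*'
# Capture 10 s of iSCSI traffic to/from the target for later analysis.
timeout 10 tcpdump -i eth1 -s 0 -w iscsi-sample.pcap host 192.168.20.33 and port 3260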
iSCSI session information:
iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-873.suse
Target: iqn.1992-08.com.netapp:sn.151745715
Current Portal: 192.168.20.33:3260,2003
Persistent Portal: 192.168.20.33:3260,2003
**********
Interface:
**********
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.2015-06.de.example.dom:01:gw-cl-07
Iface IPaddress: 192.168.20.72
Iface HWaddress: <empty>
Iface Netdev: <empty>
SID: 1
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE
*********
Timeouts:
*********
Recovery Timeout: 120
Target Reset Timeout: 30
LUN Reset Timeout: 30
Abort Timeout: 15
*****
CHAP:
*****
username: <empty>
password: ********
username_in: <empty>
password_in: ********
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 262144
MaxXmitDataSegmentLength: 65536
FirstBurstLength: 65536
MaxBurstLength: 65536
ImmediateData: Yes
InitialR2T: No
MaxOutstandingR2T: 1
************************
Attached SCSI devices:
************************
Host Number: 3 State: running
scsi3 Channel 00 Id 0 Lun: 0
Attached scsi disk sdb State: running
scsi3 Channel 00 Id 0 Lun: 1
Attached scsi disk sdc State: running
scsi3 Channel 00 Id 0 Lun: 10
Attached scsi disk sdl State: running
scsi3 Channel 00 Id 0 Lun: 11
Attached scsi disk sdm State: running
scsi3 Channel 00 Id 0 Lun: 12
Attached scsi disk sdn State: running
scsi3 Channel 00 Id 0 Lun: 13
Attached scsi disk sdo State: running
scsi3 Channel 00 Id 0 Lun: 14
Attached scsi disk sdp State: running
scsi3 Channel 00 Id 0 Lun: 15
Attached scsi disk sdq State: running
scsi3 Channel 00 Id 0 Lun: 16
Attached scsi disk sdr State: running
scsi3 Channel 00 Id 0 Lun: 17
Attached scsi disk sds State: running
scsi3 Channel 00 Id 0 Lun: 18
Attached scsi disk sdt State: running
scsi3 Channel 00 Id 0 Lun: 19
Attached scsi disk sdu State: running
scsi3 Channel 00 Id 0 Lun: 2
Attached scsi disk sdd State: running
scsi3 Channel 00 Id 0 Lun: 20
Attached scsi disk sdv State: running
scsi3 Channel 00 Id 0 Lun: 21
Attached scsi disk sdw State: running
scsi3 Channel 00 Id 0 Lun: 22
Attached scsi disk sdx State: running
scsi3 Channel 00 Id 0 Lun: 23
Attached scsi disk sdy State: running
scsi3 Channel 00 Id 0 Lun: 24
Attached scsi disk sdz State: running
scsi3 Channel 00 Id 0 Lun: 25
Attached scsi disk sdaa State: running
scsi3 Channel 00 Id 0 Lun: 26
Attached scsi disk sdab State: running
scsi3 Channel 00 Id 0 Lun: 27
Attached scsi disk sdac State: running
scsi3 Channel 00 Id 0 Lun: 28
Attached scsi disk sdad State: running
scsi3 Channel 00 Id 0 Lun: 3
Attached scsi disk sde State: running
scsi3 Channel 00 Id 0 Lun: 4
Attached scsi disk sdf State: running
scsi3 Channel 00 Id 0 Lun: 5
Attached scsi disk sdg State: running
scsi3 Channel 00 Id 0 Lun: 6
Attached scsi disk sdh State: running
scsi3 Channel 00 Id 0 Lun: 7
Attached scsi disk sdi State: running
scsi3 Channel 00 Id 0 Lun: 8
Attached scsi disk sdj State: running
scsi3 Channel 00 Id 0 Lun: 9
Attached scsi disk sdk State: running
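One detail that stands out in the negotiated parameters is that FirstBurstLength and MaxBurstLength ended up at only 64 KiB, which caps the amount of data sent per command without an R2T round trip. As an experiment (not a confirmed fix), these can be raised on the initiator side and renegotiated at the next login; a sketch using open-iscsi, with the parameter names taken from /etc/iscsi/iscsid.conf and the 256 KiB / 16 MiB values as arbitrary examples that the target must also accept:

# Request larger bursts for this node record, then log out/in to renegotiate.
iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151745715 -p 192.168.20.33:3260 \
         -o update -n node.session.iscsi.FirstBurstLength -v 262144
iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151745715 -p 192.168.20.33:3260 \
         -o update -n node.session.iscsi.MaxBurstLength -v 16776192
iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151745715 -p 192.168.20.33:3260 -u
iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151745715 -p 192.168.20.33:3260 -l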
Other observations
For one reason or another, the dm device mapping to the "physical" LUN shows a considerable increase in await times whenever write requests are merged in the request queue. But my question is really about the awaits on the underlying device: the NetApp FAS should simply put all write requests into its NVRAM and acknowledge them instantaneously, even for synchronous writes, so I should never see awaits higher than 5 ms as long as the network link is not saturated (it isn't) and the NVRAM is not under back pressure (it isn't - the FAS is currently not handling any other load at all).
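To rule the merging behavior in or out, one experiment is to disable request merging on the block device entirely and re-run the test; a sketch, assuming sdy (the value 2 disables all merge attempts, per Documentation/block/queue-sysfs.txt):

# Disable request merging on the underlying device, re-test, then restore.
echo 2 > /sys/block/sdy/queue/nomerges
dd if=/dev/urandom of=test bs=8K count=100000 & iostat -xdm 5 sdy dm-26
echo 0 > /sys/block/sdy/queue/nomerges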
"await" times are considerably lower for read operations, even for random reads. The vizualized 10-second sysstat data from an interval where iozone
was running the random read/write test (single-threaded automatic-type test run with O_DSYNC enabled, 8K block size) is showing the effect:
the first half of the graph are random reads, running at ~2-4 kIOPS and ~3ms latency. In the second half, the workload is switching to random writes, await is rising to >10ms and IOPS are dropping to ~100 (the load is latency-bound and single-threaded, so IOPS are inversely proportional to the latency)
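For anyone who wants to reproduce the latency-bound write phase without iozone, a roughly equivalent fio job is sketched below (fio is my substitution here, and --sync=1 uses O_SYNC, which is close to but not identical with the O_DSYNC semantics of the original test):

# Single-threaded 8K synchronous random writes against the test file;
# with one outstanding I/O, reported IOPS is simply 1000 / await(ms).
fio --name=randwrite-8k --filename=test --rw=randwrite --bs=8k \
    --size=2g --runtime=60 --time_based --ioengine=psync --sync=1 --numjobs=1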
For some reason, when analyzing the network traffic capture above, Wireshark's "Service Response Time" statistics feature for SCSI fails to recognize most of the write calls, finding only 19 requests and reporting a 3 ms average service response time, where I would expect around 500 calls and an Avg SRT similar to the 34 ms await.
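The same statistic can be pulled from the capture on the command line, if your tshark build includes the SCSI SRT tap (command set 0 is SBC, i.e. disk commands; the capture file name is an assumption from the earlier sketch):

# Per-opcode SCSI service response times from the capture;
# the WRITE rows should roughly match iostat's write await.
tshark -q -r iscsi-sample.pcap -z scsi,srt,0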
The Linux in use is SUSE Linux Enterprise Server 11 SP3, kernel 3.0.101-0.47.55-default.