
We have a CentOS 6.4-based server attached to Hitachi HNAS 3080 storage, and we observed the kernel remount the filesystem read-only:

May 16 07:31:03 GNS3-SRV-CMP-001 kernel: [1259725.675814] EXT3-fs (dm-1): error: remounting filesystem read-only

This happened after several I/O errors and all paths to the device reportedly going down:

May 16 07:31:03 GNS3-SRV-CMP-001 multipathd: mpatha: remaining active paths: 0

I have been looking at the sar logs and can see a few very large (around 2 second) await times:

                  DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
07:40:00       dev8-0     17.91    112.04     98.03     11.73      0.00      0.20      0.07      0.12
07:40:00      dev8-16      0.23      1.85      0.00      8.00      0.00      3.71      3.71      0.09
07:40:00      dev8-32     91.50   8338.76   5292.93    148.98      8.38     91.60      9.76     89.35
07:40:00     dev252-0     91.27   8336.91   5292.93    149.34     17.79    194.88      9.79     89.38
07:40:00     dev252-1    674.80   8168.16   5292.93     19.95   1473.53   2183.60      1.32     88.98

The 07:30:00-07:40:00 interval is indeed when the filesystem got remounted read-only. However, even under normal conditions, one repeated observation is that the await time for the underlying devices is much lower than that of the multipath device. For instance:

00:00:00          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
00:10:00       dev8-0     19.27    129.41     78.61     10.80      0.01      0.27      0.16      0.32
00:10:00      dev8-16      0.23      1.80      0.00      8.00      0.00      0.86      0.84      0.02
00:10:00      dev8-32     94.88  10285.16   3363.48    143.86      3.39     35.76      6.83     64.82
00:10:00     dev252-0     94.65  10283.34   3363.48    144.18      3.64     38.47      6.86     64.89
00:10:00     dev252-1    435.06  10087.12   3363.48     30.92    118.42    272.21      1.47     64.12

dev8-0 happens to be the local disk, while dev8-16 (/dev/sdb) and dev8-32 (/dev/sdc) are the underlying ones for dev252-0 (/dev/mapper/mpatha). dev252-1 (/dev/mapper/mpathap1) is a single partition spanning the whole of the multipath device. Here is output from multipath -ll:

mpatha (2521501cbffffffffe96773b50ec30020) dm-0 BlueArc,NAS Platform
size=10T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 9:0:0:0 sdc 8:32 active ready running
`-+- policy='round-robin 0' prio=1 status=active
  `- 8:0:0:0 sdb 8:16 active ready running
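
For reference, the mapping between the dm-N nodes, the mpath names and the underlying disks can be cross-checked with something like:

dmsetup info -c                    # dm name to dm-N / major:minor mapping
dmsetup deps /dev/mapper/mpatha    # underlying devices, listed by major:minor
ls -l /dev/mapper/                 # shows which dm node mpatha and mpathap1 correspond to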

Why should the await time for /dev/mapper/mpathap1 be so much higher than that of /dev/mapper/mpatha or even /dev/sdb or /dev/sdc?

pdp
  • It seems noteworthy that apparently a lot of request merging is happening on the way from `/dev/mapper/mpathap1` to `/dev/mapper/mpatha`. This is also the layer where most of the `await` time seems to be added. Can you check which elevators are used in `/sys/block/mpathap1/queue/scheduler` and `/sys/block/mpatha/queue/scheduler`, possibly switching it to `deadline` or `noop` for comparison? – the-wabbit Jun 01 '15 at 12:27
  • The [I/O scheduler](http://www.redhat.com/magazine/008jun05/features/schedulers/) for `mpatha` (`/sys/block/dm-0/queue/scheduler`) is `noop` and that for `mpathap1` (`/sys/block/dm-1/queue/scheduler`) is `none`. – pdp Jun 02 '15 at 16:07
  • @the-wabbit, the elevator for the underlying devices, `/dev/sd{b,c}`, is `cfq`. Is it required to change them to `noop` as well? I could try to change the elevator for `dm-0` to `deadline` and see if that makes any difference. – pdp Jun 03 '15 at 05:53
  • You need to check what is happening in your SAN. – c4f4t0r Jun 09 '15 at 12:45
  • I strongly suspect that the scheduler's queueing/merging algorithm is responsible for the delay. I would swap the cfq of the underlying devices for noop or deadline just to see if it changes anything. This likely will be unrelated to your all-paths-down issue, though. – the-wabbit Jun 09 '15 at 19:57
  • Sounds like one of your drives might be going bad. Are you able to take your system down and run an `fsck` on it? (obviously, do a backup of the SAN/system first) – CIA Jul 14 '15 at 17:10
  • Anyway, I don't see the sense in partitioning the LUN (logical unit); why would you need another partition on the OS side? – c4f4t0r Jul 15 '15 at 13:40
  • How are you using that device? Are you doing LVM or something similar? I have seen this type of behavior when LVM has not been set up to use the multipath device vs. the physical devices. – Grim76 Jul 15 '15 at 13:29
  • FWIW, I have observed the same kind of behaviour on other types of device mapper devices - specifically with [NSS pools](https://en.wikipedia.org/wiki/Novell_Storage_Services). Merge-able writes do have a higher await (and longer queues) on the `dm` device than on the underlying physical device while read requests and writes without any merging done are mainly unaffected. I do not know yet if this is simply a presentation error due to the way awaits are calculated or actually prolonged response times due to the nature of the queueing / merging algorithm. – the-wabbit Aug 08 '15 at 17:39
  • One of the [Systemtap IO scripts](https://sourceware.org/systemtap/examples/keyword-index.html#IO) could possibly provide you with additional insight into what is going on. io_submit.stp, ioblktime.stp, and biolatency-nd.stp might be good places to start. – Kassandry Sep 06 '15 at 09:29
  • Did you solve this? – gxx Jan 17 '16 at 02:50
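
For anyone wanting to run the scheduler check suggested in the comments above, something along these lines should do it (the dm-0/dm-1 names are the ones from this setup; a change takes effect immediately and can be reverted by echoing the previous value back):

# current elevator of the dm devices and of the underlying paths
cat /sys/block/dm-0/queue/scheduler /sys/block/dm-1/queue/scheduler
cat /sys/block/sdb/queue/scheduler /sys/block/sdc/queue/scheduler

# temporarily switch the underlying paths to deadline (as root) for comparison
echo deadline > /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdc/queue/scheduler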

1 Answer


As user the-wabbit suggests, there is request merging going on. You can see that in the avgrq-sz column (the average request size), which shows a significant increase.

Now 'await' is the time spent in the queue plus the time spent servicing those requests. If a small request, let's call it 'x', is merged with a couple of other requests (y and z, issued after x), then x will

  • wait in the queue to be merged with y
  • wait in the queue to be merged with z
  • wait for (x,y,z) to be completed

This will obviously have a negative impact on the await statistic, mostly because of the way await is calculated, without actually signifying a problem in itself.
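
If you want to watch this directly, the raw counters that sar derives await and avgrq-sz from live in /proc/diskstats. A rough sketch (device names as in this setup, field numbers per the kernel's Documentation/iostats.txt) that prints the completed I/Os, merges and milliseconds of I/O time per device:

# $4 = reads completed,  $5 = reads merged,  $7 = ms spent reading
# $8 = writes completed, $9 = writes merged, $11 = ms spent writing
# await over an interval is roughly delta($7+$11) / delta($4+$8), so requests
# parked in the dm-1 queue waiting to be merged keep adding time there while
# sdb/sdc only ever account for the single merged request.
awk '$3 ~ /^(sd[bc]|dm-[01])$/ {
    printf "%-6s ios=%d merges=%d io_ms=%d\n", $3, $4+$8, $5+$9, $7+$11
}' /proc/diskstats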

Now let's take a look at /dev/sdb (dev8-16). Did you know that you are not using that path? You have two priority groups in your multipath config, one is

status=enabled

and one is

status=active

You probably have

path_grouping_policy    failover

in your configuration (which is the default).
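
If the HNAS presents the LUN as active/active on both ports, a stanza along these lines in multipath.conf (a sketch only, with the vendor/product strings taken from the multipath -ll output above; check the vendor's recommended settings first) would put both paths into one round-robin group instead of the failover arrangement:

devices {
    device {
        vendor                  "BlueArc"
        product                 "NAS Platform"
        # one group with both paths active instead of active/passive failover
        path_grouping_policy    multibus
        path_selector           "round-robin 0"
    }
}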

If you want to prevent I/O errors in case both paths go down, you could try adding:

        features        "1 queue_if_no_path"

to your multipath.conf.
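
Whether the feature is actually active on the running map can then be verified, for example, with:

multipath -ll mpatha     # the features field should show queue_if_no_path
dmsetup table mpatha     # the feature flags appear at the start of the multipath table line

Keep in mind that with queue_if_no_path, I/O to the device will queue (hang) for as long as all paths are down instead of failing, so the filesystem stalls rather than hitting I/O errors.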

Now the real question remains: why do both paths go down?

remote mind