
I am rebuilding one drive of an 8-drive RAID6 (using 'md' Linux software RAID), and have noticed that it doesn't seem to be going as fast as it could, presumably because one of the drives is being sent twice as many IOPS as the others:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             155.00     77252.00         0.00      77252          0
sdb             153.00     76736.00         0.00      76736          0
sdc             154.00     77248.00         0.00      77248          0
sde             154.00     77248.00         0.00      77248          0
sdf             164.00     77288.00         0.00      77288          0
sdd             154.00     77248.00         0.00      77248          0
sdg             287.00     83160.00         0.00      83160          0
sdh             146.00         0.00     74240.00          0      74240

(sdh is being rebuilt, and sdg is getting more IOPS than I would expect).

(I used mdadm /dev/md1 --add /dev/sdh4 to add the replacement drive, having failed/removed the existing one).
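
A rough sketch of the full swap, for reference (the old member's device name is just a placeholder here, /dev/sdX4; the rebuild starts automatically on the --add):

mdadm /dev/md1 --fail /dev/sdX4      # mark the old member as failed, if md hasn't already
mdadm /dev/md1 --remove /dev/sdX4    # remove it from the array
# ...swap the physical drive and copy the partition table (see below)...
mdadm /dev/md1 --add /dev/sdh4       # add the replacement partition
cat /proc/mdstat                     # rebuild progress and estimated finish time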

Things that (I think) I have eliminated:

  1. All drives have identical partition layouts (copied using sgdisk).

  2. sda-sdg are identical drives with the same model number (sdh is new).

  3. I've looked at readahead, block size, and multcount on all of the drives and can't spot any difference that sdg might have compared to the others (see the command sketch after this list).

  4. A different rebuild on the same machine had the same problem (sdg being accessed more), so I removed the write intent bitmap beforehand this time, but that hasn't helped.

  5. The board (ASRock P67 Extreme6) has an oddly heterogeneous SATA provision, with two SATA 3Gb/s ports and six SATA 6Gb/s ports (two from the chipset and four from an onboard Marvell SE9120 interface). It is possible that sdg is on the port which is also shared with the eSATA socket, but it claims to be using UDMA6 just like the others, so I can't see what effect that would have.
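
To make points 1, 3 and 4 concrete, the commands involved look roughly like this (device and array names as above; a sketch rather than an exact transcript of what was run):

# point 1: copy the partition table from a healthy member to the new drive,
# then randomise the partition GUIDs so they don't clash
sgdisk --replicate=/dev/sdh /dev/sda
sgdisk -G /dev/sdh

# point 3: compare per-drive settings - readahead, multcount, SATA signalling speeds
for d in /dev/sd[a-h]; do
    echo "== $d =="
    blockdev --getra "$d"              # readahead, in 512-byte sectors
    hdparm -m "$d"                     # multcount
    hdparm -I "$d" | grep -i 'speed'   # supported SATA signalling speeds
done

# point 4: remove the write-intent bitmap before starting the rebuild
mdadm --grow --bitmap=none /dev/md1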

Any ideas why the tps (IOPS) on sdg is twice the others?

UPDATE: Further clarification:

  1. The drives are 3-year-old 3TB Seagate Barracudas (whilst I don't usually get involved with drive-brand anecdotes, one of the 8 drives has failed, and three others (but not sdg) are showing bad signs (unrecoverable errors, multiple reallocated sectors): these aren't the most reliable drives I've ever used). I'm pretty sure they're boring PMR.

  2. Once the RAID had recovered, accesses are now spread evenly between all disks, with a similar number of IOPS for each drive. Thus I would be surprised if link speed was relevant (although md could be doing strange 'optimisations' I suppose).

  3. I didn't get the chance to grab the output of 'iostat -x' before the RAID had finished recovering, but from memory, sdg was at 100% utilisation and had a large request queue size (in the 100s), whilst the others were at 50-60% utilisation and had single-digit request queue sizes.
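
Next time it would be worth keeping a rolling log so those numbers aren't lost - something along these lines (filename is arbitrary):

# sample extended per-device statistics every second, in the background
iostat -dxk 1 >> /root/rebuild-iostat.log &

# and keep an eye on the rebuild itself
watch -n 5 cat /proc/mdstat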

I guess I would need to swap sdg and another drive around to establish whether the problem lies with the controller/md or with the drive itself.

UPDATE #2: Different rebuild, same issue

This time I am rebuilding sdb:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda           13813.00     0.00  184.50    0.00    54.06     0.00   600.11    23.60  114.11  114.11    0.00   2.49  46.00
sdb               0.00 12350.50    0.00   97.50     0.00    48.62  1021.37     0.17    1.70    0.00    1.70   1.31  12.80
sdd           12350.00     0.00   98.00    0.00    48.62     0.00  1016.16     5.47   55.82   55.82    0.00   2.82  27.60
sdc           12350.00     0.00   98.00    0.00    48.62     0.00  1016.16     5.92   60.41   60.41    0.00   2.84  27.80
sde           12350.00     0.00   98.00    0.00    48.62     0.00  1016.16     6.11   62.39   62.39    0.00   3.02  29.60
sdf           12350.50     0.00   97.50    0.00    48.62     0.00  1021.37    14.56  149.33  149.33    0.00   3.92  38.20
sdg           12350.00     0.00   98.00    0.00    48.62     0.00  1016.16     7.18   73.31   73.31    0.00   3.16  31.00
sdh           12350.00     0.00   98.00    0.00    48.62     0.00  1016.16     5.27   53.80   53.80    0.00   2.88  28.20

As you can see, sda is getting a lot more accesses than the others (I am throttling it so that sda isn't hitting 100% utilisation, although it will if I don't). Interestingly, the 'avgrq-sz' (Average request size) of sda is lower, suggesting that the extra accesses are much smaller. Now I just need to find a way of working out what they are!
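
For the throttling, the usual knob is md's global resync speed limit; and one way of working out what the extra requests are would be blktrace, which logs every request reaching the block layer (size, offset, and the thread that issued it). A rough sketch, with arbitrary output filenames:

# throttle the resync (value in KB/s per device)
echo 50000 > /proc/sys/dev/raid/speed_limit_max

# capture ten seconds of block-layer traffic on the busy member
blktrace -d /dev/sda -w 10 -o sda_trace

# render the capture as readable events, and produce summary statistics
blkparse -i sda_trace -d sda_trace.bin | less
btt -i sda_trace.bin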

jonny5532

1 Answer


My initial guess was that md had identified a problem with sdg, and was trying to pull data off of it "sooner" so that it could be replaced, too.

That's not how md works, though (some hardware controllers might happen to do that - unsure).

Lots of drives in an array slow down rebuilds (pdf) - from a rebuild perspective, fewer drives in the array is "better".

Further exploration leads to both a possible conclusion, and a few follow-on questions:

  • what size are the drives?
  • what type are they - enterprise or desktop?
  • what brand are they - WD, Seagate, Hitachi, other, a mix?
  • what type of drives are in the array - PMR or SMR?

From this review of a Seagate drive, it seems that rebuilds with SMR drives (which are denser-packed) are unusually inconsistent in speed, whereas PMR is more consistent.
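
If the drive type isn't known, the model string reported by smartctl can be checked against the manufacturer's spec sheets; newer kernels also expose host-aware/host-managed SMR drives through sysfs, although drive-managed SMR usually just reports as a normal drive:

smartctl -i /dev/sdg | grep -iE 'model|family'   # model string, to look up against the spec sheets
cat /sys/block/sdg/queue/zoned                   # "none" unless host-aware/host-managed SMR (newer kernels)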

My tentative conclusion is that

  1. different SATA port speeds are not helping here - that, I think, should be obvious to all involved :)
  2. you have either differently-branded drives in the array, or they are very large, or they are not the type (PMR) designed to handle rebuilds "better" - or a mix of the above
warren
  • I may be misunderstanding something, but I can't see why the performance of the drive would have an effect on the number of IOPS being issued to it - sure, the throughput might be lower, but I'd expect a sensible RAID rebuild to require a roughly equal number of reads to all of the drive members. – jonny5532 Feb 16 '16 at 09:24
  • @jonny5532 - I'm going based on what I have found over several hours of hunting to seem to be the most likely response – warren Feb 16 '16 at 14:58