I am rebuilding one drive of an 8-drive RAID6 (using 'md' Linux software RAID), and have noticed that it doesn't seem to be going as fast as it could, presumably because one of the drives is being sent twice as many IOPS as the others:
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 155.00 77252.00 0.00 77252 0
sdb 153.00 76736.00 0.00 76736 0
sdc 154.00 77248.00 0.00 77248 0
sde 154.00 77248.00 0.00 77248 0
sdf 164.00 77288.00 0.00 77288 0
sdd 154.00 77248.00 0.00 77248 0
sdg 287.00 83160.00 0.00 83160 0
sdh 146.00 0.00 74240.00 0 74240
(sdh is being rebuilt, and sdg is getting more IOPS than I would expect).
(I used mdadm /dev/md1 --add /dev/sdh4 to add the replacement drive, having failed/removed the existing one).
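For reference, the replace-and-monitor sequence looks roughly like this (a sketch rather than a verbatim history: the --fail/--remove lines are the standard ones, the old member is assumed to have also appeared as /dev/sdh4 before removal, and the per-drive figures above are ordinary interval iostat output):
# fail and remove the old member, then add the replacement
mdadm /dev/md1 --fail /dev/sdh4 --remove /dev/sdh4
mdadm /dev/md1 --add /dev/sdh4
# watch rebuild progress and per-drive load
watch -n 5 cat /proc/mdstat
iostat 5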
Things that (I think) I have eliminated:
All drives have identical partition layouts (copied using sgdisk).
sda-sdg are identical drives with the same model number (sdh is new).
I've looked at readahead, block size and multcount on all of the drives and can't spot any difference that sdg might have compared to the others (see the comparison sketch after this list).
A different rebuild on the same machine had the same problem (sdg being accessed more), so I removed the write-intent bitmap beforehand this time, but that hasn't helped.
The board (an ASRock P67 Extreme6) has an oddly heterogeneous SATA provision, with two SATA 3 Gb/s ports and six SATA 6 Gb/s ports (two from the chipset and four from an onboard Marvell SE9120 controller). It is possible that sdg is on the port that is shared with the eSATA socket, but it claims to be using UDMA6 just like the others, so I can't see what effect that would have.
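The checks above boil down to something like the following (device names as in the output above; the sgdisk and mdadm lines show the general form of what was done, not the exact commands):
# copy sda's partition table onto the new drive, then randomise its GUIDs
sgdisk --replicate=/dev/sdh /dev/sda
sgdisk --randomize-guids /dev/sdh
# compare readahead, block size and multcount across the members
for d in sda sdb sdc sdd sde sdf sdg sdh; do
    echo "== $d =="
    blockdev --getra --getbsz /dev/$d
    hdparm -a -m /dev/$d
done
# remove the write-intent bitmap before the rebuild
mdadm --grow /dev/md1 --bitmap=none
# map each drive to its controller/port and check the negotiated link speed
for d in sda sdb sdc sdd sde sdf sdg sdh; do
    echo "$d -> $(readlink -f /sys/block/$d)"
done
dmesg | grep -i 'SATA link up'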
Any ideas why the tps (IOPS) figure for sdg is twice that of the others?
UPDATE: Further clarification:
The drives are 3-year-old 3 TB Seagate Barracudas. (Whilst I don't usually get involved in drive-brand anecdotes: one of the 8 drives has failed, and three others (but not sdg) are showing bad signs such as unrecoverable errors and multiple reallocated sectors, so these aren't the most reliable drives I've ever used.) I'm pretty sure they're boring PMR drives.
Now that the RAID has recovered, accesses are spread evenly across all the disks, with a similar number of IOPS on each drive, so I would be surprised if link speed were relevant (although md could be doing strange 'optimisations', I suppose).
I didn't get the chance to grab the output of 'iostat -x' before the RAID finished recovering, but from memory sdg was at 100% utilisation with a large request queue size (in the hundreds), whilst the others were at 50-60% utilisation with single-digit queue sizes.
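Next time it would be easy to log the extended stats as the rebuild runs, e.g. (the filename is just an example):
iostat -dxm 5 | tee -a rebuild-iostat.log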
I guess I would need to swap sdg and another drive around to work out whether the problem lies with the controller/md or with the drive itself.
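If I do swap drives around, recording the serial numbers first makes it easy to confirm which physical drive ended up on which port afterwards (smartctl is from smartmontools; the grep pattern is just illustrative):
for d in sda sdb sdc sdd sde sdf sdg sdh; do
    echo -n "$d: "
    smartctl -i /dev/$d | grep -i 'serial number'
done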
UPDATE #2: Different rebuild, same issue
This time I am rebuilding sdb:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 13813.00 0.00 184.50 0.00 54.06 0.00 600.11 23.60 114.11 114.11 0.00 2.49 46.00
sdb 0.00 12350.50 0.00 97.50 0.00 48.62 1021.37 0.17 1.70 0.00 1.70 1.31 12.80
sdd 12350.00 0.00 98.00 0.00 48.62 0.00 1016.16 5.47 55.82 55.82 0.00 2.82 27.60
sdc 12350.00 0.00 98.00 0.00 48.62 0.00 1016.16 5.92 60.41 60.41 0.00 2.84 27.80
sde 12350.00 0.00 98.00 0.00 48.62 0.00 1016.16 6.11 62.39 62.39 0.00 3.02 29.60
sdf 12350.50 0.00 97.50 0.00 48.62 0.00 1021.37 14.56 149.33 149.33 0.00 3.92 38.20
sdg 12350.00 0.00 98.00 0.00 48.62 0.00 1016.16 7.18 73.31 73.31 0.00 3.16 31.00
sdh 12350.00 0.00 98.00 0.00 48.62 0.00 1016.16 5.27 53.80 53.80 0.00 2.88 28.20
As you can see, sda is getting a lot more accesses than the others (I am throttling the rebuild so that sda isn't hitting 100% utilisation, although it will if I don't). Interestingly, sda's avgrq-sz (average request size) is much lower, suggesting that the extra accesses are much smaller. Now I just need to find a way of working out what they are!
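For reference, md's rebuild rate can be capped with its usual sync-speed limits, and one way to work out what the extra small reads on sda actually are is to trace them at the block layer. A sketch (which throttle knob is in use here is my assumption, and 50000 KB/s is just an example value):
# cap the resync rate so sda isn't pinned at 100% utilisation
echo 50000 > /proc/sys/dev/raid/speed_limit_max     # global limit, KB/s
echo 50000 > /sys/block/md1/md/sync_speed_max       # or per-array limit, KB/s
# trace the requests hitting sda to identify the extra small reads
blktrace -d /dev/sda -o - | blkparse -i - | head -n 200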