10

I've set up a Linux software RAID level 5 consisting of 4 * 2 TB disks. The array was created with a 64k stripe size and no other configuration parameters. After the initial rebuild I tried to create a filesystem, and this step takes a very long time (about half an hour or more). I tried to create an XFS and an ext3 filesystem; both took a long time. With mkfs.ext3 I observed the following behaviour, which might be helpful:

  • writing the inode tables runs fast until it reaches 1053 (~1 second), then it writes about 50, waits for two seconds, and then the next 50 are written (according to the console display)
  • when I try to cancel the operation with Control+C it hangs for half a minute before it is actually canceled
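
For reference, the array itself was created along these lines (a sketch only; the device names are placeholders, and the 64k value is what mdadm calls the chunk size):

mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=64 /dev/sdb /dev/sdc /dev/sdd /dev/sde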

The performance of the individual disks is very good; I've run bonnie++ on each one separately with write / read values of around 95 / 110 MB/s. Even when I run bonnie++ on every drive in parallel, the values are only reduced by about 10 MB/s. So I'm excluding the hardware and I/O scheduling in general as a problem source.
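
The individual runs were along these lines (a sketch; the mount point and test size are placeholders):

bonnie++ -d /mnt/test-sdb -s 16384 -u root   # 16 GB test file on a filesystem on the single disk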

I tried different configuration parameters for stripe_cache_size and readahead size without success, but I don't think they are that relevant for the file system creation operation.
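
The tuning amounted to commands like the following (a sketch; the values are just examples and /dev/md0 is assumed):

echo 8192 > /sys/block/md0/md/stripe_cache_size   # number of stripe cache entries (default 256)
blockdev --setra 4096 /dev/md0                    # readahead in 512-byte sectors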

The server details:

  • Linux server 2.6.35-27-generic #48-Ubuntu SMP x86_64 GNU/Linux
  • mdadm - v2.6.7.1

Does anyone have a suggestion on how to further debug this?

user9517
Elmar Weber

4 Answers

6

I suspect you're running into the typical RAID-5 small-write problem. For writes smaller than the stripe size, the array has to do a read-modify-write for both the data and the parity. If the write covers a full stripe, it can compute the parity directly from the new data and simply overwrite the old one, without having to read anything back first.
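
A rough way to see the effect from the command line (a sketch only; this writes directly to the array, so only do it before any data is on it, and /dev/md0 plus the assumption that the 64k figure is the per-disk chunk are mine):

# full-stripe writes: 3 data disks x 64k chunk = 192k, parity is computed from the new data alone
dd if=/dev/zero of=/dev/md0 bs=192k count=1000 oflag=direct
# sub-stripe writes: each one triggers a read-modify-write of data and parity
dd if=/dev/zero of=/dev/md0 bs=4k count=1000 oflag=direct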

malcolmpdx
  • That would make sense; am I seeing this correctly? According to the mkfs.ext3 output it writes about 25 inode tables a second. I'm assuming they are smaller than 64k during the initial creation, so a full 64k stripe is written each time. This would mean a 16k write to each disk, so together 25 random 16k writes per second; with a 4kb sector size this means 100 random I/O operations per second, which is about what bonnie++ showed. – Elmar Weber Mar 20 '11 at 15:31
  • Matches the result from bonnie++ on the actual RAID: 335 MB/s read and 310 MB/s write; however, file creation and deletion is only 1/4 of single-disk performance. – Elmar Weber Mar 20 '11 at 16:43
4

I agree that it may be related to stripe alignment. In my experience, creating an unaligned XFS filesystem on a 3*2TB RAID-0 takes ~5 minutes, but if it is aligned to the stripe size it takes ~10-15 seconds. Here is a command for aligning XFS to a 256KB stripe size:

mkfs.xfs -l internal,lazy-count=1,sunit=512 -d agsize=64g,sunit=512,swidth=1536 -b size=4096 /dev/vg10/lv00

BTW, the stripe width in my case is 3 units, which will be the same for you with 4 drives in RAID-5.

Obviously, this also improves FS performance, so you better keep it aligned.
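
Whether the geometry was picked up can be checked after mounting with xfs_info (a sketch; the mount point is a placeholder). With 4k blocks and a 64k chunk across 3 data disks it should report sunit=16 and swidth=48 blocks:

xfs_info /mnt/raid
# the data section should show something like: sunit=16 swidth=48 blks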

dtoubelis
  • Hi, this did not make any difference, I tried: `time mkfs.xfs -l sunit=128 -d agsize=64g,sunit=128,swidth=512 -b size=4096 /dev/md0 -f` which took roughly the same time as mkfs without any parameters – Elmar Weber Mar 22 '11 at 15:37
  • I'm running bonnie++ to see if it makes any performance difference during operation. btw: is there any reason for the agsize parameter? I read the man page but could not deduce the benefit of setting it to a value. – Elmar Weber Mar 22 '11 at 15:43
  • (btw: above command was wrong, correct swidth was 384) – Elmar Weber Mar 22 '11 at 18:21
  • I didn't get any performance boost on mkfs, but the overall performance measured with bonnie++ is much better: File Create/Delete Operations are about 4 times better than before and sequential write speed about 15%. Thanks a lot. – Elmar Weber Mar 22 '11 at 18:22
  • agsize is not really necessary here - mkfs will calculate it automatically (likely by dividing the size of the volume by the number of logical CPUs). It is a leftover from my own setup - I created this volume with some expectation of future configuration changes. – dtoubelis Mar 23 '11 at 03:38
  • Regarding the creation time, that was likely a mistake on my part - I was also playing with agsize and the timing could have been affected by it. I started with agsize=256m based on someone else's recommendation but then realized it did not apply to my case. – dtoubelis Mar 23 '11 at 03:42
  • Elmar, if your stripe size is 64KB then swidth would be 384, however sunit has to be 128 in that case. The command I provided is for stripe size 256K. – dtoubelis Mar 23 '11 at 03:47
3

Your mkfs and subsequent filesystem performance might improve if you specify the stride and stripe width when creating the filesystem. If you are using the default 4k blocks, your stride is 16 (RAID stripe of 64k divided by filesystem block of 4k) and your stripe width is 48 (filesystem stride of 16 multiplied by the 3 data disks in your array).

mkfs.ext3 -E stride=16,stripe-width=48 /dev/your_raid_device
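
To confirm afterwards that the values were stored in the superblock (a sketch, assuming the array is /dev/md0):

tune2fs -l /dev/md0 | grep -i raid
# expected output, roughly:
#   RAID stride:              16
#   RAID stripe width:        48
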
sciurus
0

You should really look at the block group size (the -g option to mkfs.ext*). I know the man page says you can ignore this, but my experience very much shows that the man page is badly wrong on this point. You should adjust the block group size so that your block groups don't all start on the same disk, but instead rotate evenly around all the disks. It makes a very obvious difference to performance. I wrote an article on how to optimise file system alignment which you may find useful.
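
As an illustration only (not taken from the article; the numbers assume 4k blocks, a 64k chunk and 3 data disks, i.e. a 48-block stripe width, and /dev/md0 is a placeholder):

# 32752 is one 16-block chunk less than the default 32768 blocks per group, so each
# successive block group start shifts by one chunk and lands on a different member disk
mkfs.ext3 -b 4096 -g 32752 -E stride=16,stripe-width=48 /dev/md0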

Gordan Bobić