
I'm building a new database server and am trying to make sure I have the setup as correct as possible. The database storage is 22x 15.7k RPM Western Digital SAS drives in RAID10, in an external SAS enclosure with two RAID controllers. Initially, I created just a single volume on the SAS array, formatted with ext2. I calculated the Stride and Stripe-Width as follows:

  • chunk_size = 128 KB
  • block_size = 4 KB
  • total_disks = 22
  • data_disks = 11
  • stride = chunk_size / block_size = 128 KB / 4 KB = 32 blocks
  • stripe-width = stride * data_disks = 32 * 11 = 352 blocks

I gave the Stride and Stripe-Width values to mkfs.ext2 when I created the filesystem.
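For reference, the mkfs invocation was along these lines (the device path is just a placeholder for the exported volume):

mkfs.ext2 -b 4096 -E stride=32,stripe-width=352 /dev/sdX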

I then tried another setup, which is where my question begins. I created two volumes on the SAS array, each with a different "primary" controller. Once they were exported to the host, I used LVM2 to initialize them as Physical Volumes, created a Volume Group, and then passed '--stripes 2' to lvcreate to stripe across the two volumes. The purpose of this was to spread the I/O load across both controllers, hopefully for higher performance.
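For completeness, the LVM side looked roughly like this (device and volume names are placeholders, not the real ones):

pvcreate /dev/mapper/lun0 /dev/mapper/lun1
vgcreate dbvg /dev/mapper/lun0 /dev/mapper/lun1
lvcreate --stripes 2 --extents 100%FREE --name dblv dbvg   # default stripe size is 64 KB unless -I/--stripesize is given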

There is a section in the "best practices" documentation for the SAS array that states, in reference to LUN segment (chunk) size:

When using a volume manager to collect multiple storage system LUNs into a Logical Volume Manager (LVM) volume group (VG), the I/O stripe width is allocated across all of the segments of all of the data drives in all of the LUNs. The adjusted formula becomes:

LUN segment size = LVM I/O stripe width / (# of data drives/LUN * # of LUNs/VG)

Now, the Volume Group on the SAS array has 22 drives, 11 of which are data drives, and presents two LUNs. The Volume Group on the host has two Physical Volumes (the two LUNs), and both LUNs access the same data drives. So, given a LUN segment size of 128 KB, would I calculate the LVM I/O stripe width as

  • 128 KB = LVM_Stripe_Width / (11 * 2)
  • 128 KB * 22 = LVM_Stripe_Width
  • 2816 KB = LVM_Stripe_Width

or

  • 128 KB = LVM_Stripe_Width / (11 * 1)
  • 128 KB * 11 = LVM_Stripe_Width
  • 1408 KB = LVM_Stripe_Width

Then this leads me into the Stride and Stripe-width question for mkfs.ext2: Is it calculated the same way as in the initial setup, or is it different now because of the LVM layer? Would it become

  • chunk_size = 128 KB
  • block_size = 4 KB
  • total_disks = 2
  • data_disks = 2
  • stride = chunk_size / block_size = 128 KB / 4 KB = 32 blocks
  • stripe-width = stride * data_disks = 32 * 2 = 64 blocks

where total_disks = 2 comes from the '--stripes 2' passed to lvcreate?
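(If that second set of numbers is the right way to look at it, I assume the mkfs invocation would then become something like the following, again using the placeholder names from above:

mkfs.ext2 -b 4096 -E stride=32,stripe-width=64 /dev/dbvg/dblv

with the stripe-width now describing the two-way LVM stripe rather than the 11 data drives.)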

Thanks,

Kendall

  • Sorry for the meta-query, but is there a particular reason you're using ext2, rather than ext3 or ext4? – anastrophe Jul 27 '11 at 17:30
  • Database storage doesn't need journaling, so I tossed ext3/4. I'm aware I could simply create an ext3/4 filesystem without a journal, but I'm not sure what the point of that would be -- by that I mean, I'm not aware of significant improvements in those filesystems that would favor them over ext2. – Kendall Jul 27 '11 at 17:43
  • I don't think this works the way you think it does: by default ext3 doesn't journal data changes, only filesystem metadata changes, which the database is not journaling for you. Also, ext4 has very different disk allocation characteristics, which may be of note. I don't know whether ext2 supports online resize; ext3/4 do. – mtinberg Jul 27 '11 at 18:04
  • A recent discussion with a gentleman in #postgres is leading me to reconsider ext3 in writeback mode. Regardless of ext2/3/4, though, the stripe/stride/LVM I/O question remains the same. – Kendall Jul 27 '11 at 18:13
  • There are a LOT more features in ext4 that make it a vastly superior filesystem to ext2 besides journaling. You also will be very sorry the first time such a large ext2 fs needs fscked after a crash without a journal. Even if you disable the journal, ext4 fscks a lot faster, and has much better allocation policies, supports online resize, handles large files more efficiently ( extents ), etc. – psusi Aug 06 '11 at 23:06
  • How exactly did you distribute the 22 drives over the 2 LUNs? It seems to me that doing that is quite useless if your RAID controller and enclosure are worth their money. – jap Sep 06 '11 at 06:46
  • I don't think it is worth fine-tuning at this level. You are currently optimizing for a database that does huge sequential block writes, and you seem more concerned with performance than with availability. Is this really what you want to do? Do you already know the usage pattern of the target database? BTW, which DB product? – Nils Jan 13 '12 at 21:37

2 Answers


I can suggest checking this answer for ideas. I provided some feedback there regarding aligning XFS, and someone commented about ext3. It may give you some clues.

Also, before aligning the file system, make sure your volume is aligned to the stripe size as well. This can be done by properly sizing the LVM metadata area and choosing an appropriate extent size. Here is an example:

pvcreate -M2 --metadatasize 2048K --metadatacopies 2 <raw_device>   # LVM2 metadata format, ~2 MB metadata area, two copies of the metadata
vgcreate --physicalextentsize 256M --autobackup y vg10 <pv_device>   # 256 MB physical extents, VG metadata backups enabled
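Assuming both of your LUNs are initialized that way and added to vg10, the striped LV from your question could then go on top, with the LVM stripe size matched to the array's 128 KB segment (the LV name is just an example):

lvcreate --stripes 2 --stripesize 128 --extents 100%FREE --name dblv vg10   # --stripesize is in KB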
dtoubelis

You are starting your optimisation from the wrong place: you need to start from the page size that your database uses (16 KB for InnoDB (MySQL/MariaDB), 8 KB for PostgreSQL) and optimise from there. I wrote an article on the importance of file system alignment which you may find helpful for this sort of optimisation.
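If in doubt, the page size actually in use is easy to confirm from the database itself, for example (assuming a locally reachable instance and the standard clients):

mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_page_size'"
psql -c "SHOW block_size;"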

Gordan Bobić