
I'm planning to (privately) deploy a server that will be hammered with random I/O against files ranging from 100MB to 50GB. Request sizes will range from 128KB to 4MB. The read/write profile will be roughly 50:50, leaning slightly toward reads.

Which filesystem can handle this load best? For now I've opted for XFS, but which tunables should I look into?

Thanks

leto
  • You mean other than getting an SSD? The file system makes a LOT less difference here than the SSD ever would. – TomTom Mar 04 '11 at 15:24
    That would be kind of a problem, when the storage is a 14TB RAID ;-) – leto Mar 04 '11 at 15:33

3 Answers


The requirements and constraints:

  • 50:50 read:write ratio
  • Files being written will range from way larger than the block size to vastly larger than the block size.
  • Individual requests will range from 128KB to 4MB
  • On Linux
  • The file-system will be pretty large, at 14TB.

Unknowns that would help:

  1. Whether or not the random I/O is within files, or is purely based on whole files being read and written in 128KB-4MB chunks
  2. The frequency of file updates.
  3. Concurrency: The frequency of parallel read/write operations (I/O ops).

Sequential I/O

If the 50:50 ratio is represented by reading and writing whole files, and pretty big files at that, then your access patterns are more sequential than random as far as the filesystem is concerned. Use an extent-based filesystem to keep allocations sequential for best performance. Since the files are so large, read-ahead will provide significant performance boosts if supported by hardware (some RAID controllers provide this).
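If your RAID controller does not do read-ahead for you, the kernel's own block-device read-ahead can also be raised. A minimal sketch, assuming the array is visible as /dev/sdb (a placeholder device name):

    # Show the current read-ahead setting, in 512-byte sectors
    blockdev --getra /dev/sdb

    # Raise read-ahead to 4MB (8192 * 512 bytes) for large sequential reads
    blockdev --setra 8192 /dev/sdb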


Random I/O

This changes if you're planning on doing the read/write activities simultaneously, at which point it does become significantly random. The same applies if you're holding a large number of files open and reading/writing small portions within those files as if it were a database.

One of the biggest misconceptions I run into is the idea that a defragged filesystem performs better than a fragmented one when handling highly random I/O. This is only true in filesystems where the metadata operations suffer greatly on a fragmented filesystem. For very high levels of fragmentation extent-based filesystems can actually suffer more performance degradation than other styles of block management.

That said, this problem only becomes apparent when the I/O access patterns and rate are pushing the disks to their maximum capabilities. With 14TB in the filesystem, that means somewhere between 7 and 50 spindles in the actual storage array, which yields a vast range of capabilities: roughly 630 IOPS for 7x 2TB 7.2K RPM drives (about 90 IOPS per spindle) up to roughly 9,000 IOPS for 50x 300GB 15K RPM drives (about 180 IOPS per spindle). The 7.2K RPM RAID array will hit I/O saturation a lot faster than the 15K RPM array would.

If your I/O operations rate is not pushing your storage limits, the choice of file-system should be based more on overall management flexibility than tweaking the last few percentage points of performance.


However, if your I/O actually IS running your storage flat out, that's when the tweaking starts becoming needed.

XFS:

  • Mount: Set 'allocsize' high, but no larger than 64MB (allocsize=64m). This improves metadata speed for file accesses.
  • Mount: Set 'sunit' to the stripe-size of your RAID array. Can also be set at format time.
  • Mount: Set 'swidth' to the number of drives in your RAID array (or N-1 for R5, N-2 for R6). Can also be set at format time.
  • Format: If you really need that last percentage point, put the filesystem log on a completely separate storage device: -l logdev=/dev/sdc3 (a combined sketch follows this list).
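Put together, a format-and-mount sketch might look like the following. The 64KB chunk size, the 8-data-disk layout, and the device names are assumptions for illustration, not recommendations:

    # Assumed layout: RAID array with a 64KB chunk size and 8 data disks,
    # exposed as /dev/sdb; /dev/sdc3 is a placeholder for a separate log device.
    mkfs.xfs -d su=64k,sw=8 -l logdev=/dev/sdc3 /dev/sdb

    # Mount with a large preallocation size and matching stripe geometry.
    # sunit/swidth are given in 512-byte sectors: 64KB = 128, 128 * 8 = 1024.
    mount -o allocsize=64m,sunit=128,swidth=1024,logdev=/dev/sdc3 /dev/sdb /data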

EXT4:

  • Format: -E stride set to the number of blocks (either 512b or 4K depending on the drive) on a single disk-stripe in the RAID.
  • Format: -E stripe-width set as 'swidth' in XFS
  • Format: As with XFS, the last percentage point of performance can be squeezed out by placing the journal on a completely separate storage device: -O journal_dev /dev/sdc3
sysadmin1138
  • EXT4's stride size is not based on the disk block size, it's based on the filesystem block size. Also, EXT4's stripe width is not the number of data disks, it's the filesystem stride size multiplied by the number of data disks. For example, consider a 9-disk RAID5 array where the RAID has a 64KB chunk size and you plan to use EXT4's default filesystem block size of 4KB. The EXT4 stride size would be 64KB divided by 4KB, or 16. The EXT4 stripe width would be 16 multiplied by 8, or 128. – sciurus Mar 04 '11 at 19:42
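Following that correction, a hedged format sketch for the 9-disk RAID5 example in the comment could look like this; the device names are placeholders:

    # Assumed: 9-disk RAID5 (8 data disks), 64KB chunk size, 4KB filesystem blocks,
    # exposed as /dev/md0; /dev/sdc3 stands in for a separate journal device.
    # stride = 64KB / 4KB = 16; stripe-width = 16 * 8 = 128.

    # Create the external journal device first (optional)...
    mke2fs -O journal_dev /dev/sdc3

    # ...then format the data filesystem with the stripe hints and attach the journal.
    mkfs.ext4 -b 4096 -E stride=16,stripe-width=128 -J device=/dev/sdc3 /dev/md0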

I think the real issue here is not just the filesystem, but the parameters you use with it. One thing that will likely affect performance is the read-ahead size.

But OK, let's talk about names. Besides XFS, I think ext4 will suit your needs. The bottom line is that you need an extent-based filesystem to avoid fragmentation as much as possible. Both XFS and ext4 support delayed allocation IIRC, so both should also increase the chance of merging writes.

regards,

Mulyadi.

user66421

Given the scale of data you have, I think you want to look at a networked cluster filesystem such as Lustre or IBM's proprietary GPFS. These are designed to give high-performance results under demanding workloads like yours.

mattdm