
Although I've browsed some of the questions here, I think every situation is different and may require a totally different solution.

What I have now:

  • Linux software RAID5 on 4x 4TB enterprise HDDs
  • LVM on top with a few volumes
  • The most important one: a 10TB XFS storage volume
  • All set up with default parameters on Debian Wheezy
  • The volume is mounted with options 'noatime,nodiratime,allocsize=2m'
  • About 8GB of RAM free (used for caching, I guess); a quad-core Intel CPU with HT that is not heavily used

This volume stores about 10 million files (at most 20M in the future), mostly between 100K and 2M in size. Here is a more precise distribution of file size buckets (in K) and the number of files in each bucket:

       4    6162
       8      32
      32      55
      64   11577
     128    7700
     256    7610
     512     555
    1024    5876
    2048    1841
    4096   12251
    8192    4981
   16384    8255
   32768   20068
   65536   35464
  131072  591115
  262144 3411530
  524288 4818746
 1048576  413779
 2097152   20333
 4194304      72
 8388608      43
16777216      21

The files are mostly stored at level 7 on the volume, something like:

/volume/data/customer/year/month/day/variant/file

There are usually ~1-2K files inside those folders, sometimes fewer, other times up to 5-10K (in rare cases).

I/O isn't that heavy, but I experience hangs when pushing it a little harder. For example:

  • The application that performs most of the I/O is NGINX, for both reading and writing
  • There are some random reads of 1-2MB/s TOTAL
  • There are some folders where data is continuously written at a rate of 1-2MB/s TOTAL, and all files older than 1h should be periodically removed from them

Running the following cron job once per hour often hangs the entire server for a good few seconds, and may even disrupt the service (the writing of new data), as I/O timeouts are generated:

find /volume/data/customer/ -type f -iname "*.ext" -mmin +60 -delete
find /volume/data/customer -type d -empty -delete

I also observe slow write speeds (a few MB/s) when writing files in the above size ranges. When writing larger files, it goes OK until the write cache fills (obviously) and then speed drops and the server starts hanging in waves.

Now, I am searching for a solution to optimize my storage performance, as I am sure my defaults are not optimal and many things could be improved. Although LVM isn't that useful to me, I wouldn't drop it unless doing so provides a significant gain, because that would mean reinstalling the whole server.

I've read a lot about XFS vs. ReiserFS vs. ext4, but I am quite puzzled. Other servers of mine with a much smaller 2TB RAID1 volume, exactly the same setup, and a significantly heavier workload perform quite flawlessly.

Any ideas?

How should I debug/experiment?

Thanks.

bigfailure
  • I would look at an alternative to **find**, as the way you're using it looks I/O intensive; as you say, it's the find cron that hangs the system. I think find does not run parallel processes; you could pipe the output of find to **xargs** and call **rm**. xargs can spawn multiple processes; have a look here: http://serverfault.com/questions/193319/a-better-unix-find-with-parallel-processing (a sketch follows these comments) – Sum1sAdmin May 16 '16 at 12:25
  • Do you feel safe mounting with the `nobarrier` option as well? – ewwhite May 16 '16 at 12:30
  • @ewwhite - I'd avoid `nobarrier` on XFS. It's been a while since I dug into it, but I strongly suspect XFS relies significantly more on barriers than ext4 does. I know that XFS received a bad reputation for data corruption on Linux a few years back, as that reliance combined with LVM previously ignoring barriers made XFS unreliable. See http://lwn.net/Articles/283168/ (among many others from 8-10 years ago). – Andrew Henle May 16 '16 at 14:24
  • What other RAID configurations have you tried? How does RAID 1+0 perform? – Andrew Henle May 16 '16 at 14:25
  • OP is using software RAID, so there's no stable write cache. Shame. But on regular enterprise operating systems and quality server hardware, it's okay and recommended to mount `nobarrier`. – ewwhite May 16 '16 at 14:25
  • I only tried RAID5, as this is more of a storage server and storage size is the most important thing. However, I suspect many things can still be optimized, hence the question. – bigfailure May 16 '16 at 15:00
  • @Sum1sAdmin: thanks for the idea, I'll try it. But still, why is it so complicated in my setup to find-delete about 2-3K files inside a few folders, with a search range for find of only about 10K files max? Why do those hangs happen in the first place, even when I wrap my find-delete commands in a shell script started with ionice -c 3? Shouldn't they only act in the background? – bigfailure May 16 '16 at 15:07
  • @ewwhite: would `nobarrier` really help performance? In the case of an unlikely power failure or server crash, what data exactly may be affected? – bigfailure May 16 '16 at 15:09
  • @bigfailure RAID5 on slow SATA drives is likely a poor choice given your usage pattern. RAID5 handles large, sequential writes best, and it doesn't handle small, random write operations well at all. When you delete thousands of files at a time, you're making multiple small, random writes to the underlying disk storage. That causes significant IO amplification because of the read-modify-write operation necessary for RAID5 to handle a small write operation. With LVM RAID5 and a default 64kB stripe size, a 4kB write will cause a 64kB read then a 64kB write - plus parity bits. (approximately) – Andrew Henle May 16 '16 at 15:47
  • Also, `find /volume/data/customer -type d -empty -delete` could in theory delete a directory immediately after it's created. I suspect that's not a desired behavior. – Andrew Henle May 16 '16 at 16:13
  • @AndrewHenle: well, in recent Linux versions the default RAID5 stripe size is 512K, not 64K, and indeed this may add up to a lot of R/W. However, this deletion case is actually not such a big problem if alternatives exist. But does this mean that such a large stripe size generally creates tons of overhead for file creation too? What about appending data to a file? – bigfailure May 16 '16 at 16:44
  • @bigfailure In general, larger stripe size on RAID5 is great for large IO operations to the file systems and bad for small IO operations. Creating a file is usually just a handful of IO operations, so that's not too much of a concern. Appending to a file is usually not a concern because the page cache will generally be used to coalesce many small writes into fewer large write operations to the disk, although once again, if you do a lot of small append operations to a lot of files you don't get the benefits of write coalescing and you wind up with a lot of file mod time updates. – Andrew Henle May 16 '16 at 22:39
  • (cont) I'd venture to guess, though, that from the symptoms you're describing, the large number of delete operations your cron job initiates results in a relatively large number of small random writes to the RAID volume, which get amplified through read-modify-write and cause the entire file system to back up. You might want to try running the delete script more often so you don't accumulate as many files to delete all at once. – Andrew Henle May 16 '16 at 22:43
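
To make the suggestions above concrete, here is a minimal sketch combining Sum1sAdmin's xargs idea with a guard against the directory race Andrew Henle points out. GNU find/xargs are assumed, and the parallelism and batch sizes are illustrative starting points, not values tested on this workload:

# Parallel delete: -print0/-0 are filename-safe, -P 4 spawns four rm
# processes, -n 100 passes 100 paths per invocation, -r skips empty input.
find /volume/data/customer/ -type f -iname "*.ext" -mmin +60 -print0 \
  | xargs -0 -r -P 4 -n 100 rm -f

# Directory cleanup guarded against removing a just-created directory:
# a directory's mtime changes when entries are added or removed, so
# -mmin +60 only prunes directories that have been empty for a while.
find /volume/data/customer -mindepth 1 -type d -empty -mmin +60 -delete

Note that more rm processes in parallel also means more concurrent random writes hitting the RAID5 array, so -P may be worth lowering rather than raising here.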

2 Answers


First, XFS is the right choice for this kind of scenario: with XFS it is almost impossible to run out of inodes.

To increase your delete performance, try the following:

  • use the deadline I/O scheduler rather than the (default) cfq
  • use logbsize=256k,allocsize=64k as mount options (in addition to nodiratime,noatime)

To lower the impact of the deletes on other system activity, try running your find script using ionice -c 2 -n 7.
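
A hedged sketch of how these suggestions could be applied; the device names, volume group name, and mount point below are assumptions for illustration, not details taken from the question:

# Switch each RAID member disk to the deadline elevator (sda..sdd assumed):
for d in sda sdb sdc sdd; do
    echo deadline > /sys/block/$d/queue/scheduler
done

# Example /etc/fstab line for the XFS volume with the suggested options
# (then remount, or reboot if a live remount refuses the option change):
# /dev/mapper/vg0-storage  /volume  xfs  noatime,nodiratime,logbsize=256k,allocsize=64k  0  0

# Run the hourly cleanup at a low (but not idle) I/O priority:
ionice -c 2 -n 7 find /volume/data/customer/ -type f -iname "*.ext" -mmin +60 -delete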

Report back on your results!

shodanshok
  • (and nobarrier) ;) – ewwhite May 16 '16 at 15:11
  • Without a power-protected cache, `nobarrier` is not without significant risk. So, for software RAID, I'll avoid it. – shodanshok May 16 '16 at 15:18
  • So bad performance 99% of the time for an unlikely condition? – ewwhite May 16 '16 at 15:19
  • @ewwhite *So bad performance 99% of the time for an unlikely condition?* The entire purpose of using RAID is to ensure data availability. Why risk significant down time and loss of access to that data just for performance when the fix for poor RAID5 performance is a simple "Buy more and/or faster disks." or "Use a faster RAID configuration". – Andrew Henle May 16 '16 at 15:29
  • I have `find` commands placed inside a script set up in cron with `ionice -c 3 /path/to/script.sh` already. Isn't this much the same? I have already set `allocsize=2m` at mount. Is this too big, and why? However, I don't know anything about `logbsize`, so maybe I should try it. – bigfailure May 16 '16 at 15:30
  • I wouldn't `ionice` these scripts; you don't want the I/O ops to back up. Just `nice` the heck out of them so they schedule less frequently on the CPU and hence issue I/O operations at bigger gaps when the system is busy. And use lock files to prevent concurrency (see the sketch after these comments). – symcbean May 16 '16 at 15:44
  • @ewwhite It really depends on how severe the data-loss risk associated with disabled barriers can be. The problem is that in some (albeit rare) circumstances, a missed barrier/sync [can wreak havoc not only on the latest data, but on the filesystem itself](http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F). Ext4 and XFS added some checkpointing to greatly reduce the odds of losing the entire filesystem, but AFAIK they cannot entirely avoid the risk. In some cases, the risk can be taken - but not in most other scenarios. – shodanshok May 16 '16 at 16:42
  • @bigfailure `allocsize` defines the minimum allocation size the filesystem will use for each file - even smaller ones. As you have many small files, try a smaller allocsize - 64k is often the best choice. Alternatively, you can omit this parameter entirely and let the XFS heuristics do their magic. Try both approaches and time them. About `ionice`: can you try to `ionice` the `find` commands directly? Something like `ionice -c 2 -n 7 find ...` Does it change anything? – shodanshok May 16 '16 at 16:46
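
A minimal sketch of symcbean's lock-file-plus-nice idea from the comments above (the lock file path is an assumption; `flock -n` exits immediately if a previous run still holds the lock, preventing overlapping cron runs):

# Serialize the hourly cleanup and lower its CPU priority:
flock -n /run/lock/customer-cleanup.lock \
  nice -n 19 find /volume/data/customer/ -type f -iname "*.ext" -mmin +60 -delete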

I agree with Shodanshok that deadline is probably a good idea, but I'm far from convinced that you should be using XFS here.

find /volume/data/customer/ -type f -iname "*.ext" -mmin +60 -delete

XFS used to be very bad at deleting files - I'm told that most of the bugs in this area have been resolved, but I haven't done any hard benchmarking to confirm this.

it goes OK until the write cache fills (obviously) and then speed drops and the server starts hanging in waves

Hanging? Sounds like you should be adjusting your dirty page ratios (decrease the background ratio, increase the blocking ratio). You should also change dirty_expire_centisecs (up or down - see what makes it faster!) and decrease dirty_writeback_centisecs if the overall load and CPU usage are acceptable.
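
For example, via sysctl; these numbers are starting points to measure against, not recommendations, and the ratio values are percentages of reclaimable memory:

# Start background writeback earlier, but block writers later:
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=60
# Tune these up or down and measure, as described above:
sysctl -w vm.dirty_expire_centisecs=1500
sysctl -w vm.dirty_writeback_centisecs=100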

If the 'find' statements are processing the bulk of the data, then tweaking vfs_cache_pressure would be a good idea. Again, the only way to find the right value is by trial and error, but with a very high fanout and presumably not a lot of reading of the data files, decreasing it should improve cache effectiveness.
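
For example (a starting value only; the default is 100, and values below 100 make the kernel prefer keeping dentries and inodes cached, which helps the hourly directory walk):

sysctl -w vm.vfs_cache_pressure=50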

Note that LVM snapshots will kill the IO throughput.

---- the stuff above applies regardless of the filesystem you choose ----

The most important consideration when you choose a filesystem is how robust you need it to be. If these are all temporary files and you don't mind losing them all in the event of an outage / don't need fast recovery times after an outage, then you shouldn't be using a journalling filesystem at all. But you've not told us much about the data.

Noting the high fanout... the dir_index feature of ext3/4 was explicitly added to give faster, more efficient resolution when a directory contains a large number of files / high turnover of files. I don't know how effective XFS is in this scenario.

ReiserFS is not very well supported anymore.

There are several other things you might want to look at (including UPS, bcache, dedicated journal devices) but then I wouldn't have an excuse to plug a book on the subject.

symcbean
  • It is true that, in single-threaded workloads, ext4 remains about [2x faster than XFS](https://lwn.net/Articles/476263/) during heavy metadata operations. However, with ext3/4, running out of inodes is a real possibility - and a bad one. XFS is generally more versatile when dealing with millions of files. – shodanshok May 16 '16 at 16:54