I have a pretty big directory with many cache files which I want to reorganise for max performance (access times).
- 2x 2TB SATA III drives, software RAID 1 (mirroring)
- OS: Ubuntu 12.04 LTS
- filesystem: ext4
- 500 GB of data
- about 16-17 million files
- average file size: 30KB
- filenames are MD5 hashes
Files are accessed (randomly) by PHP/Perl scripts. These scripts generate an absolute path and read the file. There is no directory listing: pretty much just fopen with an absolute path to the file.
Current directory hierarchy is: cacheDir/d4/1d/d41d8cd98f00b204e9800998ecf8427e.dat
So there are 256 first-level subdirectories (d4 in the example) and, under each, 256 second-level subdirectories (1d in the example). On average there are about 200-300 files in each second-level directory.
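For clarity, the current path layout can be sketched like this (a minimal bash sketch, assuming the path is built from the first characters of the MD5 hash, as in the example above):

```shell
#!/usr/bin/env bash
# Build the current 2-level cache path from an MD5 hash:
# first two hex chars -> level 1, next two -> level 2.
hash="d41d8cd98f00b204e9800998ecf8427e"
path="cacheDir/${hash:0:2}/${hash:2:2}/${hash}.dat"
echo "$path"
# cacheDir/d4/1d/d41d8cd98f00b204e9800998ecf8427e.dat
```
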
Problem: when there is a web traffic peak and a lot of fopen calls in cacheDir, iowait grows, slowing down the system, causing very high load and noticeable delays. This high load appears only when files in cacheDir are accessed. If I access other directories/files with the same frequency, the disk and system do just fine.
I was wondering if changing the cache directory structure would improve performance.
Changing to (for example): cacheDir/d/4/1/d/8/d41d8cd98f00b204e9800998ecf8427e.dat
(16 subdirectories at each of the five levels, i.e. 16^5 ≈ 1M leaf directories, and on average about 15-16 files per 5th-level subdir).
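The proposed layout and its fanout can be sketched the same way (a bash sketch; the 17M file count is the rough figure from above):

```shell
#!/usr/bin/env bash
# Build the proposed 5-level cache path: one hex char per level.
hash="d41d8cd98f00b204e9800998ecf8427e"
path="cacheDir/${hash:0:1}/${hash:1:1}/${hash:2:1}/${hash:3:1}/${hash:4:1}/${hash}.dat"
echo "$path"
# cacheDir/d/4/1/d/8/d41d8cd98f00b204e9800998ecf8427e.dat

# Fanout: 16^5 = 1,048,576 leaf directories for ~17M files.
leaves=$(( 16 ** 5 ))
echo $(( 17000000 / leaves ))   # ~16 files per leaf directory
```
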
I know that software RAID 1 on simple desktop SATA III drives is not a speed monster, but maybe there are some good methods for optimising the filesystem?
Please note:
- filesystem has enabled
dir-index
- filesystem is mounted with
noatime
- filesystem was optimised with
e2fsck -Df
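For reference, the first two settings can be verified like this (an administrative sketch; /dev/md0 is an assumed device name for the software RAID 1 array, and the mount point is assumed to contain cacheDir):

```shell
# Confirm dir_index (hashed b-tree directories) is among the filesystem features
sudo tune2fs -l /dev/md0 | grep 'Filesystem features'

# Confirm the mount options include noatime
grep /dev/md0 /proc/mounts
```
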