32

Let's say we're using ext4 (with dir_index enabled) to host around 3M files (with an average size of 750 KB), and we need to decide what folder scheme we're going to use.

In the first solution, we apply a hash function to the file and use a two-level folder structure (1 character for the first level and 2 characters for the second level): so if the hash of filex.for is abcde1234, we'll store it at /path/a/bc/abcde1234-filex.for.

In the second solution, we apply a hash function to the file and use a two-level folder structure (2 characters for the first level and 2 characters for the second level): so if the hash of filex.for is abcde1234, we'll store it at /path/ab/de/abcde1234-filex.for.

For the first solution we'll have the scheme /path/[16 folders]/[256 folders], with an average of 732 files per leaf folder (the last-level folder, where the file will reside).

With the second solution we'll have /path/[256 folders]/[256 folders], with an average of 45 files per folder.
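
For concreteness, here's a rough shell sketch of how each layout maps a hash to a path. md5sum is only a stand-in hash function here; for the second option the sketch takes the two characters right after the first pair, and whichever fixed positions you pick, the fan-out stays 256 × 256:

```bash
# Stand-in hash; any hex digest works the same way.
hash=$(md5sum filex.for | cut -d' ' -f1)

# Option 1: 1-char first level, 2-char second level -> 16 x 256 leaf folders
echo "/path/${hash:0:1}/${hash:1:2}/${hash}-filex.for"

# Option 2: 2-char first level, 2-char second level -> 256 x 256 leaf folders
echo "/path/${hash:0:2}/${hash:2:2}/${hash}-filex.for"
```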

Considering that we're going to write/unlink/read files (but mostly read) from this scheme a lot (it's basically the nginx caching system), does it matter, in a performance sense, which solution we choose?

Also, what are the tools we could use to check/test this setup?

Leandro Moreira
  • Obviously benchmarking will help. But ext4 may be the wrong filesystem for this. I'd be looking at XFS. – ewwhite Aug 13 '16 at 17:12
  • I wouldn't just _look at_ XFS, I'd immediately use it without further ado. The B+tree beats the hash table every time. – Michael Hampton Aug 13 '16 at 18:27
  • Thanks for the tips, benchmarking is a little bit hard though. I tried `hdparm -Tt /dev/hdX` but it might not be the most appropriate tool. – Leandro Moreira Aug 13 '16 at 19:32
  • No, `hdparm` is not the right tool; it is a check of the raw performance of the block device, not a test of the file system. – HBruijn Aug 14 '16 at 06:54

5 Answers

28

The reason one would create this sort of directory structure is that filesystems must locate a file within a directory, and the larger the directory is, the slower that operation.

How much slower depends on the filesystem design.

The ext4 filesystem uses a B-tree to store directory entries. A lookup in this tree is expected to take O(log n) time, which most of the time is faster than the naive linear table that ext3 and previous filesystems used (and when it isn't, the directory is too small for it to really matter).
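
If you want to double-check that hashed directory lookups are actually active on the ext4 filesystem in question, the feature list will show dir_index; a quick check, with /dev/sdXN as a placeholder for the real device:

```bash
# The "Filesystem features" line should include dir_index.
# Replace /dev/sdXN with the block device that backs /path.
tune2fs -l /dev/sdXN | grep -i 'Filesystem features'
```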

The XFS filesystem uses a B+tree instead. The advantage of this over a hash table or plain B-tree is that any node may have b children, where in XFS b varies and can be as high as 254 (or 19 for the root node; these numbers may be out of date). This gives you a time complexity of O(log_b n), a vast improvement.

Either of these filesystems can handle tens of thousands of files in a single directory, with XFS being significantly faster than ext4 on a directory with the same number of inodes. But you probably don't want a single directory with 3M inodes, as even with a B+tree the lookup can take some time. This is what led to creating directories in this manner in the first place.

As for your proposed structures, the first option you gave is exactly what is shown in nginx examples. It will perform well on either filesystem, though XFS will still have a bit of an advantage. The second option may perform slightly better or slightly worse, but it will probably be pretty close, even on benchmarks.
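
If you end up managing such a tree yourself rather than letting nginx build it (nginx normally creates its cache directories on its own), pre-creating the first option's skeleton is cheap; a rough sketch:

```bash
# Pre-create the 16 x 256 directory skeleton of the first option under /path.
# Only needed if your own code manages the tree.
for a in {0..15}; do
  for b in {0..255}; do
    mkdir -p "/path/$(printf '%x' "$a")/$(printf '%02x' "$b")"
  done
done
```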

Michael Hampton
  • And for either XFS or ext4, the hardware you put the filesystem on will have a huge impact on performance. A slow 5400-rpm SATA drive can do about 50 random IO operations/sec, a good 15,000-rpm SAS drive can do a few hundred, and an SSD will likely be bandwidth-limited and might get a few million random IO operations/sec if not more. – Andrew Henle Aug 14 '16 at 16:37
  • Strictly speaking, O(log_b n) for fixed b is the same complexity as O(log n). But to the OP, the actual constants would matter. – Hagen von Eitzen Aug 14 '16 at 17:30
  • Unless there's something wrong with my filesystem, ext4 cannot handle 10,000 files in a single directory. Doing a simple `ls -l` takes a full minute if the directory has dropped off the inode cache. And when it is cached, it still takes over a second. This is with an SSD and a Xeon with tons of RAM on a fairly low traffic web server. – Abhi Beckert Nov 05 '18 at 05:06
  • @AbhiBeckert Was it upgraded from ext3? If so, try making a new directory and move the files to it. – Michael Hampton Nov 05 '18 at 12:32
  • @Hampton No. it’s a (fairly) recently setup server on modern hardware. I’ve been working on the issue with our sysadmin/data centre for a couple months. We’re paying thousands of dollars per month to lease the server and not getting acceptable performance out of it. It’s looking like the only option is to move to a new directory structure - perhaps using hashes instead of dates for filenames to spread it out more evenly. – Abhi Beckert Nov 13 '18 at 03:01
  • @AbhiBeckert I can only suggest that you switch to XFS and adopt some sort of directory structure as described above. – Michael Hampton Nov 13 '18 at 03:18
  • @MichaelHampton it looks like the issue I'm having is due to Ext's directory database not completely being pruned after a child is deleted (unless you run a slow operation that requires unmounting the disk). The directory in question doesn't have many files in it now, but it may have historically. I haven't been able to find a good description of the issue, only vague references to it unfortunately. – Abhi Beckert Nov 16 '18 at 00:42
  • So to look up a file by name, in either option the fs will have to perform 3 lookups, 2 of those on very shallow trees and one in a deep tree. If all files are in one directory, the fs has to perform only 1 lookup, in a slightly deeper tree. I don't think it's clear-cut which is faster. – w00t Oct 07 '19 at 08:10
  • Also note that the filesystem may have totally different performance for huge amount of *files* in a single directory vs a huge amount of *(sub)directories* in a single directory. This is because directory contains technically a hardlink from entry ".." to parent directory which files don't need. As a result, if you have huge amount of subdirectories, the total amount of hardlinks to directory is count of subdirectories + 1. That said, we use similar fan-out setup and we use 256 directories per level. Seems to work just fine for over 10 million files. – Mikko Rantalainen Feb 05 '21 at 14:30
  • @AbhiBeckert "The directory in question doesn't have many files in it now, but it may have historically" - here is a ZFS issue that says that ext4 has the same issue that after a directory has many files in it, working with that directory _stays_ slow: https://github.com/openzfs/zfs/issues/4933 – nh2 Aug 07 '22 at 14:05
5

In my experience, one of the scaling factors is the size of the inodes given a hash-name partitioning strategy.

Both of your proposed options create up to three inode entries for each created file. Also, 732 files will still produce a directory inode that is less than the usual 16 KB. To me, this means either option will perform the same.
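
If you want to check that on a populated tree, a directory's own on-disk size (the space taken by its entries, not by the files inside it) is visible directly; a minimal sketch, with /path/a/bc standing in for one real leaf folder:

```bash
# Prints the size of the directory's entry table itself, not of its files.
# Replace /path/a/bc with an actual, populated leaf folder.
stat -c 'directory entries use %s bytes' /path/a/bc
```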

I applaud you on your short hash; previous systems I've worked on have taken the sha1sum of the given file and spliced directories based on that string, a much harder problem.

sysadmin1138
  • What makes the use of SHA1 sums (and other, longer hash sums) "a much harder problem"? It's unwieldy for human users, yes, but it's all the same to the OS, file system, and other programs. – kbolino Sep 18 '17 at 16:51
4

Certainly either option will help reduce the number of files in a directory to something that seems reasonable, for XFS, ext4, or whatever file system. It is not obvious which is better; you would have to test to tell.

Benchmarking with your application, simulating something like the real workload, is ideal. Otherwise, come up with something that simulates many small files specifically. Speaking of that, here's an open-source one called smallfile. Its documentation references some other tools.

hdparm doing sustained I/O isn't as useful. It won't show the many small I/Os or giant directory entries associated with very many files.
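
If you want a quicker first pass before wiring up an application-level test, something like fio can approximate a read-heavy, many-files workload. This is only a sketch; fio isn't named above, and the directory, sizes, and job counts are placeholders to adapt:

```bash
# Lays out and then randomly reads many ~750 KB files under a test directory.
# /path/fio-test is a placeholder on the filesystem you want to measure.
fio --name=manyfiles --directory=/path/fio-test \
    --rw=randread --bs=64k --filesize=750k --nrfiles=1000 \
    --numjobs=8 --direct=1 --group_reporting \
    --runtime=60 --time_based
```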

John Mahowald
1

One of the issues is the way the folder will be scanned.

Imagine a Java method which runs a scan on the folder.

It will have to allocate a large amount of memory and deallocate it in a short period of time, which is very heavy for the JVM.

The best way is to arrange the folder structure so that each file is in a dedicated folder, e.g. year/month/day.

The way a full scan is done is that for each folder there's one run of the function, so the JVM will exit the function, deallocate the RAM, and run it again on another folder.

This is just an example, but in any case having such a huge folder makes no sense.
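
The same per-folder idea works outside Java as well; a rough shell sketch, assuming a /path/year/month/day layout, that only ever lists one leaf folder at a time:

```bash
# Walk the tree one leaf directory at a time so no single listing has to
# hold millions of entries; here it just counts the files per day folder.
for day in /path/*/*/*/; do
  printf '%s %s\n' "$day" "$(find "$day" -maxdepth 1 -type f | wc -l)"
done
```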

Andrew Smith
  • You're assuming Java and scanning the folder. Neither is mentioned in the question, and there are other ways to process the folder in Java besides scanning it. – user207421 Aug 14 '16 at 01:48
1

I've been having the same issue, trying to store millions of files on an Ubuntu server in ext4. I ended up running my own benchmarks, and found that a flat directory performs way better while being way simpler to use:

[benchmark results chart]

Wrote an article.

Hartator
  • That is definitely not the expected result. Before you go with this or recommend it, you should look deeper into why you got this unexpected result. – Michael Hampton Dec 22 '18 at 04:14
  • Yeah, there's definitely a bug somewhere here, this is simply impossible. Also, the "read" for "deep directory structure" is so obviously unrealistic. – Íhor Mé Nov 30 '20 at 19:45
  • I agree. We need more information what was exactly benchmarked here. – Mikko Rantalainen Feb 05 '21 at 14:35
  • I do find this behaviour pretty reasonable for a b-tree algorithm. It works always bad on small data sets and very well on millions. Remember you access another directory, you always have two extra disk accesses (inode + toplevel btree). If you go down into a 2 level hierarchy it would be much slower. – Lothar Nov 14 '21 at 00:57
  • The linked article was updated; the original benchmark was flawed because the files weren't actually written. There are now more graphs in the article some of which find deep directory structures to be faster. – nh2 Aug 07 '22 at 14:08