
Does anyone have a method, formula, etc. that I could use - hopefully based on both current and projected numbers of files - to project the 'right' length of the split and the number of nested folders?

Please note that although it is similar, this isn't quite the same as "Storing a million images in the filesystem". I'm looking for a way to make the theories outlined there more generic.

Assumptions

  • I have 'some' initial number of files. This number would be arbitrary but large. Say 500k to 10m+.
  • I have considered the underlying physical hardware disk IO requirements that would be necessary to support such an endeavor.

Put another way

As time progresses this store will grow. I want the best balance between performance now and performance as my needs increase - say I double or triple my storage. I need to address both current needs and projected future growth, planning ahead without sacrificing too much of today's performance.

What I've come up with

I'm already thinking about using a hash split every so many characters to spread things out across multiple directories and keep the trees even, very similar to what is outlined in the comments on the question above. It also avoids duplicate files, which would be critical over time.
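To make the idea concrete, here is a minimal sketch of the kind of hash split I mean; the SHA-256 choice, two characters per level and three levels are placeholders, not a final design:

```python
import hashlib
import os

def hashed_path(root, data, chars_per_level=2, levels=3):
    """Derive a nested storage path from a content hash.

    Hashing the file contents (rather than the name) also gives
    de-duplication for free: identical files map to the same path.
    """
    digest = hashlib.sha256(data).hexdigest()
    # e.g. 'ab12cd...' with 2 chars / 3 levels -> root/ab/12/cd/ab12cd...
    parts = [digest[i * chars_per_level:(i + 1) * chars_per_level]
             for i in range(levels)]
    return os.path.join(root, *parts, digest)

def store(root, data):
    path = hashed_path(root, data)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```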

I'm sure that the initial folder structure would differ based on what I've outlined, and depending on the initial scale. As far as I can figure there isn't a one-size-fits-all solution here, and it would be horrendously time-intensive to work something out experimentally.

Tim Brigham

2 Answers


Some years ago I started writing a storage system similar to Ceph. Then I discovered Ceph, and since what they had built worked better, I dropped my own development.

During the development process I asked a question similar to yours, but on SA. I did a lot of calculation on handling lots of small files and found that naming files (assuming the names can be anything) by UUID and splitting them 3 levels deep was ample for my needs.

From memory, I used the first 3 characters to form the top level, the next 3 to form level 2, and then the whole UUID as the file name.
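Roughly like this sketch, if it helps; the prefix length and depth are parameters you would tune to your own numbers (the defaults below just mirror what I described from memory):

```python
import os
import uuid

def uuid_path(root, file_uuid, chars_per_level=3, levels=2):
    """Sketch: split a UUID's hex form into directory prefixes,
    keeping the whole UUID as the file name."""
    name = file_uuid.hex  # 32 hex characters, dashes stripped
    parts = [name[i * chars_per_level:(i + 1) * chars_per_level]
             for i in range(levels)]
    return os.path.join(root, *parts, name)

# Example: uuid_path("/data", uuid.uuid4()) -> /data/1f2/a3b/1f2a3b...
```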

My calculation was based on the number of files I wanted, the amount of data stored per drive, and the limits of the filesystem type.
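A back-of-the-envelope version of that calculation looks something like this; every number below is a made-up assumption, so plug in your own targets and compare the results against your filesystem's per-directory limits and your drive capacity:

```python
# Every number here is a made-up assumption - substitute your own targets.
total_files      = 10_000_000       # projected file count
avg_file_size    = 50 * 1024        # 50 KiB average
chars_per_level  = 2                # hex characters per directory level
levels           = 2                # directory levels before the file itself

fanout           = 16 ** chars_per_level   # 256 subdirectories per level
leaf_directories = fanout ** levels        # 65,536 leaf directories
files_per_leaf   = total_files / leaf_directories
total_bytes      = total_files * avg_file_size

print(f"leaf directories : {leaf_directories:,}")
print(f"files per leaf   : {files_per_leaf:.1f}")             # ~152 here
print(f"total data       : {total_bytes / 1024**4:.2f} TiB")  # ~0.47 TiB here
```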

For a UUID, if you use the hex version you get A-Z, a-z and 0-9, so 26+26+9, or 61. Three levels deep gives 61*61*61 = 226,981. I figured 226k directory combinations is ample. For XFS this is fine, but for NTFS I'm not sure, so you had better find out what the real limits are. Just listing that many directories by opening Explorer might cause your server to grind somewhat, so you may want to come up with a scheme that doesn't have as many folders at the top level - perhaps using a single character and going 4 levels deep, or something similar.

hookenz
  • Do note that in HEX, you have "a-f" and "0-9", so each character can represent one of 16 possibilities. Three directory levels of 1 char each (0/1/2/012-actualfilename) gives 16*16*16 = 4,096 directories. Three levels of 2 chars each (01/23/45/012345-actualfilename) gives 256 * 256 * 256 = 16,777,216 possibilities. – Michael Bisbjerg Jun 04 '16 at 12:34
  • oops, thanks for pointing that out @MichaelBisbjerg. You're right. – hookenz Jun 07 '16 at 00:00

You don't say which Windows version you will use. I really recommend using 2012 R2 to get all the new NTFS features, like hot repair.

Your 3 nightmares will be:

  • Fragmentation
  • Time taken to run chkdsk: its duration depends on the number of files, not their total size.
  • Backup time

If you are at least on Windows 2012, you should look at ReFS. This new file system has what you want: http://msdn.microsoft.com/en-us/library/windows/desktop/hh848060(v=vs.85).aspx

ReFS issues you may have: managing security, and backup software support.

If you stick with NTFS, I would split the data across a lot of NTFS volumes (using mount points) and use DFS to access them (so that each root folder can be linked to a different drive, and later to a different server, to spread the load).
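On the application side, you can keep that placement deterministic with something like this sketch; the mount-point paths and the prefix-to-volume mapping here are placeholders, not DFS configuration:

```python
import os

# Hypothetical mount points - each one an NTFS volume mounted into an empty folder.
MOUNT_POINTS = [r"D:\store\vol00", r"D:\store\vol01",
                r"D:\store\vol02", r"D:\store\vol03"]

def volume_for(name_hex):
    """Pick a volume from the first byte of the hex name so placement is stable."""
    return MOUNT_POINTS[int(name_hex[:2], 16) % len(MOUNT_POINTS)]

def full_path(name_hex):
    # Same two-character split discussed elsewhere in this thread, rooted per volume.
    return os.path.join(volume_for(name_hex), name_hex[0:2], name_hex[2:4], name_hex)
```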

You should look at defrag software like O&O, which goes much, much further than the built-in Windows defragmenter. Start defragmenting from the beginning, and do it as often as possible.

You will want plenty of RAM so that files get cached (if they are accessed more than once in a while).

Mathieu Chateau