22

Greetings,

I'm writing some scripts to process images from various photo websites. Right now I'm storing all this data in individual text files in the same directory.

The directory is web accessible. An end user makes a call to a web service which returns the path to the file the user will need.

I was wondering at what stage (if any) I would see a performance impact from having all these files in the same directory?

steve

7 Answers

18

Performance varies according to the filesystem you're using.

  • FAT: forget it :) (ok, I think the limit is 512 files per directory)
  • NTFS: Although it can hold about 4 billion files per folder, it degrades relatively quickly. Around a thousand files you will start to notice performance issues, and with several thousand you'll see Explorer appear to hang for quite a while.
  • EXT3: the physical limit is 32,000 files, but performance suffers after several thousand files too.

  • EXT4: theoretically limitless

  • ReiserFS, XFS, JFS, BTRFS: these are the good ones for lots of files in a directory, as they're more modern and designed to handle many files (the others were designed back when HDDs were measured in MB, not GB). Performance is a lot better with lots of files (along with ext4), since they use a binary-search-style lookup to find the file you want, whereas the older filesystems use a more linear scan. A rough way to compare this on your own system is sketched below.
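
As a rough way to see where the knee is on your own setup, you could time lookups by exact name versus a full directory listing. This is a minimal sketch in Python; the directory path and counts below are made up for illustration:

```python
import os
import random
import time

# Hypothetical test location and sizes -- adjust for your own filesystem.
TEST_DIR = "/tmp/filecount-test"
NUM_FILES = 100_000
SAMPLE = 1_000

os.makedirs(TEST_DIR, exist_ok=True)

# Create many small files.
for i in range(NUM_FILES):
    with open(os.path.join(TEST_DIR, f"img_{i:06d}.txt"), "w") as f:
        f.write("x")

# Time random access by exact name (the common web-serving case).
names = [f"img_{random.randrange(NUM_FILES):06d}.txt" for _ in range(SAMPLE)]
start = time.monotonic()
for name in names:
    with open(os.path.join(TEST_DIR, name)) as f:
        f.read()
print(f"{SAMPLE} lookups by name: {time.monotonic() - start:.3f}s")

# Compare with listing the whole directory, which scales with the entry count.
start = time.monotonic()
entries = os.listdir(TEST_DIR)
print(f"listdir of {len(entries)} entries: {time.monotonic() - start:.3f}s")
```

Run it once with a few thousand files and once with a few hundred thousand and compare: on the tree-based filesystems the per-name lookups should stay roughly flat, while the listing time grows with the number of entries.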

gbjbaanb
  • This is wrong. There isn't a limit of 32000 files in EXT3. There's a limit of 32000 subdirectories. I've got a directory here with over 300000 files and it performs fine. – davidsheldon Dec 31 '09 at 10:39
  • Quite true - the file limit is the entire filesystem's limit on inodes, but you're limited to 32k links (i.e. subdirs). – gbjbaanb Jan 01 '10 at 17:16
  • The statement for current NTFS is also not true, it can hold up to 4,294,967,295 (2^32 - 1): http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx – Fleshgrinder Jun 18 '13 at 16:22
  • Do NOT confuse sub-directories with files: on a CentOS machine I had 32000 sub-directories and reached the limit; I moved all the FILES into that one directory and it still works fine. – adrianTNT Aug 01 '13 at 19:04
  • Some [numbers for MacOS here](https://superuser.com/questions/845143/any-limitation-for-having-many-files-in-a-directory-in-mac-os-x) – smci Nov 20 '17 at 23:27
10

I store images for serving by a web server, and I have over 300,000 images in one directory on EXT3. I see no performance issues. Before setting this up, I did tests with 500k images in a directory, randomly accessing files by name, and there was no significant slowdown with 500k versus 10k images in the directory.

The only downside I see is that in order to sync the new ones with a second server I have to run rsync over the whole directory, and can't just tell it to sync a subdirectory containing the most recent thousand or so.
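
One possible workaround (just a sketch; the source path and destination host below are placeholders) is to build the list of recently modified files yourself and hand it to rsync with `--files-from`, so the transfer only considers those entries rather than the whole directory:

```python
import os
import subprocess
import time

# Placeholder paths/host -- substitute your own.
SRC_DIR = "/var/www/images/"
DEST = "backup-host:/var/www/images/"
MAX_AGE = 60 * 60  # only sync files modified in the last hour

cutoff = time.time() - MAX_AGE
with os.scandir(SRC_DIR) as it:
    recent = [e.name for e in it if e.is_file() and e.stat().st_mtime >= cutoff]

if recent:
    # --files-from=- reads the (relative) file list from stdin, so rsync
    # transfers only these entries instead of walking the whole directory.
    subprocess.run(
        ["rsync", "-a", "--files-from=-", SRC_DIR, DEST],
        input="\n".join(recent),
        text=True,
        check=True,
    )
```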

davidsheldon
3

The number of files in a folder could theoretically be limitless. However, every time the OS accesses the folder to look up files, it has to process all of the folder's entries. With fewer than 500 files you might not notice any delay, but when you have tens of thousands of files in a single folder, a simple folder listing command (ls or dir) can take far too long. When these folders are accessed through FTP, it really becomes too slow...

Performance issues won't really depend on your OS but on your system's processor speed, disk capacity and memory. If you have that many files, you might want to combine them into a single archive and use an archiving system that is optimized to hold a lot of data. This could be a ZIP file, but better yet, store them as blobs in a database with the file name as the primary key.
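
If you do go the database route, a minimal sketch of the idea in Python with SQLite (the table and column names here are just examples) could look like this:

```python
import sqlite3

# File name as primary key, image bytes as a BLOB.
conn = sqlite3.connect("images.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS images (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
)

def store(name, path):
    """Read an image file from disk and store it under its name."""
    with open(path, "rb") as f:
        conn.execute(
            "INSERT OR REPLACE INTO images (name, data) VALUES (?, ?)",
            (name, f.read()),
        )
    conn.commit()

def load(name):
    """Return the image bytes for a name, or None if it is not stored."""
    row = conn.execute(
        "SELECT data FROM images WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

Lookups are then by primary key, so the cost no longer depends on how many images would otherwise sit in any one folder.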

Wim ten Brink
  • But will accessing the file directly remove bottlenecks with searching directories, or will accessing a directory still have an underlying search call? (Linux, Debian) – steve Dec 30 '09 at 14:49
  • Accessing the file directly will mitigate these problems. I've done tests on ext3, and accessing a file by name in a directory containing 500000 files is not significantly slower than one containing 1000. Obviously doing an `ls` is a problem. – davidsheldon Dec 31 '09 at 10:33
  • When knowing the exact name, access should be fast. The problem would be mostly any code or command that wants to get a list of files. – Wim ten Brink Jan 01 '10 at 01:09
1

My rule of thumb is to split folders if there are more than 1,000 files and the folder will be browsed (e.g. through the internet or Explorer), or 5,000 files otherwise.
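
One crude way to apply that rule when writing files (a sketch only; the threshold and the numbered-folder naming are arbitrary choices) is to roll over to a new subfolder once the current one reaches the limit:

```python
import os

BASE_DIR = "images"      # example base directory
MAX_PER_FOLDER = 1000    # rule-of-thumb limit for folders that will be browsed

def folder_with_room(base=BASE_DIR, limit=MAX_PER_FOLDER):
    """Return a numbered subfolder (000, 001, ...) that still has room."""
    n = 0
    while True:
        folder = os.path.join(base, f"{n:03d}")
        os.makedirs(folder, exist_ok=True)
        if len(os.listdir(folder)) < limit:
            return folder
        n += 1

# Usage: write each new file into whichever folder currently has room.
target_path = os.path.join(folder_with_room(), "photo_12345.jpg")
```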

Beep beep
0

If you are accessing a file directly, the number of files in the directory is not a speed problem.

The number of files you can create in a single directory depends on the file system you are using. If you are listing all the files in the directory, or searching, sorting, etc., having many files will slow those operations down.

gbjbaanb's answer is wrong about the maximum number of files on ext3. In general, ext limits the number of files on your disc as a whole: you can't create more files than you have inodes in your inode table. He is correct in suggesting ReiserFS for better performance with many files.
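
Since the inode table is the real ceiling, you can check how much headroom you have from a script. Here is a small sketch using `os.statvfs` (Unix-only), which reports roughly what `df -i` does:

```python
import os

def inode_usage(path="."):
    """Return total and free inode counts for the filesystem holding `path`."""
    st = os.statvfs(path)  # Unix-only
    return {"total_inodes": st.f_files, "free_inodes": st.f_ffree}

print(inode_usage("/"))
```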

Janusz
0

I checked a folder with 10K files on NTFS (Windows 7, 64-bit). A folder with 10K images in any view (List, Icons, etc.) works and scrolls without any noticeable delay.

Vil
0

As @skaffman points out, the limits depend on the operating system. You're likely to be affected by limits on older OSes. I remember an old version of Solaris was limited to 32768 files per directory.

The usual solution is to use some sort of hashing; for example, the Cyrus IMAP server splits users by an alphabetic hash:

/var/spool/imap/a/user/anna/
/var/spool/imap/a/user/albert/
/var/spool/imap/d/user/dan/
/var/spool/imap/e/user/ewan/
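
In a script, the same idea might look something like this (a sketch; the base path just mirrors the Cyrus layout above, and you'd adapt the bucketing to your own image names):

```python
from pathlib import Path

BASE = Path("/var/spool/imap")  # base directory from the example above

def user_dir(username: str) -> Path:
    """Bucket each user under the first letter of their name."""
    return BASE / username[0].lower() / "user" / username

# e.g. user_dir("anna") -> /var/spool/imap/a/user/anna
```
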
diciu
  • Thanks, I'd definitely have something in place once a dir has more than 2k files! :) – steve Dec 30 '09 at 14:48
  • This question has some good answers: http://serverfault.com/questions/95444/storing-a-million-images-in-the-filesystem – gm3dmo Dec 30 '09 at 15:26
  • My general rule of thumb is that more than about 20,000 files in a directory is not a good idea. Most modern filesystems do ok with that many files. Once you hit 32k files in a directory some filesystems such as ext3 will start having serious performance issues. – Phil Hollenback Dec 30 '09 at 20:17
  • Phil - do you have any information on the performance issues with over 32k files with ext3? I'm not seeing any at the moment with over 300k. Maybe it's something that isn't affecting my pattern of use. – davidsheldon Dec 31 '09 at 10:41
  • At my previous job scientific software would generate lots of small (few k each) files in a directory. We definitely saw that for >32k files directory read times would shoot up hugely. Just running 'ls' on a directory with that many files would take a minute or more. – Phil Hollenback Jan 07 '10 at 05:43
  • Wouldn't /var/spool/imap/user/a, /var/spool/imap/user/b, etc. be better? or is that typo? – Gordon Bell Mar 09 '16 at 04:48