32

OK, not that large, but I need to store around 60,000 files with an average size of 30 KB in a single directory (this is a requirement, so I can't simply break them into sub-directories with a smaller number of files each).

The files will be accessed randomly, but once created there will be no writes to the same filesystem. I'm currently using Ext3 but finding it very slow. Any suggestions?

voretaq7
bugmenot77

12 Answers

14

One billion files on Linux

The author of this article digs into some of the performance issues on filesystems with large file counts and makes some nice comparisons of the performance of various filesystems: ext3, ext4 and XFS. The material is available as a slide show: https://events.static.linuxfound.org/slides/2010/linuxcon2010_wheeler.pdf

The benchmarks cover:

  • time to run mkfs
  • time to create 1M 50 KB files
  • filesystem repair time
  • time to remove 1M files

nelaaro
  • We really do prefer that answers contain content, not pointers to content. Whilst this may theoretically answer the question, [it would be preferable](http://meta.stackexchange.com/q/8259) to include the essential parts of the answer here, and provide the link for reference. – user9517 Aug 27 '12 at 11:55
  • @Iain I hope that is better; simply downloading the PDF would give you the same info. – nelaaro Aug 27 '12 at 12:25
  • Wow, these are some exceptionally hard-to-read graphs. – ThorSummoner Nov 05 '15 at 22:18
14

You should consider XFS. It supports a very large number of files both at the filesystem and at the directory level, and the performance remains relatively consistent even with a large number of entries due to the B+ tree data structures.

There's a page on their wiki linking to a large number of papers and publications that detail the design. I recommend you give it a try and benchmark it against your current solution.
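
For what it's worth, a rough way to benchmark your workload yourself (a sketch only: the spare partition /dev/sdb1, mount point and file counts below are assumptions to adjust) could look like this:

# Make an XFS filesystem on a spare partition and mount it (run as root).
mkfs.xfs /dev/sdb1
mkdir -p /mnt/xfstest
mount /dev/sdb1 /mnt/xfstest
cd /mnt/xfstest

# Create 60,000 files of roughly 30 KB each and time it.
time for i in $(seq 1 60000); do
    dd if=/dev/zero of=file_$i bs=30k count=1 2>/dev/null
done

# Drop the page cache so reads hit the disk, then time 1,000 random reads.
echo 3 > /proc/sys/vm/drop_caches
time for i in $(shuf -i 1-60000 -n 1000); do
    cat file_$i > /dev/null
done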

Kamil Kisiel
8

Storing many files in a directory on ext3 has been discussed at length on the sister site stackoverflow.com.

In my opinion, 60,000 files in one directory on ext3 is far from ideal, but depending on your other requirements it might be good enough.

Ludwig Weinzierl
6

OK. I did some preliminary testing using ReiserFS, XFS, JFS, Ext3 (dir_index enabled) and Ext4dev (2.6.26 kernel). My first impression was that all of them were fast enough on my beefy workstation; it turns out the remote production machine has a fairly slow processor.

I experienced some weirdness with ReiserFS even in initial testing, so I ruled it out. JFS seems to need about 33% less CPU than all the others, so I'll test it on the remote server. If it performs well enough, I'll use that.

bugmenot77
4

I'm writing an application that also stores lots and lots of files, although mine are bigger and I have 10 million of them, which I'll be splitting across multiple directories.

ext3 is slow mainly because of its default "linked list" directory implementation: if you have lots of files in one directory, opening or creating another one gets slower and slower. There is something called an htree index, available for ext3, that reportedly improves things greatly, but it is only available at filesystem creation time. See here: http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/

Since you're going to have to rebuild the filesystem anyway, and given the ext3 limitations, my recommendation is that you look at using ext4 (or XFS). I think ext4 is a little faster with smaller files and has quicker rebuilds. The htree index is the default on ext4 as far as I'm aware. I don't really have any experience with JFS or Reiser, but I have heard people recommend them before.

In reality, I'd probably test several filesystems. Why not try ext4, XFS and JFS and see which one gives the best overall performance?

Something a developer told me that can speed things up in the application code is to do "open + fstat" rather than "stat + open"; the latter is significantly slower. Not sure if you have any control or influence over that.

See my post on Stack Overflow, "Storing & accessing up to 10 million files in Linux"; there are some very useful answers and links there.

hookenz
4

Using tune2fs to enable dir_index might help. To see if it is enabled:

sudo tune2fs -l /dev/sda1 | grep dir_index

If it is not enabled:

sudo umount /dev/sda1
sudo tune2fs -O dir_index /dev/sda1
sudo e2fsck -D /dev/sda1      # -D rebuilds and re-indexes the existing directories
sudo mount /dev/sda1

But I have a feeling you might be going down the wrong path: why not generate a flat index and use some code to select files randomly based on it? You could then use sub-directories for a more optimized tree structure.
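
As a rough sketch of the flat-index idea (the paths below are placeholders, not anything from the question):

# Build a flat index of every file once, after the files have been created.
find /data/files -type f > /data/index.txt

# Pick one file at random from the index and read it.
cat "$(shuf -n 1 /data/index.txt)" > /dev/null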

Kyle Brandt
2

ext3 and below support up to 32768 files per directory. ext4 supports up to 65536 in the actual count of files, but will allow you to have more (it just won't store them in the directory, which doesn't matter for most user purposes).

Also, the way directories are stored on ext* filesystems is essentially as one big list. On the more modern filesystems (Reiser, XFS, JFS) they are stored as B-trees, which are much more efficient for large sets.

koenigdmj
  • Supporting that number of files in a dir is not the same thing as doing it at a reasonable speed. I don't know yet whether ext4 is any better, but ext3 slows down greatly when it has more than a few thousand files in a directory, even with dir_index turned on (it helps, but doesn't eliminate the problem entirely). – cas Jul 20 '09 at 21:50
1

You can store file inodes instead of filenames: accessing inode numbers should be much faster than resolving file names.

kolypto
0

BTRFS would be very practical. The problem here seems to be small files. NVMe and SSD devices have 4K blocks, which is more than suitable for that file size, and they are very fast at accessing small files. 30 KB × 60,000 files is about 1.7 GB in total; it's not even in terabyte scale. So I recommend using a ramdisk with a UPS and syncing it to an NVMe drive every 10 seconds with rsync, which only syncs changed files. Keep 100 versions or so, rebalance after restarts, and sync to a separate backup every hour.
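
A minimal sketch of the ramdisk-plus-rsync idea (the mount points, size and interval below are assumptions; versioning, UPS handling and the hourly backup are left out):

# Create a ramdisk large enough for the data set and a mirror directory (run as root).
mkdir -p /mnt/ramdisk /mnt/nvme/mirror
mount -t tmpfs -o size=4g tmpfs /mnt/ramdisk

# Mirror the ramdisk to persistent storage every 10 seconds;
# rsync only copies files that have changed since the last run.
while true; do
    rsync -a --delete /mnt/ramdisk/ /mnt/nvme/mirror/
    sleep 10
done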

Remember that BTRFS can waste a lot of space (around 70%) with small files, but space is not something you need to worry about here.

Note that I wrote this without inspecting the first answer (the one with the graphs) in depth. After checking it out, it confirms my reasoning.

Gediz GÜRSU
0

You don't want to cram that many files into one directory; you want some sort of structure. Even something as simple as having subdirectories named after the first character of each file can improve your access times. Another silly trick I like to use is to force the system to update its metadata cache by running updatedb regularly. In one window run slabtop, and in another run updatedb, and you'll see how much memory gets allocated to caching. It's much faster this way.
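
A hedged example of the first-character bucketing (bash; assumes the files live in /data and that their names do not collide with the single-character directory names):

# Move each file into a subdirectory named after its first character.
cd /data
for f in *; do
    [ -f "$f" ] || continue      # skip anything that is not a regular file
    d=${f:0:1}                   # first character of the file name
    mkdir -p "$d"
    mv -- "$f" "$d/"
done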

Marcin
-1

You didn't specify the kind of data in these files. But from the sounds of it, you should be using some sort of database with indexing for quick searches.
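
As a hedged sketch only (recent versions of the sqlite3 command-line shell provide readfile() and writefile() helpers; the file names are placeholders):

# Create a table with an indexed name column and store a file's contents as a blob.
sqlite3 files.db "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB);"
sqlite3 files.db "INSERT INTO files VALUES ('example.dat', readfile('example.dat'));"

# Look a file up by name and write its contents back out.
sqlite3 files.db "SELECT writefile('/tmp/example.dat', data) FROM files WHERE name = 'example.dat';"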

xeon
-1

A filesystem is probably not the ideal storage for such a requirement; some kind of database storage is better. Still, if you can't avoid it, try splitting the files into several directories and use unionfs to mount (bind) those directories onto a single directory where you want all the files to appear. I have not used this technique for speed-up myself, but it is worth a try.
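
For illustration, with the userspace unionfs-fuse implementation (the package name and exact syntax vary by distribution, and the branch paths below are placeholders) the merge could look something like:

# Merge three read-only branches into one directory where all the files appear.
mkdir -p /data/all
unionfs-fuse /data/part1=RO:/data/part2=RO:/data/part3=RO /data/all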

Saurabh Barjatiya