
I have 2x 4 TB disks in hardware RAID1 (it might be an LSI MegaRAID) on Debian Wheezy. The physical block size is 4 kB. I'm going to store 150-200 million small files (between 3 and 10 kB each). I'm not asking about performance, but about the best filesystem and block size to save storage. I've copied an 8,200-byte file onto an ext4 filesystem with a 4 kB block size, and it took 32 kB of disk!? Is journaling the reason for that? So what options are there to save the most storage for such small files?
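(As an aside, a quick way to compare a file's apparent size with what it actually occupies on disk is to look at the allocated block count — a minimal Python sketch, where the path is just a placeholder:)

```python
import os

path = "somefile.dat"  # placeholder path, not from the original question

st = os.stat(path)
apparent = st.st_size            # logical file size in bytes
allocated = st.st_blocks * 512   # st_blocks is always reported in 512-byte units
print(f"apparent size:  {apparent} bytes")
print(f"allocated size: {allocated} bytes")
print(f"overhead:       {allocated - apparent} bytes")
```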

rabudde
    See also: [How do I determine the block size of an ext3 partition on Linux?](http://serverfault.com/questions/29887/how-do-i-determine-the-block-size-of-an-ext3-partition-on-linux) – Chris S Jan 08 '14 at 22:19

1 Answer


If I were in that situation, I'd be looking at a database that can store all the data in a single file with a compact, offset-based index, rather than as separate files. Maybe a database that has a FUSE driver available for interacting with it as files when necessary, without them actually all BEING separate files.
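To make that concrete, here is a minimal sketch of the "one data file plus an offset-based index" idea. The file names and the JSON index are purely illustrative; a real key/value store would handle indexing, durability and concurrency far more robustly:

```python
import json
import os

# Hypothetical file names for illustration only.
DATA_FILE = "pack.dat"
INDEX_FILE = "pack.idx"

def pack(paths):
    """Append each file's bytes to one data file, recording (offset, length) per key."""
    index = {}
    with open(DATA_FILE, "ab") as data:
        data.seek(0, os.SEEK_END)  # make sure tell() reports the end of the file
        for path in paths:
            with open(path, "rb") as f:
                blob = f.read()
            offset = data.tell()
            data.write(blob)
            index[os.path.basename(path)] = (offset, len(blob))
    with open(INDEX_FILE, "w") as idx:
        json.dump(index, idx)

def read(key):
    """Seek straight to the stored offset and return the original bytes."""
    with open(INDEX_FILE) as idx:
        index = json.load(idx)
    offset, length = index[key]
    with open(DATA_FILE, "rb") as data:
        data.seek(offset)
        return data.read(length)
```

Packing 3-10 kB records back to back like this avoids rounding every file up to a 4 kB block boundary, which is where most of the waste in the question comes from.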

Alternatively, you could look at, say, the 60th-70th percentile of file sizes and try to fit files of that size directly into the filesystem tree nodes, rather than as separate blocks on disk. Storing 10 kB in each node is probably a big ask, but if you could get 60-70% of files in there, that would probably be a huge win.

Only certain filesystems can do that at all (ReiserFS is one), and it all depends on what size that percentile is and whether it WILL actually fit in the tree. You may be able to tune it. For the rest of the files, try to fit each one into a single block.
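If you want to check whether that is realistic for your data, a rough sketch for measuring a size percentile over an existing tree (the root path and the 0.65 cut-off are placeholders):

```python
import os

def size_percentile(root, pct=0.65):
    """Walk a directory tree and return the file size at the given percentile."""
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            sizes.append(os.path.getsize(os.path.join(dirpath, name)))
    sizes.sort()
    return sizes[min(int(len(sizes) * pct), len(sizes) - 1)]

# If the 65th-percentile size comes out well under ~4 kB, inlining small files
# in the metadata tree (or a single block) would already cover most of them.
print(size_percentile("/path/to/files"))  # placeholder path
```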

And don't worry about journals; they have an upper size limit anyway.

  • 4
    No no no no no no no no just... no to your 1st paragraph. I made this mistake years ago and it had to be undone later on. I've also inherited systems that use this design pattern. Files belong in the file system, or as a compromise, in an SQL Server FileStream object if you *must* combine them (so maybe your FUSE driver, but still just no). There are other considerations when working in the filesystem, like don't put 4 million files in one folder (I've also made that mistake). – Mark Henderson Jan 09 '14 at 00:04
  • 2
    @MarkHenderson but the problem is defining what SHOULD be a file, and what should be a record. Without any more details having been provided, hundreds of millions of tiny things sound MUCH more like records to me. Just because he currently has them as files, it does not mean that they need to remain that way, or should ever have been that way. Also, I never for a second suggested using SQL Server for the job ;) –  Jan 09 '14 at 00:07
  • 2
5 years ago I inherited a system with 1 million files in a single folder, and about 10,000 new 1-4KB files every day. I decided to throw them all into an ISAM table because "Hey, they're just plain text for analysing!" and then that turned out to be a huge mistake because I now had a single 12GB table with a squillion rows that were mostly doing nothing after they were processed. So I switched back to putting them in a filesystem with hierarchical folders based on the GUID of the filename. – Mark Henderson Jan 09 '14 at 00:12
  • (why a single 12GB table with a squillion rows was a problem is a different matter that I won't get into here) – Mark Henderson Jan 09 '14 at 00:15
  • 2
    @MarkHenderson: It's not a different problem, that's WHY you said it was the wrong solution ("...huge mistake because I now had a single 12GB table with a squillion rows...."). You chose the wrong database engine / table format, but the concept of putting lots of small things into a single file with an INDEX is sound, so long as you do it right. What you want is a database that excels at key/value stores for millions of small objects, with auto-sharding. Also note that he's specifically not even caring about performance, just space. –  Jan 09 '14 at 00:19