
I am designing a system capable of working with 15 million (and growing) image files ranging from 100 KB to 10 MB. I am looking for opinions on which filesystem would best support these (somewhat) odd requirements:

Additional Information/Requirements:

  • The directory structure is certainly non-optimal [1], but due to the design of the applications pulling this data, it is relatively immutable.
  • The data should be read-optimized, which includes, but may not be limited to: random reads, sequential reads, and directory listings (some directories may contain 30,000 directories or 1,000 images).
  • Additional data will be written to the file structure (new subdirectories, additional files in existing subdirectories, etc.) on a semi-regular basis; however, write performance is not much of a concern. Data will be written via SMB or NFS.
  • There is a significant number of identical files (a conservative estimate is 20%); however, due to the design of the application pulling this data, we can't delete the duplicate filenames. Ideally we would like some sort of deduplication (we could certainly hard link, but I am not sure how millions of hard links would scale).
  • SSDs will be the primary form of storage for this project (unless an argument can be made for spinners instead) so we would like to limit writes to the system where possible.

The hardware we have allocated for this project is as follows:

Dell R720xd w/ 24x 2.5” bays
RAM: 128GB RAM (more can be allocated if needed)
CPU: 2x E5-2620 @ 2.20GHz
Storage:
    8x2TB SSDs local storage
    1x500GB SSD for OS
RAID: H310 (IT Mode)

We were initially considering ZFS for this, but after some additional research it appears:

  • ZFS may thrash the SSDs when writing metadata updates.
  • ZFS has a high RAM requirement for deduplication (5 GB of RAM per 1 TB of data). This should be doable on our current hardware, though it seems like a lot of overhead.
  • ReiserFS may be better suited for random lookups on small files (I can't seem to find what qualifies as a "small" file).
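
As a rough check of the deduplication RAM figure against the hardware listed above, the arithmetic can be sketched as follows. It takes the 5 GB per 1 TB rule of thumb quoted in the list at face value and uses the raw capacity of the 8x2TB SSDs as an upper bound (usable capacity after redundancy would be lower). The function name `dedup_ram_gb` is hypothetical.

```python
def dedup_ram_gb(pool_tb, gb_per_tb=5.0):
    """Estimate dedup RAM in GB from pool size in TB.

    gb_per_tb is the rule-of-thumb figure quoted in the question,
    not a measured value.
    """
    return pool_tb * gb_per_tb


raw_tb = 8 * 2          # 8 SSDs of 2 TB each, before any RAID overhead
installed_ram_gb = 128  # RAM listed for the R720xd

estimate = dedup_ram_gb(raw_tb)
print(f"estimated dedup RAM: {estimate:.0f} GB of {installed_ram_gb} GB installed")
```

Under these assumptions the estimate fits within the installed RAM, but it leaves correspondingly less memory for the read cache.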

Any opinions on an optimal filesystem for this use case as well as any hardware tweaks would be much appreciated.

[1]

Example directory structure (none of the directory names or filenames are normalized, e.g. sequential, in any way):

+ root directory 1
    - sub directory 1
        - image 1
        - image 2
        - image 3
        - ...
        - image n (where n is between 1 and 1,000+)
    - sub directory 2
        - image 1
        - image 2
        - image 3
        - ...
        - image n
    ....
    - sub directory n (where n is between 1,000 and 30,000)
        - image 1
        - image 2
        - image 3
        - ...
        - image n
+ root directory 2
+ ...
+ root directory 15
Josh
  • I'm quite surprised you're actually considering ReiserFS. Reiser3 is as dead as Hans Reiser's wife, and Reiser4 is unlikely to ever be included in Linux. Either way you're using out of tree kernel modules, but at least people actually use ZFS. If you need something actually in the kernel for supportability, XFS is your only option. – Michael Hampton Nov 27 '18 at 00:37
  • Thanks for the ReiserFS information @MichaelHampton. I wasn't aware of that - I am primarily a developer, so filesystem level design is a bit of a new venture for me. – Josh Nov 27 '18 at 00:52
  • What is the IO going to look like? Are there going to be 1,000 different sources reading your data, or is this going to be accessed by a single application server? – RobbieCrash Nov 27 '18 at 07:54
  • @RobbieCrash this is an internal application that will be accessed by a few users, let's say no more than 5 simultaneous connections at a time. – Josh Nov 27 '18 at 10:27

1 Answer


Any filesystem (including lowly ext4 and slightly-less-lowly XFS) can meet the requirements you’ve listed, which basically amount to storing lots of files with reasonable performance across a wide variety of use cases. My knowledge (and the interesting trade-offs in this answer) is mainly about ZFS, so I’ll focus on that.

The additional abilities you would get from ZFS are:

  1. Dedup. As you said, this is not super wonderful in ZFS because it has a heavy RAM requirement, but it does work. To get something similar on non-ZFS, you could hash your files and use the hashes as filenames / directory names, or keep a database of hash -> file name so you can make hard links. (In any of those cases the files would need to be byte-for-byte identical, not just images that look the same.)
  2. Compression. Most image formats are already compressed, so this probably won’t buy you much, but if the images are RAW instead of JPEG, it could be a big savings.
  3. Ability to snapshot / back up. ZFS has great built-in tools for this. You can back up non-ZFS too, although it might be hard to get a consistent snapshot of your data. LVM can do some of this, although arguably not as well.
  4. Volume management is a part of ZFS. You can choose from a set of very flexible RAID configurations to get the optimal configuration of [data redundancy, space usage, performance] for your particular application. You can get some of this from LVM and other software RAID, but I believe ZFS has one of the best-designed solutions for volume management out there, combined with a well-designed system for failure detection and recovery.
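
The hash-and-hard-link approach from item 1 could be sketched roughly like this. It is a minimal illustration, not a production tool: the function names `file_digest` and `hardlink_duplicates` and the `.dedup-tmp` suffix are made up for this example, it assumes everything under `root` lives on one filesystem (hard links cannot cross filesystems), and it trusts SHA-256 equality rather than doing a final byte-by-byte comparison.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace byte-identical files under root with hard links.

    Keeps every filename in place; returns how many files were relinked.
    """
    first_seen = {}  # digest -> first path seen with that content
    replaced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.is_symlink() or not path.is_file():
                continue
            original = first_seen.setdefault(file_digest(path), path)
            if original is path:
                continue  # first file with this content
            if os.path.samefile(original, path):
                continue  # already a hard link to the same inode
            # Link under a temporary name, then rename over the duplicate,
            # so the filename never disappears from the directory.
            tmp = path.with_name(path.name + ".dedup-tmp")
            os.link(original, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

Calling `hardlink_duplicates(Path("/some/root"))` (a placeholder path) would relink duplicates in place. Note that hard-linked names share one set of data and metadata, so an application that modifies a file in place would change it under every name; this only suits data that is written once and then read.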

Two other things you mentioned:

  • Thrashing metadata. I don’t think ZFS would be worse than other filesystems. It does update a fair amount of metadata during writes, but it’s copy-on-write and it batches those updates every 5-10 seconds, so it issues large contiguous writes instead of small in-place writes that require NAND blocks to be erased and rewritten many times. A traditional filesystem does the opposite: it updates metadata in place, which is probably slightly worse for the SSD. At any rate, modern SSDs reserve a lot of extra blocks internally to extend the life of the drive as it wears; normal SSD lifetimes are considered comparable to spinning-disk lifetimes. I’m not saying it doesn’t matter, I just don’t think you should fixate on this aspect, since it’s pretty minor.
  • Hard link scalability. Hard links should scale as well as or better than normal files (in ZFS or not). Either way, a hard link is just another directory entry pointing to the same inode as some other file, and you’ll probably get a very small cache efficiency win, since reading the file through one link caches it for accesses through the other links too.
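
To make the "pointer to the same inode" point concrete, here is a small sketch that creates a file and a hard link to it in a temporary directory and compares what `os.stat` reports for each name. The helper `link_info` and the file names are hypothetical and only serve the illustration.

```python
import os
import tempfile
from pathlib import Path


def link_info(path):
    """Return (inode number, hard link count) for a path."""
    st = os.stat(path)
    return st.st_ino, st.st_nlink


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "image1.jpg"    # hypothetical file name
    alias = Path(tmp) / "image1-dup.jpg"   # hypothetical file name
    original.write_bytes(b"placeholder image bytes")
    os.link(original, alias)               # second name for the same inode

    ino_a, nlink_a = link_info(original)
    ino_b, nlink_b = link_info(alias)
    same_inode = ino_a == ino_b            # both names resolve to one inode
    link_count = nlink_a                   # number of names for that inode
    print(same_inode, link_count)
```

Both names report the same inode number, and the link count reflects the two directory entries, which is why the extra cost of a link is one directory entry rather than a second copy of the data.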
Dan