I am designing a system capable of working with 15 million (and growing) image files ranging from 100 KB to 10 MB. I am looking for opinions on the best filesystem to support these (somewhat) odd requirements:
Additional Information/Requirements:
- The directory structure is certainly non-optional [1], but due to the design of the applications pulling this data, it is relatively immutable.
- The data should be read-optimized, which includes, but may not be limited to: random reads, sequential reads, directory listings (some directories may contain 30,000 subdirectories or 1,000+ images), etc.
- Additional data will be written to the file structure (new subdirectories, additional files in existing subdirectories, etc.) on a semi-regular basis; however, write performance is not much of a concern. Data will be written via SMB or NFS.
- There is a significant number of identical files (a conservative estimate is 20%); however, due to the design of the application pulling this data, we can't delete the duplicates or change their paths. Ideally we would like some sort of deduplication (we could certainly hard link, but I am not sure how millions of hard links would scale).
- SSDs will be the primary form of storage for this project (unless an argument can be made for spinners instead), so we would like to limit writes to the system where possible.
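Since the applications only care about paths, hard links can collapse byte-identical files without touching any filename: every duplicate keeps its own directory entry, they just share one inode. A minimal sketch of that approach (the function name and hashing scheme are my own illustration, not an existing tool):

```python
import hashlib
import os

def dedupe_hardlink(root):
    """Replace byte-identical files under root with hard links to one copy.

    Filenames are preserved -- only the underlying inode is shared -- so
    applications that address files by path are unaffected.
    """
    seen = {}  # content digest -> canonical path for that content
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digest = h.digest()
            if digest in seen and not os.path.samefile(seen[digest], path):
                # Link the canonical copy under a temp name, then atomically
                # swap it over the duplicate; the original path never vanishes.
                tmp = path + ".dedup-tmp"
                os.link(seen[digest], tmp)
                os.replace(tmp, path)
            else:
                seen.setdefault(digest, path)
    return seen
```

Note this only works within a single filesystem, and anything that later rewrites one of the linked files in place would change all of its aliases, so it fits read-mostly data like this better than general storage.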
The hardware we have allocated for this project is as follows:
Dell R720xd w/ 24x 2.5” bays
RAM: 128GB (more can be allocated if needed)
CPU: 2x E5-2620 @ 2.20GHz
Storage:
8x 2TB SSDs (data)
1x 500GB SSD (OS)
Controller: H310 (flashed to IT mode)
We were initially considering ZFS for this, but after some additional research it appears:
- ZFS may thrash the SSDs when writing metadata updates.
- ZFS has a high RAM requirement for deduplication (roughly 5GB of RAM per 1TB of deduplicated data). This should be doable on our current hardware, it just seems like a lot of overhead.
- ReiserFS may be better suited for random lookups of small files (though I can't find what qualifies as a "small" file).
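For what it's worth, that 5GB-per-TB rule of thumb is just one in-core dedup-table entry (commonly quoted at ~320 bytes) per unique block, assuming 64 KiB average block size. A back-of-the-envelope calculator (these are rules of thumb, not ZFS internals guarantees):

```python
def zfs_ddt_ram_bytes(data_bytes, avg_block_bytes=64 * 1024, entry_bytes=320):
    """Rough in-core ZFS dedup table (DDT) size.

    One ~320-byte entry per unique block; the widely cited ~5 GB RAM per
    1 TB of deduped data falls out of assuming 64 KiB average blocks.
    Both constants here are rule-of-thumb assumptions, not guarantees.
    """
    return (data_bytes // avg_block_bytes) * entry_bytes

TiB = 1024 ** 4
print(zfs_ddt_ram_bytes(TiB) / 1024 ** 3)  # exactly 5.0 GiB per TiB
```

For something like 14 TB of usable SSD that works out to roughly 70 GB of DDT, i.e. more than half the 128GB in the box, which is why the overhead gave us pause. Note the actual per-block average could be smaller than 64 KiB given how many of our images are near the 100 KB end, which would push the estimate up.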
Any opinions on an optimal filesystem for this use case as well as any hardware tweaks would be much appreciated.
[1]
Example directory structure (directory and file names are not normalized, sequential, or otherwise predictable in any way)
+ root directory 1
- sub directory 1
- image 1
- image 2
- image 3
- ...
- image n (where n is between 1 and 1,000+)
- sub directory 2
- image 1
- image 2
- image 3
- ...
- image n
- ...
- sub directory n (where n is between 1,000 and 30,000)
- image 1
- image 2
- image 3
- ...
- image n
+ root directory 2
+ ...
+ root directory 15