
I have a file-based DB that has about 2M files stored in 3 levels of subdirectories.

2/2/6253
2/2/6252
...

Files vary from 30 bytes to 60 KB. The whole DB is read-only and is about 125 GB in size.

Added: All files are compressed with zlib (Python).

I want to handle it all as one file with a filesystem in it. Which filesystem would be my best choice?

At the moment I use the following script:

dd if=/dev/zero of=/my_file.iso bs=1024K count=60000
mkfs.ext4 -F /my_file.iso
mount -o loop /my_file.iso /mnt/
Worker

4 Answers


You probably just want to use XFS.

It's quite capable of what you're asking for, and does the job.

There's no reason to complicate this with lesser-used filesystems, which can come with other tradeoffs.

Please see: How does the number of subdirectories impact drive read / write performance on Linux? and The impact of a high directory-to-file ratio on XFS
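As a rough sketch (the image path and size here are only illustrative, sized a bit above the ~125 GB data set), the same loopback approach from the question works with XFS in place of ext4:

truncate -s 140G /my_file.img        # sparse image file
mkfs.xfs /my_file.img                # format the image as XFS
mount -o loop /my_file.img /mnt/     # populate /mnt/ with the 2M files here
mount -o remount,ro /mnt/            # then remount read-only for production use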

If you want something more esoteric, ZFS zvols with a filesystem on top could provide an interesting alternative (for compression, integrity and portability purposes).

See here: Transparent compression filesystem in conjunction with ext4
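If you go that route, a minimal sketch could look like the following; the pool name tank and volume name filedb are hypothetical, and since your files are already zlib-compressed, the compression property may gain little:

zfs create -V 140G -o compression=lz4 tank/filedb   # zvol with transparent compression
mkfs.xfs /dev/zvol/tank/filedb                       # any ordinary filesystem on top
mount /dev/zvol/tank/filedb /mnt/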

ewwhite

If it is read-only, why not use an ISO file? You can create one with genisoimage or mkisofs.
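For example, a minimal sketch (the source directory /data/db is hypothetical here) could be:

genisoimage -R -J -o /my_db.iso /data/db    # -R/-J preserve long file names
mount -o loop,ro /my_db.iso /mnt/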

If you want to compress the whole thing, you can also use squashfs, another read-only filesystem with a very high compression ratio.

shodanshok
  • What are the advantages of ISO? – Worker May 04 '15 at 13:13
  • It is optimized to be a fast read-only filesystem. If you want to compress the whole thing, you can also use squashfs, another read-only filesystem with very high compression ratio. – shodanshok May 04 '15 at 13:16
  • ISO seems to be good; the only thing I am worried about is random file access. Do you know how well ISO handles random file access? – Worker May 04 '15 at 13:23
  • Mmm... I don't have a definite answer, but ISO is optimized for sequential access, not random access. You'll have to test that yourself, it seems. – shodanshok May 04 '15 at 14:33

Seeing the number of small files, I would consider using SquashFS, especially if you have a powerful enough CPU (meaning no Pentium III or 1 GHz ARM).

Depending on the type of data stored, SquashFS can greatly reduce its size and thus the I/O when reading it. The only downside is CPU usage on read. On the other hand, any modern CPU can decompress at speeds far outperforming an HDD and probably even an SSD.

As another advantage, you save space/bandwidth and/or time spent uncompressing after transfer.

There are some benchmarks comparing it to ISO and other similar means. As with every benchmark, take them with a grain of salt and, better yet, run your own. ;-)

Edit: depending on circumstances (and I'm not daring to guess here), SquashFS without compression (mksquashfs -noD) could outperform ext4, as the code for reading should be much simpler and optimized for read-only operation. But that is really up to you to benchmark in your use case. Another advantage is that the SquashFS image is just a little larger than your data, whereas with ext4 you always have to create a larger loop device. The disadvantage is, of course, that it is rather inconvenient when you need to change the data; that is much easier with ext4.
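A minimal sketch of that uncompressed variant (the source directory /data/db is hypothetical; -noF additionally skips fragment compression) could be:

mksquashfs /data/db /my_db.squashfs -noD -noF    # pack without compressing data blocks
mount -o loop /my_db.squashfs /mnt/              # squashfs is read-only by design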

Fox
  • Thank you for your input. I didn't mention that the files are already compressed with zlib; sorry for that, I will add it to the question now. – Worker May 04 '15 at 13:20
  • @MinimeDJ sure. Then squashfs won't have that big of a benefit. Is there any specific reason why you don't want to stick with what's working for you? – Fox May 04 '15 at 13:25
  • At the moment it is a working prototype. I want to have it as a single file because it would be easier to handle. Also, I want to avoid fragmentation, so I prefer to keep it in a read-only FS. My DB serves a critical app and the response time must always be under 1 second. I have about 50 requests per second... – Worker May 04 '15 at 13:27
  • @MinimeDJ OK, I edited my answer. But your use case is rather unique, so you will have to do the testing/benchmarking yourself. Handling 50 reads/s should be fine with SquashFS as well as with ext4. By the way, if you do not delete from ext4, you will hardly get any fragmentation. – Fox May 04 '15 at 13:37

I am not sure if this fits your purpose, but have you considered using tar to combine multiple files? That might decrease the pressure and space requirements on the filesystem, and your database application could read the data for a specific file with one of the many tar libraries around.

Depending on your access pattern this might even increase the performance.
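As a rough sketch (the source directory /data/db is hypothetical, and the member name reuses a path from the question; note that plain tar has no index, so looking up a member scans the archive unless your tar library builds an index first):

tar -cf /my_db.tar -C /data/db .                    # pack the whole tree into one archive
tar -xOf /my_db.tar ./2/2/6253 > /tmp/one_record    # stream a single member to a file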

Simon