26

I run a website where about 10 million files (book covers) are stored in 3 levels of subdirectories, ranging [0-f]:

0/0/0/
0/0/1/
...
f/f/f/

This leads to around 2400 files per directory, which is very fast when we need to retrieve one file. This is moreover a practice suggested by many questions.
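
For illustration, here is one plausible way such a layout could map a file identifier to its directory. The question does not say how files are assigned to directories, so the checksum-based placement, the function name `shard_path`, and the sample identifier are all assumptions of this sketch.

```shell
# shard_path ID: print a 3-level directory such as "a/3/f" for an ID.
# Hashing with cksum is an assumption; the question does not say how
# files are actually assigned to directories.
shard_path() (
    crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    # 16^3 = 4096 leaf directories, so keep 12 bits of the checksum.
    hex=$(printf '%03x' $(( crc % 4096 )))
    a=$(printf '%s' "$hex" | cut -c1)
    b=$(printf '%s' "$hex" | cut -c2)
    c=$(printf '%s' "$hex" | cut -c3)
    echo "$a/$b/$c"
)

shard_path 9780131103627
```

With 4096 leaf directories, 10 million files average out to about 2441 per directory, which matches the "around 2400" figure above.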

However, when I need to back up these files, it takes many days just to browse the 4k directories holding 10m files.

So I'm wondering if I could store these files in a container (or in 4k containers), which would each act exactly like a filesystem (some kind of mounted ext3/4 container?). I guess this would be almost as efficient as directly accessing a file in the filesystem, and it would have the great advantage of being copied to another server very efficiently.

Any suggestion on how to do this best? Or any viable alternative (noSQL, ...) ?

BenMorel
  • What file system are you using right now? – cmcginty May 29 '11 at 19:36
  • NetApp is likely to be an option if you can afford the prices – Ian Ringrose May 29 '11 at 21:53
  • I'm using ext4 under CentOS 5.6 – BenMorel May 30 '11 at 12:14
  • Curious why it should take "many days just to browse the 4k directories holding 10m files", which seems way too slow. Assuming 150 bytes per pathname, the 10m filenames make 1.5 GB of data, so it could be the available memory/CPU (including sorting the result). Also, check if enabling/disabling dir_index helps: http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/#comment-76315 plus various tips at http://serverfault.com/questions/183821/rm-on-a-directory-with-millions-of-files – RichVel Sep 06 '11 at 12:43
  • Note 5 years later: I've migrated everything to Amazon S3, which is perfectly suited for storing such a large number of files. Plus I don't have to split files into 3 levels of sub-directories anymore, as for S3 it makes no difference (a path is a path, whether or not it contains slashes). And I can sleep better, knowing that my data is safely replicated across several locations. – BenMorel Jan 19 '16 at 10:29
  • Same here! After running into strange problems when saving millions of files into one AND multiple folders (hash collision errors, inode errors, 64k folder limit and lots of other BS), we gave up and migrated to DigitalOcean Spaces (similar to Amazon S3). Way, way slower than self-hosting, but it's a solution. – Sliq Feb 01 '20 at 20:58

12 Answers

12

Options for quickly accessing and backing up millions of files

Borrow from people with similar problems

This sounds very much like an easier version of the problem that USENET news servers and caching web proxies face: hundreds of millions of small files that are randomly accessed. You might want to take a hint from them (except they don't typically ever have to take backups).

http://devel.squid-cache.org/coss/coss-notes.txt

http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4074B50D266E72C69D6D35FEDCBBA83D?doi=10.1.1.31.4000&rep=rep1&type=pdf

Obviously the cyclical nature of the cyclic news filesystem is irrelevant to you, but the lower level concept of having multiple disk files/devices with packed images and a fast index from the information the user provides to look up the location information is very much appropriate.

Dedicated filesystems

Of course, these are similar in concept to what people were suggesting about creating a filesystem in a file and mounting it over loopback, except that you get to write your own filesystem code. Since you said your system is read-mostly, you could instead dedicate a disk partition (or an LVM logical volume, for flexibility in sizing) to this one purpose. When you want to back up, mount the filesystem read-only and then make a copy of the partition bits.

LVM

I mentioned LVM above as being useful to allow dynamic sizing of a partition so that you don't need to back up lots of empty space. But LVM has other features that might be very much applicable, specifically the "snapshot" functionality, which lets you freeze a filesystem at a moment in time. Any accidental rm -rf or whatever would not disturb the snapshot. Depending on exactly what you are trying to do, that might be sufficient for your backup needs.
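
A minimal sketch of the snapshot-then-copy idea follows. The volume group `vg0`, the logical volume `covers`, the 5G snapshot size, and the image path are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set, since the real ones need root and an existing volume group.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# snapshot_backup VG LV IMAGE: freeze LV in a snapshot, copy it, drop it.
# All names and the snapshot size are hypothetical.
snapshot_backup() (
    vg=$1 lv=$2 image=$3
    snap="${lv}_snap"
    # Reserve 5G of copy-on-write space for changes made during the copy.
    run lvcreate --snapshot --size 5G --name "$snap" "/dev/$vg/$lv"
    # Read the frozen snapshot sequentially instead of walking directories.
    run dd "if=/dev/$vg/$snap" "of=$image" bs=4M
    run lvremove -f "/dev/$vg/$snap"
)

snapshot_backup vg0 covers /backup/covers.img
```

The snapshot is removed as soon as the copy finishes, because it consumes its reserved space as the origin volume changes.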

RAID-1

I'm sure you are familiar with RAID already and probably already use it for reliability, but RAID-1 can be used for backups as well, at least if you are using software RAID (you can use it with hardware RAID, but that actually gives you lower reliability because it may require the same model/revision controller to read). The concept is that you create a RAID-1 group with one more disk than you actually need connected for your normal reliability needs (e.g. a third disk if you use software RAID-1 with two disks, or perhaps a large disk and a hardware RAID-5 of smaller disks, with a software RAID-1 on top of the hardware RAID-5). When it comes time to take a backup, install a disk, ask mdadm to add that disk to the RAID group, wait until it indicates completeness, optionally ask for a verification scrub, and then remove the disk. Depending on performance characteristics, you can either keep the disk installed most of the time and remove it only to exchange it with an alternate disk, or install the disk only during backups.
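
The mdadm steps described above might look roughly like this. The array `/dev/md0` and the backup disk `/dev/sdc1` are hypothetical, the array is assumed to normally have two members, and the optional verification scrub is left out. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# raid1_backup ARRAY DISK: sync a temporary extra mirror, then detach it.
# Assumes a software RAID-1 that normally has two members (hypothetical).
raid1_backup() (
    array=$1 disk=$2
    run mdadm "$array" --add "$disk"
    # Make the new disk an active third mirror rather than a spare.
    run mdadm --grow "$array" --raid-devices=3
    # Block until the resync has finished.
    run mdadm --wait "$array"
    # Mark the synced disk as failed so it can be removed, then shrink back.
    run mdadm "$array" --fail "$disk" --remove "$disk"
    run mdadm --grow "$array" --raid-devices=2
)

raid1_backup /dev/md0 /dev/sdc1
```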

Seth Robertson
  • Very complete answer, which summarises good solutions. I think I'll keep my existing filesystem structure, and use LVM snapshots, which seems to be perfect for my use case. – BenMorel May 30 '11 at 14:40
9

You could mount a virtual filesystem using the loopback manager but while this would speed up your backup process, it might affect normal operations.
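
As a rough sketch of such a loopback container: the image path, the 200G size, and the mount point below are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set, because mounting needs root.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# make_container IMAGE SIZE MOUNTPOINT: create an ext4 filesystem inside
# a regular file and mount it over loopback. All arguments are hypothetical.
make_container() (
    image=$1 size=$2 mnt=$3
    # Allocate a sparse file of the requested size.
    run truncate -s "$size" "$image"
    # -F lets mkfs.ext4 work on a regular file without prompting.
    run mkfs.ext4 -q -F "$image"
    run mkdir -p "$mnt"
    run mount -o loop "$image" "$mnt"
)

make_container /srv/covers.img 200G /srv/covers
```

Backing up then means copying the single image file, ideally while it is unmounted or mounted read-only so the copy is consistent.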

Another alternative is to back up the entire device using dd. For example, dd if=/dev/my_device of=/path/to/backup.dd.

  • +1 Backing up the device itself is a good idea. – asm May 29 '11 at 17:55
  • You should, if you use this approach, test the restore (well, you should always do that), because if your input is a disk like /dev/sdd, dd will store the partition scheme and sizes. If you restore it to a smaller disk, you will get errors, and if you restore it to a bigger disk, it will show up truncated. It works best if you restore the data to another disk of the same type. Restoring partitions only (/dev/sdd1) will be less troublesome. – user unknown May 29 '11 at 18:09
  • Note that if the device is on LVM, a backup can also be performed without unmounting the disk using LVM snapshots. – bdonlan May 29 '11 at 21:11
  • I second the LVM snapshot backup approach. I leveraged lvm in the past for live DR replication. Using dd in combination with snapshots makes it easy to do quick block-level backups. – slashdot May 30 '11 at 00:31
  • I tried `dd` over `nc` and this does a good job! However, copying the live partition may give me inconsistent or corrupted data, unlike copying an LVM snapshot. – BenMorel May 30 '11 at 14:42
9

As you probably know, your problem is locality. A typical disk seek takes 10 ms or so. So just calling stat() (or open()) on 10 million randomly placed files requires 10 million seeks, or around 100,000 seconds, which is roughly 28 hours.
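
The estimate can be checked with a few lines of shell arithmetic. The 10 ms seek time and the file count come from the text; the variable names are only illustrative.

```shell
# Back-of-the-envelope check of the seek-bound estimate.
files=10000000        # 10 million files, one seek each
seek_ms=10            # typical seek time quoted in the answer
total_s=$(( files * seek_ms / 1000 ))
hours=$(( total_s / 3600 ))
echo "total seconds: $total_s"
echo "whole hours:   $hours"
```

Integer division truncates the result; the exact figure is about 27.8 hours.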

So you must put your files into larger containers, such that the relevant number is your drive bandwidth (50-100 MB/sec for a single disk, typically) rather than your seek time. Larger containers also let you throw a RAID at the problem, which increases the bandwidth (but does not reduce seek time).

I am probably not telling you anything you do not already know, but my point is that your "container" idea will definitely solve the problem, and just about any container will do. Loopback mounts will likely work as well as anything.

Nemo
  • Yup, locality is crucial. Look at your usage patterns. Most problems tend to follow the Pareto Principle (80% of processes hitting 20% of data), so if you could figure out which files need to be cached in RAM, or just put them on a separate partition with a different layout of directories so that it takes fewer directory lookups or seeks, it would probably help a lot. Spreading the frequently accessed files across different disk spindles so seeks can be done in parallel could also help. +1 for @nemo for bringing up locality of reference. – Marcin May 30 '11 at 15:24
5

There are a couple of options. The simplest, which should work with all Linux filesystems, is to use dd to copy the entire partition (/dev/sdb3 or /dev/mapper/Data-ImageVol) to a single image and archive that image. To restore individual files, loopback mount the image (mount -o loop /usr/path/to/file /mountpoint) and copy out the files you need. For a full partition restore, you can reverse the direction of the initial dd command, but you really do need a partition of identical size.
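
The image backup and single-file restore could be sketched as follows. The image path, mount point, and file names are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# restore_one IMAGE MOUNTPOINT FILE DEST: pull one file out of a dd image.
restore_one() (
    image=$1 mnt=$2 file=$3 dest=$4
    run mkdir -p "$mnt"
    # Mount read-only so the archived image is never modified. If the image
    # was taken from a mounted ext3/ext4 filesystem, the journal may need
    # replay; adding the noload mount option skips it.
    run mount -o loop,ro "$image" "$mnt"
    run cp -p "$mnt/$file" "$dest"
    run umount "$mnt"
)

# Take the image of the partition, then restore a single (made-up) cover.
run dd if=/dev/sdb3 of=/backup/covers.img bs=4M
restore_one /backup/covers.img /mnt/restore 0/0/0/example.jpg /tmp/example.jpg
```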

Judging from your use-case, I'm guessing individual file-restores are a very infrequent event, if they ever occur at all. This is why an image-based backup really makes sense here. If you do need to make individual restores more often, using staged LVM snapshots will be a lot more convenient; but you still need to do the image-based backup for those critical "we lost everything" disasters. Image-based restores tend to go a lot faster than tar-based restores simply because they just restore blocks: they avoid the metadata operations incurred with every fopen/fclose, and they can be a highly sequential disk operation, which speeds things up further.

Alternately, as the Google video @casey pointed to mentions about half way through, XFS is a great filesystem (if complex). One of the nicer utilities with XFS is the xfsdump utility, which will dump an entire filesystem to a single file, and generally do so faster than tar can. It's a filesystem-specific utility, so can take advantage of fs internals in ways that tar can't.
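
A minimal sketch of an xfsdump full backup and its restore is shown below. The mount point, dump file, and the two label strings are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# xfs_full_dump MOUNTPOINT FILE: level-0 dump of a mounted XFS filesystem.
xfs_full_dump() (
    mnt=$1 file=$2
    # -l 0 requests a full dump; -L and -M set the session and media labels
    # up front so xfsdump does not stop to ask for them.
    run xfsdump -l 0 -L covers-full -M covers-media -f "$file" "$mnt"
)

# xfs_full_restore FILE DEST: unpack a dump file into DEST.
xfs_full_restore() (
    run xfsrestore -f "$1" "$2"
)

xfs_full_dump /srv/covers /backup/covers.xfsdump
xfs_full_restore /backup/covers.xfsdump /srv/covers-restored
```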

sysadmin1138
  • Lots of good answers there! XFS seems to be interesting, but I'm afraid it's a bit out of my reach. – BenMorel May 30 '11 at 14:45
3

I would suggest you first try upgrading to EXT4, if you are not running it already.

Google has done a lot of research into why EXT4 is a good idea.

After that you should look into deploying a distributed file system architecture. For example:

cmcginty
2

Perhaps a simplistic answer, but my first thought was to use something like GridFS, which is built on top of MongoDB. Many of the primary language drivers support it out of the box, so you should be able to just swap it in for the file-reading sections of your code. Also, you could just make your existing directory paths the keys to these files.

One problem you might have is that Mongo tends to slow down pretty fast if it's seeking from disk all the time. With 10 million files, I expect most of your data will be on disk. The chunks of files in GridFS are 4MB, as I recall, so if your files are bigger than that you'll be doing several costly operations to get one file. The key, I think, would be to shard your files based on your already tidy directory structure so that you could have several instances of Mongo running on several boxes to lighten the load. However, I don't know what your performance requirements are, so I might be over-thinking it.

What's the benefit of all of this? Performance that pretty closely matches disk reads if done right. Also, Mongo comes with several great built-in ways to back up the whole swath of data in a DB instance quickly, even with the database still running.
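
As a rough illustration using MongoDB's command-line tools rather than a language driver: the database name `covers`, the tree root, and the file name are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# gridfs_put DB ROOT RELPATH: store one cover in GridFS under its
# existing relative path, so the path acts as the key.
gridfs_put() (
    db=$1 root=$2 rel=$3
    # Run from the tree root so the stored name is the relative path.
    run cd "$root"
    run mongofiles --db "$db" put "$rel"
)

# gridfs_backup DB OUTDIR: dump the database while it keeps running.
gridfs_backup() (
    run mongodump --db "$1" --out "$2"
)

gridfs_put covers /srv/covers 0/0/0/example.jpg
gridfs_backup covers /backup/mongo
```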

daveslab
  • Will definitely have a look at GridFS which I didn't know, but I think I will end up keeping everything filesystem-based to lower the amount of work, as everything is already working! – BenMorel May 30 '11 at 12:32
1

If you'd be happy with an appliance model for your data storage, maybe you could consider NexentaStor. It runs ZFS on OpenSolaris under the hood but all administration is through a web GUI.

There are a couple of features that would help with your issue.

  • The Enterprise version supports a form of remote replication based on snapshots which does not require scanning through the whole filesystem.

  • If you don't mind getting your hands dirty, ZFS has a very handy zfs diff command which efficiently tells you which files have been added, modified, or deleted since the last snapshot, without needing to scan through the whole filesystem. You could incorporate this into your backup system to greatly reduce the time required to perform incremental backups.
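
The snapshot comparison from the second point might be driven like this. The dataset `tank/covers` and the date-based snapshot names are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# zfs_changes DATASET PREV NEW: take a new snapshot and list what changed
# since the previous one. Dataset and snapshot names are hypothetical.
zfs_changes() (
    dataset=$1 prev=$2 new=$3
    run zfs snapshot "$dataset@$new"
    # zfs diff marks each path as created (+), removed (-), modified (M)
    # or renamed (R) between the two snapshots.
    run zfs diff "$dataset@$prev" "$dataset@$new"
)

zfs_changes tank/covers 2011-05-29 2011-05-30
```

An incremental backup job could then copy only the paths reported as created or modified.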

Tom Shaw
1

You can use the standard dump utility for backing up an EXT4 filesystem with lots of files. This utility first checks which blocks are used on a filesystem and then backs them up in disk order, eliminating most seeks.

There's a corresponding restore utility for restoring backups created by dump.

It supports incremental backups using levels: a level 1 backup saves files modified since the last level 0 (full) backup, a level 2 backup saves files modified since the last level 1 backup, and so on.
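
A short sketch of a level 0 dump followed by a level 1 dump is given below. The device and file names are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# dump_level LEVEL DEVICE FILE: write a dump of the given level to FILE.
dump_level() (
    level=$1 device=$2 file=$3
    # -u records the dump in /etc/dumpdates, which is how a later
    # higher-level dump knows what has changed since this one.
    run dump "-${level}u" -f "$file" "$device"
)

dump_level 0 /dev/sdb1 /backup/covers.0.dump   # full backup
dump_level 1 /dev/sdb1 /backup/covers.1.dump   # changes since level 0
```

To restore, run `restore -rf` on each dump file in turn from inside a freshly created filesystem, level 0 first and then level 1.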

Tometzky
0

For incremental backups, one option would be to have a second, shadow tree for new covers. That is, you'd have your main tree which is used for all read operations. You'd also have a newfiles/012345.....jpg directory; newly added covers create a hardlink here as well as in the main tree. When performing backups, you can back up the main tree occasionally, but back up the (much smaller) newfiles tree much more regularly.

Note that in order to keep the newfiles tree small, prior to performing a new backup of the main tree, you can empty the newfiles tree:

mv newfiles newfiles_
mkdir newfiles
rm -rf newfiles_

Once you do this, of course, you are committed to producing a new backup of the main tree.
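
The step that adds a new cover could look something like this. The directory names `main` and `newfiles` follow the description above; flattening the relative path into a single file name inside newfiles is an assumption of this sketch, as are the function name and the sample file.

```shell
# add_cover SRC ROOT RELPATH: store a new cover in the main tree and
# hard-link it into the shadow tree. Flattening RELPATH into one file
# name under newfiles/ is an assumption of this sketch.
add_cover() (
    src=$1 root=$2 rel=$3
    mkdir -p "$root/main/$(dirname "$rel")" "$root/newfiles" || exit 1
    cp "$src" "$root/main/$rel" || exit 1
    # A hard link is a second name for the same inode: no data is copied.
    ln "$root/main/$rel" "$root/newfiles/$(echo "$rel" | tr '/' '_')"
)

# Demonstration in a throwaway directory.
demo=$(mktemp -d)
echo "cover data" > "$demo/upload.jpg"
add_cover "$demo/upload.jpg" "$demo" 0/1/2/012345.jpg
ls "$demo/newfiles"
```

Hard links only work within one filesystem, so both trees have to live on the same volume.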

bdonlan
  • Interesting approach, thanks for sharing it. But I'm afraid it would involve a lot of changes in the application, and it would be difficult to keep the application and the storage needs in two separate layers. – BenMorel May 30 '11 at 14:47
0

Adding a little bit of concurrency usually helps.

I have a problem similar to yours; in my case I have to back up around 30 million files, most of them HTML, PHP or JPEG files. For me BackupPC + rsync over ssh works kind of OK; a full backup takes around one day, but incrementals usually finish in a couple of hours.

The trick is to add each top-level directory (0, 1, 2 ... a, b, c ...) as a separate target in BackupPC and let it perform the backups in parallel, so it simultaneously backs up directories a/, b/, c/ and so on. Depending on your disk subsystem, anything from a couple of processes to around 10 processes is probably the fastest way to back up.
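
The same one-job-per-top-level-directory idea can be sketched without BackupPC, using plain rsync driven by xargs. This is a substitute technique, not the BackupPC configuration itself; the source path, destination host, and job count are hypothetical.

```shell
# Hypothetical source tree, destination, and number of parallel jobs.
SRC=${SRC:-/srv/covers}
DEST=${DEST:-backup@backuphost:/backups/covers}
JOBS=${JOBS:-4}

# Emit the 16 top-level directory names 0..f, one per line.
top_level_dirs() {
    for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
        echo "$d"
    done
}

# Run up to $JOBS rsync processes at once, one per top-level directory.
# The leading "echo" makes this a dry run that only prints the commands;
# remove it to actually copy.
top_level_dirs | xargs -P "$JOBS" -I{} echo rsync -a "$SRC/{}/" "$DEST/{}/"
```

Because the jobs run concurrently, the printed lines may appear in any order.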

LVM snapshots and block-level backups are also an option, but with BackupPC and file-level backup you can still restore individual files or directories if needed.

Janne Pikkarainen
  • I am surprised that backing up the root directories concurrently solves the problem for you, I would expect that to be actually slower. Are all the directories on the same disk? Are you using an SSD? – BenMorel Jun 01 '11 at 08:17
  • The data files are stored on SAN. – Janne Pikkarainen Jun 01 '11 at 09:09
  • Okay, makes sense now: you gain efficiency from accessing several files simultaneously, because your different folders are most likely physically located on different drives in the SAN, or at least replicated on several drives, which allows for concurrent access. I'm only running on a RAID-1, so I guess that beyond two concurrent accesses my speed is very likely to go down. – BenMorel Jun 01 '11 at 16:26
0

Benjamin,

I think that your problem can be addressed at the level of the number of files per directory!

Does the access time change by a significant factor if you store 20,000 files in a directory?

Also, have you thought about storing the filesystem metadata on a separate, faster drive (like an SSD)?

Dragos
0

I'd recommend a good old relational database instead.

I'd use PostgreSQL with, say, 256 partitioned tables (cover_00, cover_01, ..., cover_ff), storing the image data in a bytea (binary) column with external storage and using the file identifier as the primary key. Retrieving an image would be fast (thanks to the index on the primary key), data integrity would be guaranteed (ACID-compliant database), and the backup would run in disk order, so without too much seeking.
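
A small generator for such a schema might look like this. The table names follow the answer; the column names `id` and `image` and the `text` key type are hypothetical, and routing each identifier to its table is left to the application.

```shell
# Print DDL for 256 tables cover_00 .. cover_ff.
# Column names and the key type are hypothetical.
gen_cover_tables() {
    i=0
    while [ "$i" -lt 256 ]; do
        t=$(printf 'cover_%02x' "$i")
        echo "CREATE TABLE $t (id text PRIMARY KEY, image bytea NOT NULL);"
        # EXTERNAL keeps large values out of line and uncompressed,
        # which suits image data that is already compressed.
        echo "ALTER TABLE $t ALTER COLUMN image SET STORAGE EXTERNAL;"
        i=$((i + 1))
    done
}

# Show the statements for the first table only.
gen_cover_tables | sed -n '1,2p'
```

The full output could be piped into psql against an existing database to create all 256 tables.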

Tometzky