4

I have lots of ~1 GB files (database dump files, taken at regular intervals). Right now I'm just storing them all in one directory, each file gzipped. We're running out of disk space and want to continue to store the old ones. Ignoring the obvious solution of throwing money at the problem and buying more disks, is there any way to store these in a space-efficient manner?

Each file is a database dump file, taken every half hour, so there should be a lot of duplicate content between them. Is there some programme/process that'll make this easier? I don't want to try a new filesystem. I am playing around with git & git-repack, but that uses a lot of memory. Is there anything a bit simpler?

Amandasaurus

3 Answers

3

Moving forward, you could take incremental backups of your database. The trade-off is that they take longer to restore from, and doing a point-in-time restoration from them (e.g. if you need to audit) is much more complex.

Since you say you're able to take a full dump every 30 minutes right now, you could take both a full and an incremental every 30 minutes, keep the fulls for only 6 or 24 hours, and keep the incrementals for the long term. (In theory, if you need recovery speed it's likely to be a disaster recovery scenario, where you'll want the latest backup anyway.)
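A rough sketch of that pruning step (assuming a hypothetical naming scheme of full-*.sql.gz and incr-*.sql.gz under /var/backups/db; adjust it to whatever your dump script actually produces):

    #!/usr/bin/env python3
    # Rough sketch: delete full dumps older than 24 hours, keep incrementals.
    # Assumes a hypothetical naming scheme (full-*.sql.gz / incr-*.sql.gz)
    # under /var/backups/db -- adjust to match your own dump script.
    import time
    from pathlib import Path

    BACKUP_DIR = Path("/var/backups/db")
    KEEP_FULL_SECONDS = 24 * 3600

    now = time.time()
    for dump in BACKUP_DIR.glob("full-*.sql.gz"):
        if now - dump.stat().st_mtime > KEEP_FULL_SECONDS:
            print(f"removing old full dump: {dump}")
            dump.unlink()
    # incr-*.sql.gz files are deliberately left alone for long-term retention.

Run it from cron after each backup cycle.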

If you have questions about incremental backups or other backup strategies, try the database Stack Exchange site.

Joe H.
0

In addition to incremental backups, you could also move older backups to near-line archival storage. This could include a combination of tape, external hard drive, optical media (with caveats), etc.
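If part of that near-line storage mounts as a filesystem (an external drive, say), the move step can be automated with something as small as this sketch; the paths and the 30-day cutoff are only placeholders:

    # Sketch: move gzipped dumps older than ~30 days to a near-line archive
    # mount. The paths and the cutoff are assumptions -- substitute your own.
    import shutil
    import time
    from pathlib import Path

    SOURCE = Path("/var/backups/db")
    ARCHIVE = Path("/mnt/archive/db-dumps")
    CUTOFF_SECONDS = 30 * 24 * 3600

    ARCHIVE.mkdir(parents=True, exist_ok=True)
    now = time.time()
    for dump in SOURCE.glob("*.gz"):
        if now - dump.stat().st_mtime > CUTOFF_SECONDS:
            shutil.move(str(dump), str(ARCHIVE / dump.name))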

My experience is that having convenient access to working backups is good enough. If you require faster access to backups, you can buy more hardware or automate some of the retrieval steps to speed things up.

Dana the Sane
0

You could consider de-duplicating file storage, since your data should have plenty of duplicate content. However, if you go with a hardware solution from a prominent vendor, it will cost you far more than just buying additional disks. The good news is that there are several open source initiatives; one of them is Opendedup. There are a few more, but I don't have the details on them handy.
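To illustrate the basic idea behind block-level de-duplication (this is only a toy sketch, not how Opendedup actually works): split each dump into fixed-size chunks, hash each chunk, and store a chunk only the first time its hash is seen.

    # Toy sketch of block-level de-duplication (not Opendedup itself):
    # split each file into fixed-size chunks, hash each chunk, and write a
    # chunk to the store only the first time its hash is seen.
    # /var/dedup-store is an assumption -- point it anywhere with free space.
    import hashlib
    from pathlib import Path

    CHUNK_SIZE = 128 * 1024                # 128 KiB blocks
    STORE = Path("/var/dedup-store")
    STORE.mkdir(parents=True, exist_ok=True)

    def store_file(path):
        """Store one dump; return the list of chunk hashes that rebuild it."""
        recipe = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                digest = hashlib.sha256(chunk).hexdigest()
                blob = STORE / digest
                if not blob.exists():      # duplicate chunks cost no extra space
                    blob.write_bytes(chunk)
                recipe.append(digest)
        return recipe

    # e.g. recipe = store_file("dump-2009-06-01-1200.sql")

Note that this kind of de-duplication only pays off if the dumps are stored uncompressed (or compressed per chunk); gzipping each whole file first tends to wipe out the duplicate blocks between dumps.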

Another alternative would be backup software or a backup service that already does some sort of de-duplication. We currently use a solution based on Asigra software to back up entire VMware virtual machine images daily, and we achieve a 1:10 data reduction with 30 days of daily retention.

dtoubelis