Backing up data onto many small disks

Goal

I want to create backups of a large file pool using disks that are too small to store everything.

Description of situation

I have a NAS / home server (running Ubuntu 18.04) with a RAID. The capacity of the RAID is much larger than any of the external hard drives I own and that I want to use to back up what's on the RAID.

Current approach

My current approach is to manually pick as few subtrees of the directory structure as possible whose combined size is just below the capacity of one of the drives, compute a checksum of every file via

find /path/to/subtree/ -type f -print0 | sort -z | xargs -r0 sha256sum > sha256sumsBeforeCopying

, copy the files over via

cp -a origin destination

, and verify and store the checksums after copying.

I also try to arrange this so that a different drive still holds an old copy of the subtrees I'm currently copying, because I clear out each drive before starting the process described above.
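The checksum-copy-verify routine above can be sketched as a single script. In this sketch the mktemp directories are merely stand-ins for the real RAID subtree and external-drive mount point, and the demo files are made up:

```shell
# Sketch of the manual routine: checksum, copy, verify. The mktemp
# directories stand in for /path/to/subtree and the mounted external drive.
SRC=$(mktemp -d)        # stand-in for a subtree on the RAID
DRIVE=$(mktemp -d)      # stand-in for the mounted external drive

# demo data
echo "some data" > "$SRC/file1"
mkdir "$SRC/sub" && echo "more data" > "$SRC/sub/file2"

# 1. checksum every file in the subtree, in a stable order
(cd "$SRC" && find . -type f -print0 | sort -z | xargs -r0 sha256sum) \
    > "$DRIVE/sha256sumsBeforeCopying"

# 2. copy the subtree over, preserving attributes
cp -a "$SRC" "$DRIVE/subtree"

# 3. verify the copy against the sums recorded before copying
(cd "$DRIVE/subtree" && sha256sum -c --quiet "$DRIVE/sha256sumsBeforeCopying") \
    && echo "backup verified"
```

Storing the sums file on the drive itself (step 1 writes it there directly) means each drive carries the evidence needed to re-verify it later.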

Question

This, of course, is a time-consuming process. How can I achieve my goal of backing up data onto many small drives more efficiently in terms of effort?

UTF-8

Posted 2019-11-26T14:07:54.053

Pretty much any backup program is designed to handle this exact situation – Keltari – 2019-11-26T14:17:21.457

@Keltari Can you please name one? I use borgbackup on my desktop computer and on my laptops. Pretty sure it can't handle many small drives. I used to use Deja Dup. Pretty sure it wasn't able to do that either. – UTF-8 – 2019-11-26T15:18:34.977

How many drives could you connect at once? You might be interested in creating a RAID 0 system with all drives connected. However, that means connecting all drives every time you access the data, and of course it endangers all the data once one drive fails. So be aware of these pitfalls. – Fiximan – 2019-11-26T16:04:20.440

@Fiximan I actually could connect all drives at once. However, I do not want to do that for two reasons. First of all, it means that if any one of the drives failed, I wouldn't have a backup anymore. Sure, in my current situation if one drive fails I'll lose some data, but I won't lose all of it. But the stronger reason for me not to do that is that if my server gets infected with crypto malware while I'm performing a backup, I'll have neither a backup nor the original data. – UTF-8 – 2019-11-26T18:50:56.867

Verifying the backup! Give this person a cigar! – K7AAY – 2019-11-26T18:55:14.437

Answers

Since you have the capability of connecting all drives at once, the simplest option seems to be mergerfs, which lets you merge several file systems into one. The basic concept, from man mergerfs:

How it works

mergerfs logically merges multiple paths together. Think a union of sets. [...]

         A         +      B        =       C
          /disk1           /disk2           /merged
          |                |                |
          +-- /dir1        +-- /dir1        +-- /dir1
          |   |            |   |            |   |
          |   +-- file1    |   +-- file2    |   +-- file1
          |                |   +-- file3    |   +-- file2
          +-- /dir2        |                |   +-- file3
          |   |            +-- /dir3        |
          |   +-- file4        |            +-- /dir2
          |                     +-- file5   |   |
          +-- file6                         |   +-- file4
                                            |
                                            +-- /dir3
                                            |   |
                                            |   +-- file5
                                            |
                                            +-- file6

There are a few options governing what happens when identical file paths occur on multiple drives, permission policies, etc., but since you are presumably backing up to an empty set of drives, this is not really relevant and the defaults will do.

How to do it:

1) Say you have three external drives, each with one partition: sdb1, sdc1, sdd1

Just mount them as usual:

mkdir /mnt/sd{b,c,d}1
for drive in sd{b,c,d}1 ; do mount /dev/$drive /mnt/$drive ; done

2) Create a combined space and "mount" the drives into your combined FS

mkdir /mnt/merged
mergerfs -o fsname=mergerFS /mnt/sdb1:/mnt/sdc1:/mnt/sdd1 /mnt/merged

Syntax: mergerfs -o <options> <colon:separated:list:of:sources> <target>

Options can be skipped, as the defaults will do; however, it seems nice to specify a pseudo-filesystem name so you can quickly find it in the output of df.

3) Just use /mnt/merged as the backup location and verify your backup.

4) After you are done, you can disassemble the merged file system by simply unmounting it:

umount /mnt/merged

You will now find the files spread across the different drives, but no file will be split up. The individual drives remain fully functional on their own.

Notably, mergerfs is based on FUSE, so even non-root users can merge and unmerge, which preserves the standard user read/write permissions.

mergerfs /media/john/drive1:/media/john/drive2 /home/john/merged_drive
fusermount -u /home/john/merged_drive

Merging multiple file-system types is also possible, but I cannot speak to data safety regarding write operations across different file systems. This should not matter, though, since you will surely have all backup drives with the same file system, I hope.

mergerfs should be available in the standard repositories of most distros (on Ubuntu, sudo apt install mergerfs).


Edit: Side note - sources do not necessarily have to be full drives but can be directories.

Fiximan

The lowly and oft-underestimated tar command (at least GNU tar) seems to handle this beautifully.

I tried this on a 30+ MB directory, using a 10 MB limit on each tar file and supplying the names of the subsequent tar files via a list. (The prompt still appears, without a newline, so you'll see a long line of prompts; other than being visually jarring, this does no harm. In the output below I have added one newline so the output of wc shows up more clearly.)

$ ls -al file*tar
ls: cannot access 'file*tar': No such file or directory

$ cat list  # notice list starts from file2, because the tar command itself starts with file1
n file2.tar
n file3.tar
n file4.tar

$ tar -cf file1.tar -M -L 10M t/ < list
Prepare volume #2 for ‘file1.tar’ and hit return: Prepare volume #3 for ‘file2.tar’ and hit return: Prepare volume #4 for ‘file3.tar’ and hit return: %

$ ls -al file*tar
-rw------- 1 ff ff 10485760 Dec  3 10:02 file1.tar
-rw------- 1 ff ff 10485760 Dec  3 10:02 file2.tar
-rw------- 1 ff ff 10485760 Dec  3 10:02 file3.tar
-rw------- 1 ff ff  1392640 Dec  3 10:02 file4.tar

$ tar -tf file1.tar -M < list | wc
Prepare volume #2 for ‘file1.tar’ and hit return: Prepare volume #3 for ‘file2.tar’ and hit return: Prepare volume #4 for ‘file3.tar’ and hit return:
30      74    1106
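The prompts can be avoided entirely with GNU tar's -F (--info-script) option, which runs a script at each volume change: tar sets TAR_VOLUME to the next volume number and TAR_FD to a descriptor the script writes the next archive name to. A self-contained sketch (file names and sizes here are made up):

```shell
# Non-interactive multi-volume archiving with GNU tar's -F option.
# Everything runs in a scratch directory; sizes and names are made up.
cd "$(mktemp -d)"
mkdir t
dd if=/dev/zero of=t/big bs=1M count=3 2>/dev/null   # 3 MiB of demo data

# Volume script: tar sets TAR_VOLUME (next volume number) and TAR_FD
# (descriptor to answer on); the script replies with the next archive name.
cat > next-volume.sh <<'EOF'
#!/bin/bash
echo "file${TAR_VOLUME}.tar" >&"$TAR_FD"
EOF
chmod +x next-volume.sh

# Create 1 MiB volumes without any prompts
tar -cf file1.tar -M -L 1M -F ./next-volume.sh t/

# The same script answers the volume changes when listing or extracting
tar -tf file1.tar -M -F ./next-volume.sh
```

For the backup-to-drive case, the script could instead prompt you to swap drives and echo a path on the newly mounted one.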

sitaram
