Is there a way to estimate how much space a file or directory of a given size will take up after being compressed with tar and bzip2?

0

Due to an imminent distro switch, I would like to back up my home directory. However, my home directory is 29 gigabytes. I would like to know how much space this would take up after being compressed with tar cvjf home.tar /home. Is there a way to determine the size after compression?

Blue-Maned Hawk

Posted 2019-10-20T03:29:55.537

Reputation: 347

29GB fits on most USB drives easily, and even on many cloud providers. So, I'd skip the compression and use something like rsync instead. (Also, you very likely need a proper backup mechanism any day of the week!) – Arjan – 2019-10-20T08:38:25.813

"Due to an imminent distro switch, I would like to backup my home directory": You should back up your home directory regularly, and not wait for a distro switch. – xenoid – 2019-10-20T12:59:22.277

@Arjan, the problem is that most USB flash drives are formatted as FAT32 or exFAT, which both limit their file sizes at 4GB, I think. – Blue-Maned Hawk – 2019-10-21T05:41:22.197

Good point, so: format an external disk differently ;-) (Agreed, user/group access rights are also easily preserved using an archive. Still, archiving isn't a good way to do daily backups. Of course, that's not what you're asking. As for using rsync for that, see "Time Machine on Ubuntu?".) – Arjan – 2019-10-21T10:16:55.140

Answers

2

It's not possible to know for certain what size data will compress to without actually compressing it. What you can do is make an educated guess based on the content you have in your home directory. I'm not aware of any tools to do this automatically, but it's not a difficult process.

Many modern file formats are already compressed, meaning that running them through compression again will give you little to no (or even negative) gain. With this type of data you're better off skipping the compression and simply copying or archiving it as is. Examples of this would be compressed video (mp4, webm, mov, etc.), compressed images (jpeg, png, etc.), existing archives (zip, rar, gz, bz2, etc.), and more.

Text files will generally compress fairly well, especially if there is a lot of repeated data (e.g., log files). You could try compressing a sample of your files to see how they compress and use that as a guess, or use something like 50% as a rough estimate.
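The sampling idea above could be sketched roughly like this. It assumes you pick a representative subdirectory yourself; the function name ratio_percent is made up for this example, and the f - option is only there to make tar write to stdout explicitly.

```shell
# Print the bzip2-compressed size of a directory as a percentage of its
# uncompressed tar size. Hypothetical helper, not a standard tool.
# Usage: ratio_percent DIR
ratio_percent() {
    # Size of the plain tar stream in bytes ($(( )) also strips any padding wc adds).
    raw=$(( $(tar cf - "$1" | wc -c) ))
    # Size of the same stream after bzip2 (the j option).
    packed=$(( $(tar cjf - "$1" | wc -c) ))
    [ "$raw" -gt 0 ] || return 1
    echo $(( packed * 100 / raw ))
}

# Example call (the path is only an illustration):
#   ratio_percent ~/Documents
```

The result is only as good as the sample: a directory full of logs will suggest a much better ratio than one full of photos.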

Finally, see what portion of your data is made up of each type and multiply that by your estimated percentage to get an estimate of your final size. For example, if 20GB of your data is already compressed and 9GB is text files, your final compressed data size would probably range from 21GB to 25GB.
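As a rough worked version of that arithmetic, here is a small sketch. The helper name estimate_total and the SIZE:PERCENT argument format are invented for this example, and the percentages are guesses (about 100% for already-compressed data, the 50% rule of thumb for text).

```shell
# Sum up size * expected-percentage for each SIZE_GB:PERCENT pair.
# Hypothetical helper; the percentages are estimates you supply yourself.
estimate_total() {
    printf '%s\n' "$@" |
        LC_ALL=C awk -F: '{ total += $1 * $2 / 100 } END { printf "%.1f\n", total }'
}

# 20GB that stays at about 100%, plus 9GB of text at about 50%:
estimate_total 20:100 9:50    # prints 24.5
```

That figure sits inside the 21GB to 25GB range given above; text that compresses better than 50% would push it toward the lower end.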

kicken

Posted 2019-10-20T03:29:55.537

Reputation: 906

1

Unfortunately, no. The only way to see how large a compressed archive will be is to create the compressed archive. There is no tool that would do this, as the tool would be doing all the work of the compression program, without writing the final archive, which would be a waste of time.

Perhaps you should consider breaking your data down into manageable chunks and creating several archives. This will allow you to break the large amount of time it will take to archive 29 GB into smaller slices.
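One possible way to do this, sketched under the assumption that one archive per top-level entry of the directory is a sensible chunk size (the function name archive_in_chunks and the naming scheme are made up here):

```shell
# Create one .tar.bz2 per top-level entry of SRC inside DEST and print each size.
# Hypothetical helper. DEST should live outside SRC.
# Usage: archive_in_chunks SRC DEST
archive_in_chunks() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # The second pattern picks up dotfiles such as .config as well.
    for entry in "$src"/* "$src"/.[!.]*; do
        [ -e "$entry" ] || continue    # skip a pattern that matched nothing
        name=$(basename "$entry")
        # -C makes the paths inside the archive relative to SRC.
        tar cjf "$dest/$name.tar.bz2" -C "$src" "./$name" || return 1
        echo "$name: $(( $(wc -c < "$dest/$name.tar.bz2") )) bytes"
    done
}

# Example call (both paths are only illustrations):
#   archive_in_chunks /home/me /mnt/backup/home-archives
```

If one chunk fails, for instance because the disk fills up, only that archive has to be redone.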

Keltari

Posted 2019-10-20T03:29:55.537

Reputation: 57 019

That's an interesting proposal. I think I'll try that. – Blue-Maned Hawk – 2019-10-20T04:23:55.923

1

The tool is tar (with bzip2 implicitly involved because of the j option you used) piped to wc, a standard (POSIX) tool that can count bytes. The following command will print the size in bytes:

tar cj /home | wc -c

The command really does (and I'm citing another answer here) "all the work of the compression program, without writing the final archive, which would be a waste of time"; but if you really want to know then this is the only firm way.


You can improve the overall approach like this:

tar cj /home | tee home.tbz2 | wc -c
  • If you're lucky and the space you have for home.tbz2 turns out to be enough then you will get no error from tee and the file will end up of size equal to what wc -c will report.
  • Otherwise tee will report no space left, yet it will continue writing to its stdout. wc -c will tell you how big the file would be. The actual (incomplete) file will be smaller and you should delete it afterwards.

When using tar with v, you may miss a "no space left" message among the verbose output. You can still tell what happened by comparing the output you got from wc -c to the actual size of home.tbz2. In Bash you can retrieve the exit status of tee with ${PIPESTATUS[1]}.
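The size comparison described above could be wrapped up as follows. This is only a sketch: the function name archive_and_check is invented, it uses the comparison rather than ${PIPESTATUS[1]} so that it also works in shells other than Bash, and f - is added to make tar write to stdout explicitly.

```shell
# Archive DIR to OUT while counting the bytes bzip2 produced, then compare
# that count with the size of the file that was actually written.
# Hypothetical helper. Usage: archive_and_check DIR OUT
archive_and_check() {
    dir=$1
    out=$2
    # wc -c sees everything tee passes on, even if tee fails to write OUT.
    expected=$(tar cjf - "$dir" | tee "$out" | wc -c | tr -d ' ')
    [ -f "$out" ] || return 1
    actual=$(wc -c < "$out" | tr -d ' ')
    if [ "$expected" -eq "$actual" ]; then
        echo "complete: $actual bytes"
    else
        echo "incomplete: needed $expected bytes, wrote $actual" >&2
        return 1
    fi
}

# Example call (the output path is only an illustration):
#   archive_and_check /home /mnt/backup/home.tbz2
```

On an "incomplete" result, the number reported as needed tells you how much free space a second attempt requires, and the partial file should be deleted.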

Kamil Maciorowski

Posted 2019-10-20T03:29:55.537

Reputation: 38 429