How can I compress a file on Linux in-place, without using additional disk space?

21

5

I've got a 100 GB drive that has a 95 GB file on it. I need to free up some space on the drive (and right now transferring the file off the drive is not an option). The file would compress well with gzip or bzip2 or whatever, but all these programs write the compressed output to a separate file. I don't have enough free space for that.

Is there a way, using standard compression tools or other Unix utilities, to compress the file without using any additional disk space (or at least only a minimal amount)? I'm picturing something that compresses part of the file at a time and writes the result directly over the original data. I realize this would be risky, as the file would be corrupted if the compression were interrupted, but I don't think I have a choice.

Lee

Posted 2012-01-14T02:49:54.370

One last option we used to use at my old place was to have a dir somewhere which contained a whole bunch of 1G files filled with garbage. Then, if you got into a pinch, you could remove some of them to give you a bit of emergency space. – None – 2015-05-20T11:55:05.340

Answers

14

This is a proof-of-concept bash one-liner, but it should get you started. Use at your own risk.

# compress the file over itself, then cut it down to the length of the compressed data
truncate -s "$(gzip -c file | dd of=file conv=notrunc 2>&1 | sed -n '$ s/ .*$// p')" file
mv file file.gz

This works by piping the gzip output to a dd process that writes it back over the same file. Upon completion, the file is truncated to the size of the gzip output.

This assumes that the last line of dd's output matches:

4307 bytes (4.3 kB) copied, 2.5855e-05 s, 167 MB/s

The first field is the number of bytes written, which is the size the file needs to be truncated to. I'm not 100% sure that the output format is always the same.
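
For anyone who wants a bit more scaffolding around it, here is a slightly more defensive sketch of the same idea. This is a hypothetical script, not a tested tool: it assumes GNU coreutils and an English-locale dd whose final status line starts with the number of bytes written, and it is exactly as risky as the one-liner above.

#!/usr/bin/env bash
# In-place compression sketch -- same approach as the one-liner above.
# An interruption at the wrong moment still corrupts the file.
set -euo pipefail

f="$1"

# gzip streams compressed data to dd, which writes it back over the start of the
# same file. conv=notrunc stops dd from truncating the file to zero length
# before gzip has had a chance to read it.
bytes=$(gzip -c "$f" | LC_ALL=C dd of="$f" conv=notrunc 2>&1 | awk 'END { print $1 }')

# Cut the file down to the size of the compressed data, then rename it.
truncate -s "$bytes" "$f"
mv "$f" "$f.gz"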

user710307

Isn't it possible that at some point the compression program (e.g. gzip) has written more header and data bytes than it has read from the original, thus overwriting parts of the file it hasn't compressed yet? I guess this depends on the chosen compression program. Does anyone have an idea how to prevent this from happening, or how (im)probable it is? – Daniel Böhmer – 2016-01-18T15:01:50.183

Not including the header, the worst case for gzip is 5 bytes extra per 65535 bytes of input (or around 5 MB per 65 GB). But I don't know of any currently existing tool that will buffer the stream that way. – mwfearnley – 2017-07-28T21:49:02.227

Nifty trick. Could you explain why conv=notrunc is necessary? – sleske – 2012-01-17T15:35:57.270

Maybe it's not. gzip -c file | dd of=file appears to work just as well. – user710307 – 2012-01-17T20:56:13.107

user710307: Good point. It does seem to work (though I'm not quite sure why). Care to edit your answer? – sleske – 2012-01-18T08:17:11.483

BTW, to educate myself, I asked a separate question about how this works behind the scenes: http://superuser.com/questions/379718/compressing-a-file-in-place-does-gzip-c-file-dd-of-file-really-work – sleske – 2012-01-18T08:23:51.023

Interesting. After thinking about it a bit more, it worked on the test dump I did, but they were plain text files which compress well. It's possible for gz output to be larger than its input -- in which case reading and writing would overlap... and that would be bad. :) Though it may be possible to work around that by playing with the buffer sizes. – user710307 – 2012-01-18T12:53:18.753

People at the linked question tried it (and I tried it, too); it does not work in general. Seems it only works for very small files - maybe because gzip will read a small file into RAM before compressing it. For large files (a few MB), it does not work, even if they are compressible. – sleske – 2012-01-18T15:17:39.600

Yep. So conv=notrunc is necessary. – user710307 – 2012-01-18T15:51:17.033
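
If it helps to see what conv=notrunc actually changes, here is a tiny throwaway demonstration (hypothetical file name, nothing to do with the real data):

printf 'hello world\n' > demo.txt
printf 'HI' | dd of=demo.txt 2>/dev/null               # dd truncates the file first: it now contains just "HI"
printf 'hello world\n' > demo.txt
printf 'HI' | dd of=demo.txt conv=notrunc 2>/dev/null  # no truncation: the file now contains "HIllo world"

On a large file, that initial truncation is presumably what leaves gzip with nothing left to read, which would match the failures reported above.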

8

It's not so much that gzip and bzip2 overwrite the original. Rather, they write the compressed data to disk as a new file, and if that operation succeeds, they unlink the original uncompressed file.

If you have sufficient RAM, you could write a script to temporarily compress the file in a tmpfs filesystem, then remove the original on disk and replace it with the compressed version. Maybe something like this:

# some distributions mount /dev/shm as tmpfs; replace with bzip2 if you prefer
if gzip -q9c /full/disk/somefile > /dev/shm/somefile.gz
then
    rm -f /full/disk/somefile && mv -i /dev/shm/somefile.gz /full/disk
fi

Just be mindful of your memory usage, since tmpfs is essentially a RAM disk. A large output file could easily starve the system and cause other problems for you.
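
If you go this route, a rough pre-flight check along these lines might save you from filling tmpfs halfway through. This is only a sketch with hypothetical paths; it assumes GNU stat and df, that /dev/shm really is tmpfs, and it conservatively compares against the uncompressed size.

file=/full/disk/somefile
need=$(stat -c %s "$file")                           # uncompressed size: a conservative upper bound
avail=$(df --output=avail -B1 /dev/shm | tail -n 1)  # free bytes in tmpfs

if [ "$avail" -gt "$need" ]; then
    gzip -q9c "$file" > /dev/shm/somefile.gz &&
        rm -f "$file" && mv -i /dev/shm/somefile.gz /full/disk
else
    echo "not enough free space in /dev/shm" >&2
fi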

James Sneeringer

That's just crazy enough to work. – Andrew Lambert – 2012-01-14T05:38:01.117

I like to push the envelope. – James Sneeringer – 2012-01-15T05:21:39.507

3

There is no tool that works this way, for precisely the reason you give. Few people are willing to write a tool that deliberately implements risky behavior.

Ignacio Vazquez-Abrams

I was hoping that it would be an unsafe, non-default option to a utility. Could you think of an alternative? Is there a way to truncate a file in place to, e.g., remove the first 2 GB? That would let me use my limited free space to compress one chunk at a time, shrinking the source file as I went. – Lee – 2012-01-14T03:03:29.487

There's really no sane way to remove data from the beginning of a file on any filesystem, with any tool. – Ignacio Vazquez-Abrams – 2012-01-14T03:12:17.003

But you can remove data from the end of the file. It can be done in principle. You slice data off the end of the file to put in separate files, truncating the original file as you go. Then you compress the files in forward order, deleting them as you go. It would be a pain to implement and if anything went wrong you'd be screwed. But it's possible. – David Schwartz – 2012-01-14T06:07:08.073
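
For what it's worth, a rough sketch of that slice-from-the-end idea might look like the following. Everything here is hypothetical (the file name, the 2 GiB chunk size); it assumes GNU coreutils and that nothing else writes to the file, and it compresses each slice as soon as it is cut rather than all at the end, so free space only ever needs to hold about one chunk plus its compressed counterpart.

#!/usr/bin/env bash
set -euo pipefail

file=bigfile
chunk=$((2 * 1024 * 1024 * 1024))   # 2 GiB slices; pick whatever your free space allows

size=$(stat -c %s "$file")
i=0
while [ "$size" -gt 0 ]; do
    start=$(( size > chunk ? size - chunk : 0 ))
    tail -c +"$((start + 1))" "$file" > "part.$i"   # copy the tail of the file into a part
    truncate -s "$start" "$file"                    # then chop those bytes off the original
    gzip "part.$i"                                  # gzip removes part.$i once part.$i.gz exists
    size=$start
    i=$((i + 1))
done
rm -f "$file"   # the original is now empty

# part.0.gz holds the end of the original and part.$((i-1)).gz the beginning,
# so restore with the highest index first:
#   for ((j = i - 1; j >= 0; j--)); do zcat "part.$j.gz"; done > bigfile

If anything is interrupted you are left with a mix of parts and a truncated original, so, as the comment says, this is only for situations where there is truly no other option.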

1

The split and csplit commands could be used to split the large file up into smaller parts, and then compress them individually. Reassembling would be rather time-consuming, though.
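
A minimal sketch of what that might look like, with hypothetical file names (and note the caveat in the comments below: the split step itself still needs as much free space as the original file):

split -b 1G bigfile bigfile.part.   # produces bigfile.part.aa, bigfile.part.ab, ...
rm bigfile                          # only once the split is verified complete
gzip bigfile.part.*                 # compresses each piece to bigfile.part.xx.gz
# Reassemble later; concatenated gzip members decompress back in order:
#   cat bigfile.part.*.gz | gunzip > bigfile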

Brian

Splitting the file with split or csplit has the same problem as the original question, namely that you'd have to have enough free disk space to hold the original file, plus all the split files. – Brian Minton – 2020-02-10T23:03:58.697

Another good option. One could probably write a script to do this. However, this yields many separately compressed files that will need to be re-concatenated after decompression, which is not so nice. – sleske – 2012-01-17T15:38:03.553