Tar a very large file directly to a drive


I was running my code to download some data on the Amazon cloud. The instance I was running had 8 GB of storage that I bought with it, plus 140 GB of extra instance storage (attached more or less like a plugged-in hard drive). I downloaded my data onto this extra 140 GB volume, which is now nearly full.

I now want to "tar" this data and put into Amazon S3 (cloud storage - which I have already mounted onto my instance) so that I can download it. Now the problem is that tar on ubuntu(which is set up on that instance) creates some temporary files in the 8GB storage(which was the partition on which ubuntu is installed) and since it is not enough space(even for the tar file) is creates an incomplete tar in the S3. Could you suggest me a way out.

I tried copying the file to S3 and then splitting it so that I could tar the smaller pieces (I have another instance and I know I can tar around 70 GB there). But even cp creates a sort of temporary copy. Any way out?

user533550

Posted 2014-04-20T15:34:35.210

Reputation: 31

Is your network connection stable enough that you do not need the tar file on the Amazon side? If it is, then you can use netcat to untar directly from the source on your own system. (On the receiving end, nc -l 4321 | tar -xf - : tar extracts from stdin, and stdin is filled by netcat listening on port 4321.) Then on the sending side, netcat to port 4321 on your host, something like nc IP.IP.IP.IP 4321 < mytarball.tar. – Hennes – 2014-04-20T15:42:13.580
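A minimal sketch of that netcat idea, adapted so that no intermediate tarball ever has to exist on disk (the data path, port, and RECEIVER_IP below are placeholders, not values from the thread):

    # On the receiving machine (your own system), listen and extract as data arrives:
    nc -l 4321 | tar -xf -

    # On the EC2 instance, pipe tar's output straight into netcat,
    # so nothing is written to the local 8 GB partition:
    tar -cf - /mnt/data | nc RECEIVER_IP 4321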

No, I don't think I have such a network. Besides, I have 40 such instances and I don't think this would be possible for all of them. Thus I need to create a tar of the file in S3. Thanks for your help, btw. Is it possible to tar half the file in one pass and the other half in another? – user533550 – 2014-04-20T15:50:17.853

You say S3 is "mounted" -- so it sounds like you are using s3fs, and it seems likely that s3fs is what's using temp files, not tar and cp. If you enabled the disk cache in s3fs, the obvious first thing to try would be disabling it. If s3fs still needs temp files even with the cache disabled, then you need a different approach: something that can stream the tarball directly to S3 rather than going through s3fs. Or mount an EBS volume, or another ephemeral disk if you have one, for the temp files to live on while you need them. – Michael - sqlbot – 2014-04-20T18:32:24.877
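One way to stream the tarball straight to S3 without s3fs in the middle (my own suggestion of a concrete tool, not something the commenter specified) is a reasonably recent AWS CLI, which accepts - as the source to upload from stdin; the bucket and key names below are placeholders:

    # Pipe tar's output directly into an S3 multipart upload; nothing touches local disk.
    tar -czf - /mnt/data | aws s3 cp - s3://my-bucket/data.tar.gz

    # For a stream in the 140 GB range, giving the CLI a rough size up front
    # (in bytes) helps it pick a part size that stays under the 10,000-part limit:
    tar -czf - /mnt/data | aws s3 cp - s3://my-bucket/data.tar.gz --expected-size 150000000000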

I will check disabling the disk cache on s3fs. Meanwhile I employed an ad-hoc solution: I wrote a script that breaks the larger file into ~4 GB chunks (as I had around 5 GB of instance space left) and writes them into s3fs. I had to find a quick solution. – user533550 – 2014-04-21T18:53:09.810
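For reference, a sketch of that ad-hoc chunking approach (the file name and s3fs mount point are placeholders); the pieces can be re-joined later with cat:

    # Split the large file into ~4 GB pieces written to the s3fs mount,
    # so no single piece exceeds the ~5 GB of free instance storage.
    split -b 4G /mnt/extra/bigfile /mnt/s3/bigfile.part-

    # Later, on another machine, reassemble the original file:
    cat bigfile.part-* > bigfile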

No answers