
I have a dataset which is quite big - a few petabytes.

Untarring it into a new filesystem takes a couple of days. The problem is that the server does not allow me to run a process for more than 24 hours; it gets killed whether I want it or not.

This precludes me from using screen or any other sort of process backgrounding.

I would need to be able to resume the tar process, because it will get killed, no matter what.
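For concreteness, this is the kind of restartable invocation I have in mind; it is only a sketch, assuming GNU tar 1.28 or newer for --skip-old-files, and the paths are placeholders:

    # Rerun the same command after each 24h kill; files already extracted are
    # skipped, so each run only writes what the previous run didn't finish.
    # Note that tar still reads the archive from the beginning on every run.
    tar -xf /path/to/dataset.tar -C /scratch/dataset \
        --skip-old-files \
        --checkpoint=100000 --checkpoint-action=echo

One caveat: the file being written at the moment of the kill is left truncated, and --skip-old-files would then skip it on the next run, so the most recently extracted file should be deleted (or size-checked) before resuming.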

  • How is this time limit enforced? How do processes get exempted from that? The obvious answer is to split your tar file into smaller chunks that *can* be done in a day, but it sounds like a silly limitation. – bodgit May 07 '19 at 10:50
  • @bodgit I can't split that file for the very reason I don't have time for it. This is a supercomputer where I work, and no user process is allowed to run longer than that. – Alexandre Strube May 08 '19 at 12:43
  • Your problem is that a tar file is a stream of data ((T)ape (AR)chive, remember); you can only find specific files by starting at the beginning and reading through until you find what you're after. So if you wanted to resume extracting from a given file, you'd still have to reread everything up to that point to find that file. Putting your dataset into a different archive format that allows indexed access, or unpacking it elsewhere so you can use something like rsync to do the copy, is your likely answer. Do the admins of your supercomputer not have an "out of band" way of importing datasets? – bodgit May 08 '19 at 13:06
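Following bodgit's rsync suggestion, a hedged sketch of what a resumable copy could look like, assuming the dataset has already been unpacked on some staging host (the host name and paths here are placeholders) and the same command is simply rerun after each kill:

    # Rerun until it finishes inside the 24h window; --partial keeps
    # interrupted files so rsync can reuse the data already transferred
    # instead of starting that file from scratch, and files that already
    # match on the destination are skipped entirely.
    rsync -aH --partial --progress staging:/data/dataset/ /scratch/dataset/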
