35

I need to deploy an automated process (via 1 min cron script) that looks for tar files in a specific directory. If a tar file is found, it is untarred to the appropriate location and then the tar file is deleted.

The tar files are automatically copied to this server over SSH from another server. In some cases, the tar files are extremely large, with lots of files.

The problem I expect to run into: if it takes more than 1 minute for the tar file to be copied to the server, the cron script, which runs once every minute, will see the .tar.gz file and try to untar it even though the file is still being written.

Is there any way (via bash commands) to test if a file is currently being written to, or if it's only a partial file, etc?

One alternative I was thinking of was to have the file be copied as a different file extension (like .tar.gz.part) and then renamed to .tar.gz after the transfer is complete. But I figured I'd try to figure out if there is simply a way to determine if the file is whole at the command line first... Any clues?

Jake Wilson
    How exactly is the file being transferred? For example, `rsync` uses a temporary filename during the transfer (by default), and only _after_ the file is completely transferred, renames it to the actual filename. – Piskvor left the building Mar 27 '14 at 15:07

7 Answers

16

Your best bet is to use lsof to determine if a file has been opened by any process:

#  lsof -f -- /var/log/syslog
COMMAND   PID   USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
rsyslogd 1520 syslog    1w   REG  252,2    72692 16719 /var/log/syslog

You can't easily tell if it's in the process of being written to, but if it is being written to, it MUST be open.


Edit: let's solve the actual problem here rather than try to implement the proposed solution!

Use rsync to transfer the file:

○ → rsync -e ssh remote:big.tar.gz .

This way, the file isn't copied over top of the existing one. It is copied into a temporary file (.big.tar.gz.XXXXXX) until the transfer is complete, then moved into place.
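The same write-then-rename idea works for any sender that controls how the file lands on disk. Below is a minimal sketch in Python, assuming the temporary file is created in the same directory (and therefore the same filesystem) as the final name. `publish_atomically` and the `.part` suffix are hypothetical names chosen for illustration; they are not what rsync itself uses.

```python
import os
import shutil
import tempfile


def publish_atomically(src_path, dest_path):
    """Copy src_path to dest_path so dest_path never holds partial data."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temporary file next to the destination so the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src_path, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is an atomic rename on POSIX when both paths are on
        # the same filesystem.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Remove the partial temporary file on any failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A cron job that only looks for names ending in `.tar.gz` would then never see the file until the copy has finished.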

MikeyB
14

You are on the right track. Renaming a file is an atomic operation, so performing the rename after the upload is simple, elegant, and not error-prone. Another approach I can think of is to use `lsof | grep filename.tar.gz` to check whether the file is being accessed by another process.

Alex
7

A bit old, but most of the answers completely miss the point of the question:

> But I figured I'd try to figure out if there is simply a way to determine if the file is whole at the command line first...

In general, there isn't. You simply don't have enough information to determine that.

Because determining that the file is closed is not the same as determining if the file is whole. For example, a file will get "closed" if the connection is lost partway through the transfer.

Only @Alex's answer got this right. And even he fell for using lsof somewhat.

To determine if the file has been fully, successfully transferred requires more data. Such as:

> One alternative I was thinking of was to have the file be copied as a different file extension (like .tar.gz.part) and then renamed to .tar.gz after the transfer is complete.

That's a perfectly fine way to communicate that the file has been fully and successfully transferred. You can also move files from one directory to another as long as you stay within the same filesystem. Or have the sender send an empty filename.done file to signal completion.

But all methods have to rely on the sender somehow signalling that the transfer has completed successfully. Because only the sender has that information.
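On the receiving side, the cron job can then act only on archives the sender has explicitly marked as finished. This is a rough sketch, assuming the marker is an empty file named after the archive with `.done` appended (for example `big.tar.gz.done`). That naming and the function `ready_archives` are illustrative assumptions, not an established convention.

```python
import os


def ready_archives(directory, marker_suffix=".done"):
    """Return paths of .tar.gz files whose completion marker exists."""
    ready = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".tar.gz"):
            # Skips in-progress names such as big.tar.gz.part and the
            # marker files themselves.
            continue
        marker = os.path.join(directory, name + marker_suffix)
        if os.path.exists(marker):
            ready.append(os.path.join(directory, name))
    return ready
```

The sender must create the marker only after the archive itself has been fully written, otherwise the signal means nothing.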

Some file formats (such as PDFs) have data in them that allow you to determine if the file is complete. But you have to open and read pretty much the entire file to find out.
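Gzip is one such format: the compressed stream ends with a trailer, so a truncated .tar.gz can be detected by decompressing it to the end. A minimal sketch, assuming Python's standard `gzip` module; `gzip_is_complete` is a hypothetical helper name, and the check reads the whole file, so it is not cheap for large archives.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1024 * 1024):
    """Return True if the gzip stream at path decompresses to its end."""
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the data; a truncated stream raises
            # EOFError before the end is reached.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

This only shows that the gzip stream is intact. It says nothing about whether the sender meant to send more files afterwards.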

lsof will just tell you the file is no longer open - it won't tell you why it's no longer open. Nor will it tell you how big the file is supposed to be.

Andrew Henle
5

The best way to do this is to use incron ("inotify cron system"). It lets you set an inotify watch on a directory, which then notifies you of file operations. In this case, you should watch the directory for a close_write event (IN_CLOSE_WRITE). That will let you run your command once the file has been closed after a write.

ricmarques
Kyle
2

It seems like lsof can detect what mode a file is open under:

lsof -f -- a_file
COMMAND   PID  USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
cat     52391 bob    1w   REG    1,2       15 19545007 a_file

See where it says 1w? That means that the file descriptor number is 1 and the mode is w, or write.
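If you want to act on that column from a script, the FD field can be picked apart programmatically. This is a small sketch, assuming the default lsof column layout shown above and command and user names without spaces. `write_openers` is a hypothetical helper, and it parses captured text rather than running lsof itself.

```python
import re


def write_openers(lsof_output):
    """Return (command, pid) pairs for rows whose FD mode is w or u."""
    writers = []
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header row and anything too short to hold an FD column.
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        # FD looks like "1w": descriptor number, then mode r, w, or u
        # (u means open for both reading and writing).
        match = re.match(r"(\d+)([rwu])", fields[3])
        if match and match.group(2) in "wu":
            writers.append((fields[0], int(fields[1])))
    return writers
```

Fed the output above, this reports `cat` with PID 52391 as a writer. Entries such as `cwd` or `txt` in the FD column have no numeric descriptor and are ignored.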

1

Using inotifywait can achieve what you're after - it has the capability to wait until a file write has finished before executing a command.

The following will continuously watch a folder for new files and execute the command in the loop when writing to the file has finished.

WATCH_DIR=/directory/to/monitor
DEST_DIR=/x/y/z

/usr/bin/inotifywait --recursive --monitor --quiet -e moved_to -e close_write --format '%w%f' "$WATCH_DIR" | while IFS= read -r INPUT_FILE; do
    # INPUT_FILE holds the full path printed by inotifywait (%w%f)
    mv "$INPUT_FILE" "$DEST_DIR"
done

For more configuration options see https://linux.die.net/man/1/inotifywatch

teeedubb
0

I use a Python script that repeatedly checks the file size until it is the same on two consecutive checks (in my case, 0.05 s between checks is enough):

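    import os
    import time

    # basepath and stage are not defined in the original snippet;
    # the values here are placeholders.
    basepath = '.'
    stage = ''
    in_dir = basepath + '/in' + stage

    sizes = {}  # filename -> size seen on the previous check

    for filename in os.listdir(in_dir):
        full_in_filename = os.path.join(in_dir, filename)
        try:
            while True:
                size_actual = os.stat(full_in_filename).st_size
                if filename not in sizes:
                    # New item: record its size and check again.
                    sizes[filename] = size_actual
                elif size_actual != sizes[filename]:
                    # Still copying...
                    sizes[filename] = size_actual
                    print(size_actual)
                else:
                    # Finished: same size on two consecutive checks.
                    sizes.pop(filename)
                    break
                time.sleep(0.05)
        except OSError:
            # The file vanished or cannot be read; skip it.
            sizes.pop(filename, None)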