
I am looking after an application that generates a large amount of data in a log file (about 5 GB a day) on a Red Hat server. The process runs 24 hours a day during the week, so there is no point in the day when the file is not being modified. However, the information added to it around midnight is not particularly important, so it is OK if I lose, say, a few seconds of data during that period.

In order to make "safe" archives of the log file on a daily basis, I've created a script that does the following at some point in the early morning:

  • makes a copy of the file to a local folder
  • truncates the "active" file
  • tars and compresses the copy
  • moves the tar.gz of the copy to our archive space

Here is the script itself in case there are glaring problems with it:

# Timestamp for the copy
DF=$(date +"%Y%m%d_%H%M%S")
TARGET="fixdata-logs-$DF"

# Make a dated copy of the log directory
cp -r ./fixdata/logs "$TARGET"

# Truncate the original log files
find ./fixdata/logs -name '*.log' -exec sh -c ': > "$1"' _ {} \;

# Tar and compress the copied log files
tar -zcvf "$TARGET.tar.gz" "$TARGET"

# Delete the labelled copy
rm -rf "$TARGET"

# Move compressed files older than 3 days to the archive space
# ($ARCHIVE_DIR is set elsewhere)
find . -type f -mtime +3 -name '*.gz' -exec mv {} "$ARCHIVE_DIR" \;

(I understand that some data may be lost, but this script runs at a time when a few seconds of data loss is not important.)

The problem is that during this period the application frequently reports errors related to system resources. For example, the heartbeat monitor of its queue often fails to produce regular heartbeats. It is clear that this copy -> tar.gz -> move process puts enough I/O load on the server to affect application behaviour.

How can I reduce the impact of this script? Time-to-finish is not important: if a solution takes longer but does not cause application errors, that's preferable to something quick. Are there other approaches I should consider?

For completeness, I have considered the following but have doubts:

  • Skip the copy and tar directly: I'm worried that tar will have issues if the file is being modified while it is busy.
  • Copy to the archive folder first and then tar: perhaps if the compression is done on a different disk the I/O impact is lower? My worry is that the archive space we use is not suitable for disk operations like compression, as I don't think it's a traditional random access disk. I'm also not sure whether a copy to a different physical disk might actually make things worse, as I would have thought the OS has some clever way to make a local copy without physically reading all the bytes in the file. Unfortunately my *nix skills are not helping here.
  • Wait until the weekend: unfortunately the disk space on the server is not sufficient to hold a week's worth of data before archiving. Of course I can ask to increase it, but first I want to see if there are saner solutions.

2 Answers

1

You can do this with significantly reduced I/O load by not doing the copy+truncate. Instead, rename the file, then, if the process holds the log file descriptor open, do whatever is required to make it recycle its log descriptors (sending a HUP is the canonical way of doing that). If the program doesn't already have that capability, patch it so that it does.
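As a rough illustration of the rename-then-signal approach (a minimal sketch only: the log file name, the process name "fixapp", and whether your application actually reopens its log on SIGHUP are all assumptions you would need to verify):

DF=$(date +"%Y%m%d_%H%M%S")

# Rename is a metadata-only operation on the same filesystem; no data is copied
mv ./fixdata/logs/app.log "./fixdata/logs/app-$DF.log"

# Ask the application to reopen its log file; "fixapp" is a placeholder name,
# and this only helps if the application handles SIGHUP by recycling its log descriptors
kill -HUP "$(pidof fixapp)"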

By doing this, you won't have the I/O overhead of a copy on the same media (which is a simultaneous read+write), then the truncate (which may or may not be a significant load, depending on your filesystem), and then the read to tar/compress and the write load to make the archive.

Once you've renamed the log files, you can tar/compress/whatever at your leisure. To reduce I/O load further, consider doing the write side of the tar/compress direct to the archive storage -- while your archive storage may not be a typical random-access device, it'll still take a straight-up stream of data that's being compressed on the fly (even S3 can do that, with the right CLI tool).
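Continuing the sketch above, writing the compressed stream straight onto the archive storage might look something like this (assuming $ARCHIVE_DIR is a mount that accepts a sequential write):

# Read the renamed log once and write the compressed stream directly to the
# archive mount, so the compressed copy never touches the local disk
tar -czf - -C ./fixdata/logs "app-$DF.log" > "$ARCHIVE_DIR/fixdata-logs-$DF.tar.gz"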

The other thing to consider, orthogonal to the above, is using ionice. By running a program as ionice -c 3 <command>, you drop the I/O priority of the process to "idle-only" -- that is, if there is anything else on the system that wants to do I/O, your program will be bumped. This is a neat idea, but it can bite you in the behind if you've got a heavy I/O system (your program could take aaaaages to complete, because it rarely gets I/O time). In cases where you're doing such an excessive amount of unnecessary I/O already, making it "idle-only" priority is going to make the problem that much worse.

I'm also strongly suspicious that idle-only scheduling doesn't quite do what it says on the tin; I've seen small slowdowns in performance on other ("best-effort" scheduled) processes when "idle-only" programs are running, compared to when the "idle-only" process isn't running. I suspect it happens because when a program asks for I/O while the "idle-only" process is in the middle of doing an I/O operation, there's a delay until that I/O is done before the "best-effort" process's I/O operation can start. That is to say, it's still a lot better than if the "idle-only" process was running with "best-effort" priority, but it isn't the wondrous fix-all that it might seem at first glance.
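For completeness, if you do experiment with it, it is just a prefix on whichever command does the heavy work; for example, on the question's compression step (a sketch only; as noted above, this changes scheduling, not the total amount of I/O):

# Run the compression at idle I/O priority and low CPU priority
ionice -c 3 nice -n 19 tar -zcf "$TARGET.tar.gz" "$TARGET"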

womble
  • Nice answer! +1 (if I had the reputation), and I will take to heart the part about compressing directly to the archive. But it also involves messing with the application a bit, which may be a big deal for someone with my skill set. Another thing I've come across since is *ionice -c 3* - it may not reduce the I/O, but it will schedule it to reduce impact on system resources. Do you know if ionice could be a quick-win solution to the problem? – notdazedbutconfused Sep 08 '15 at 09:33
  • D'oh! I originally started to answer the question on the basis of using `ionice`, and then got sidetracked with logging. I will update the answer. On the topic of application modification, that's something you can very easily palm off onto someone who *does* have the necessary skills; either in-house devs, the vendor, or (for open source) any contractor. – womble Sep 08 '15 at 09:37
-1

Have a look at the logrotate linux utility which is available in rhel, it has compression, copytruncate and various other options and it also deals with log files that is in use by the applications just like you have. You can also try to use a ssd disk and copy the data onto that which should be the quickest and although it will still use cpu the io to a slow disk would be eliminated as long as you don't use usb.
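A minimal logrotate configuration along those lines might look like this (the path, pattern and schedule are assumptions; copytruncate accepts a small window of data loss, much like the script in the question):

# /etc/logrotate.d/fixdata  (path and pattern are placeholders)
/path/to/fixdata/logs/*.log {
    daily
    rotate 3
    compress
    copytruncate
    missingok
    notifempty
}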

BabyRage