Copied directory appears to become larger at destination

2

0

I have the following code as part of a shell script:

while [ $(ps -ef | awk '{print $2}' | grep -F "$CPPID") ]; do
    sleep 10
    awk -v "usbsize=$(/bin/df | grep -F $DEVICEMOUNTPOINTQ | awk '{print $3}')" -v "isosize=$(/bin/df | grep -F $ISOMOUNTPOINTQ | awk '{print $3}')" 'BEGIN { printf "%.1f", 100 * usbsize / isosize }' && echo "% copied..."
done

This is monitoring cp doing the following operation:

cp -a "$ISOMOUNTPOINT"/* "$DEVICEMOUNTPOINT"

And this works fine for the most part, until

90.5% copied...
94.2% copied...
97.8% copied...
101.6% copied...
102.7% copied...

Why does this exceed 100% of the size of the source? The copy is from a loop-mounted ISO to an NTFS-formatted partition on a USB flash drive. I'm guessing this is probably a filesystem thing?

What is my example missing to make the sizes match up, so that when cp completes it is 100% copied, not 103%?

Thanks.


Re: Bounty

I will award the bounty to the first person to produce a solution similar to the above code that meets the following criteria:

  • The script must be able to detect copying at a 1:1 ratio
  • The script must not display a value in excess of 100% copied, however...
  • The script must not simply cap the display at 100% copied when it exceeds it.

If the data size does indeed differ from source to destination for some reason, then I'd like a script that notices this and still displays the actual ratio copied.

Matthieu Cartier

Posted 2011-01-31T16:53:29.440

Reputation: 3 422

It's unclear what you want in the end: just to copy files via the command line and show progress? How does that relate to the title of your question? – akira – 2011-01-31T17:00:56.520

I have a working setup which already gives the progress, however it exceeds 100% copied. I want to know why that is, and how to reach one of the goals specified at the end. – Matthieu Cartier – 2011-01-31T17:02:12.033

Answers

1

Here is your code simplified and made more readable:

while ps -p $CPPID > /dev/null
do
    sleep 10
    usbsize=$(/bin/df $DEVICEMOUNTPOINTQ | awk 'NR == 2 {print $3}')
    isosize=$(/bin/df $ISOMOUNTPOINTQ | awk 'NR == 2 {print $3}')
    awk -v "usbsize=$usbsize" -v "isosize=$isosize" 'BEGIN { printf "%.1f%% copied...\n", 100 * usbsize / isosize }'
done

Your last awk line could be replaced by these two:

    percent=$(echo "$usbsize / $isosize * 100" | bc -l)
    printf "%.1f%% copied...\n" $percent

Then you could do this just before that printf statement:

if (( $(echo "$percent > 100" | bc) == 1 ))
then
    break
fi

and add wait $CPPID just after the end of the while loop. That will stop printing progress once 100% is reached.

See Process Management regarding the reliability of PIDs (they get recycled).

The problem you're seeing is probably due to using the "used" value of the destination filesystem rather than the difference in the current value from the start value.

Try adding a line like this before the while loop:

startsize=$(/bin/df $DEVICEMOUNTPOINTQ | awk 'NR == 2 {print $3}')

and change the line inside the loop to:

usbsize=$(/bin/df $DEVICEMOUNTPOINTQ | awk -v "start=$startsize" 'NR == 2 {print $3 - start}')
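The baseline idea above can be collected in one place. This is only a sketch, assuming a df that accepts -P and -k and using awk alone for the arithmetic; used_kb, percent_done and monitor_copy are hypothetical helper names, not part of the original script.

```shell
# Used space, in 1K blocks, of the filesystem holding path $1.
# -P forces one line per filesystem; -k forces 1024-byte units.
used_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# Percentage of total ($3) written since the baseline: 100 * ($2 - $1) / $3.
# Prints "n/a" instead of dividing by zero when the total is empty or 0.
percent_done() {
    awk -v start="$1" -v now="$2" -v total="$3" 'BEGIN {
        if (total + 0 <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * (now - start) / total
    }'
}

# Hypothetical monitor loop.
# $1 = PID of cp, $2 = destination mount point,
# $3 = destination usage before the copy, $4 = source size.
monitor_copy() {
    while kill -0 "$1" 2>/dev/null; do
        sleep 10
        echo "$(percent_done "$3" "$(used_kb "$2")" "$4")% copied..."
    done
}
```

One possible way to use it: record start=$(used_kb "$DEVICEMOUNTPOINT") and total=$(used_kb "$ISOMOUNTPOINT") before launching cp in the background, then call monitor_copy $! "$DEVICEMOUNTPOINT" "$start" "$total". This only removes whatever was already on the destination from the figure; it does not account for allocation overhead that differs between the two filesystems.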

Of course this might all be avoidable if you used rsync --progress instead of cp.

Edit:

Also, try this in the while loop as shown above to see what the numbers being used in the calculation are. That might provide a clue as to what's going on:

    awk -v "usbsize=$usbsize" -v "isosize=$isosize" 'BEGIN { printf "%d of %d, %.1f%% copied...\n", usbsize, isosize, 100 * usbsize / isosize }'

Paused until further notice.

Posted 2011-01-31T16:53:29.440

Reputation: 86 075

Thanks for the help, however, this needs to be portable -- bc isn't even installed on my system by default, so I'm inclined to say I can't consider it "portable". I will however try the changes you listed in a few minutes (going away for a few). Thanks! :) – Matthieu Cartier – 2011-01-31T18:04:50.103

@neurolysis: bc is just another option for doing float math which can result in shorter commands. I've just taken for granted that it's ubiquitous. AWK is fine for that purpose. – Paused until further notice. – 2011-01-31T18:12:27.660

This seems to be printing nan% for me. Can you reproduce this? – Matthieu Cartier – 2011-02-01T17:19:53.653

@neurolysis: Which part is doing that? – Paused until further notice. – 2011-02-01T18:04:06.173

I tried using the first snippet, and that occurs. – Matthieu Cartier – 2011-02-02T12:35:59.567

@neurolysis: What do /bin/df $DEVICEMOUNTPOINTQ and /bin/df $ISOMOUNTPOINTQ output? Is it one header line followed by something like /dev/xxx 1111 2222 3333 44% /mountpoint? If you add echo "[$DEVICEMOUNTPOINTQ] [$ISOMOUNTPOINTQ]" before the AWK command what does it output? By the way, I have a typo in that snippet. There's an extra double quote at the end. Try removing that. – Paused until further notice. – 2011-02-02T13:41:54.813

I'll check that tomorrow. For now, going to add a bounty for someone that can create a snippet that doesn't go over 100% without simply capping it to 100%. – Matthieu Cartier – 2011-02-04T01:46:36.470

@neurolysis: The suggestion at the end of my answer is intended to work properly without capping. – Paused until further notice. – 2011-02-04T01:53:11.640

I tested that prior to receiving nan, I still received unusual values. Should have mentioned that. – Matthieu Cartier – 2011-02-04T01:55:14.787

neurolysis: What Unix/Linux/BSD are you running that doesn't have bc installed by default? It's even a POSIX standard program. I would think you'd be having other problems too. – deltaray – 2011-02-06T20:12:41.160

@deltaray: You have to say @neurolysis so the person is notified of your comment. – Paused until further notice. – 2011-02-06T20:35:23.920

4

My first thought is that it largely depends on the type of files in the source directory, and the likely culprit is sparse files. A sparse file is one where stat.st_size > stat.st_blocks * 512 (st_blocks is counted in 512-byte units on Linux, not in units of st_blksize); that is, the overall size of the file is larger than the data blocks actually allocated to the file's inode. Any unallocated blocks are read back as blocks of zeros by the system calls. So when you use cp(1) on a sparse file, the destination file can end up with more allocated blocks (containing only zeros) than the source file. The du(1) and df(1) commands look at the number of blocks, not the size of the file(s).

Core files are often created as sparse files since they may need to map memory. Sparse files are also useful for disk images, for example a virtual host's drive of size 15GB: it would be very wasteful to allocate all the blocks at creation time, so the size (st_size) could be 15GB while the actual number of blocks starts at 0.
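To see the gap between a file's size and its allocated blocks, something like the following could be tried. It is a rough sketch that assumes GNU stat (for -c) and a filesystem that supports holes; sizes and demo_sparse are hypothetical names used only for illustration.

```shell
# Print "<apparent bytes> <allocated bytes>" for one file (GNU stat assumed).
# %s = st_size, %b = number of allocated blocks, %B = bytes per %b block.
sizes() {
    stat -c '%s %b %B' -- "$1" | awk '{ printf "%.0f %.0f\n", $1, $2 * $3 }'
}

# Create a 1 MiB file that is one big hole, print both numbers, clean up.
demo_sparse() {
    f=$(mktemp) || return 1
    # count=0 writes no data; seek= only extends the file to 1048576 bytes.
    dd if=/dev/zero of="$f" bs=1 count=0 seek=1048576 2>/dev/null
    sizes "$f"
    rm -f -- "$f"
}
```

On a filesystem that supports sparse files, the second number printed by demo_sparse should be much smaller than the first; running sizes on a file in the source and on its copy shows whether the copy filled in the holes.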

This is just one type of file that could explode when copied. Without knowing what you have in your filesystem, it's hard to say what else might be doing that.
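One way to check whether block accounting is responsible is to compare file sizes in bytes rather than used blocks. The sketch below assumes GNU find (for -printf) and counts regular files only; file_bytes and percent_bytes are hypothetical helpers, not something from the original script.

```shell
# Sum of st_size over all regular files under directory $1, in bytes.
# %.0f is used because %d can overflow in some awk implementations.
file_bytes() {
    find "$1" -type f -printf '%s\n' | awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Bytes present under destination $2 as a percentage of bytes under source $1.
# Prints "n/a" if the source holds no file data.
percent_bytes() {
    awk -v copied="$(file_bytes "$2")" -v total="$(file_bytes "$1")" 'BEGIN {
        if (total + 0 <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * copied / total
    }'
}
```

Because this counts bytes in files and ignores how each filesystem rounds files up to whole blocks or stores its metadata, a finished copy of identical files should come out at 100.0. The trade-off is that each call walks both directory trees, which may be slow on a large source.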

Arcege

Posted 2011-01-31T16:53:29.440

Reputation: 1 883

The filesystem is literally clean (the shell script also removes the MBR including partition table and sets up a new NTFS partition from scratch). Thanks for the info! – Matthieu Cartier – 2011-01-31T18:28:02.097

2

You can use rsync in local-only mode, where neither the source nor the destination has a ':' in its name, so that it behaves like an improved copy command. With the progress parameter, it displays something similar to this (source):

$ rsync -r -v --progress -e ssh root@remote-server:~/pictures /home/user/
receiving file list ...
366 files to consider
pictures/IMG_1142.jpg
 4400662 100%   32.21kB/s    0:02:13 (xfer#31, to-check=334/366)
pictures/IMG_1172.jpg
 2457600  71%   32.49kB/s    0:00:29
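For the local-only case, the call could look roughly like this. It is a sketch under two assumptions: that rsync is installed, and that it is version 3.1.0 or newer, since --info=progress2 (a single overall progress line) was added in that release. copy_with_progress is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: copy the contents of directory $1 into directory $2
# on the same machine, printing one overall progress line.
# The trailing slash on the source means "the contents of", much like dir/*.
copy_with_progress() {
    rsync -a --info=progress2 -- "$1"/ "$2"/
}
```

Note that -a also tries to preserve owners and permissions, which an NTFS destination may not be able to store; -rt (recursive, keep modification times) might be a gentler choice there.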

As this doesn't give the total percentage, another solution might be to use this script (source):

#!/bin/sh
cp_p()
{
strace -q -ewrite cp -- "${1}" "${2}" 2>&1 \
  | awk '{
    count += $NF
        if (count % 10 == 0) {
           percent = count / total_size * 100
           printf "%3d%% [", percent
           for (i=0;i<=percent;i++)
              printf "="
           printf ">"
           for (i=percent;i<100;i++)
              printf " "
           printf "]\r"
        }
     }
     END { print "" }' total_size=$(stat -c '%s' "${1}") count=0
}

In action:

% cp_p /mnt/raid/pub/iso/debian/debian-2.2r4potato-i386-netinst.iso /dev/null
76% [===========================================>                    ]

You can also have a look at "move files with progress bar", which details how to add a -g switch to cp and mv to show progress.

harrymc

Posted 2011-01-31T16:53:29.440

Reputation: 306 093

I have considered using strace, but the overhead is ridiculous. – Matthieu Cartier – 2011-02-04T22:00:17.507

I am not sure about the overhead. This is a nice script, maybe worth trying. – harrymc – 2011-02-05T08:38:34.927

It's simply not feasible to use something which appears to induce about a 30% slowdown in copying. But thanks. – Matthieu Cartier – 2011-02-09T16:55:22.327

Standard red-face saver: Why not simply cap the percentage in your script at 99.9% until it terminates? Nobody will ever figure out why it's hesitating a bit at 99.9. – harrymc – 2011-02-09T17:17:24.417