Copying files: does Windows write to disk if files are identical?

30

9

I have a directory with a lot of books in PDF format (around 2GB in size).

While reading I often leave notes and annotations in the files. Occasionally I make a backup on an external hard drive. Since I do not remember which files have been modified since the last backup, I just copy and overwrite the whole thing.

Does Windows check if files with the same name are identical (by content) before overwriting? If not, how would I approach doing exactly this?

Doflaminhgo

Posted 2019-10-13T21:37:34.937

Reputation: 411

16 No, Windows copy is not smart enough to do that. There may be some third-party copy tools that can. – Moab – 2019-10-13T21:43:49.503

SyncBackPro (2BrightSparks) can do this. It only copies changed or new files. It is a third-party tool, but top-notch, and I use it to synchronize two large folders totaling 100 GB. – John – 2019-10-13T23:23:28.223

6 I use Robocopy for such tasks: robocopy C:\source C:\destination /DCOPY:DAT /R:1 /W:1. It works for me, but if you try it, be sure to test it first on a source and destination where it does not matter if something goes wrong. – SimonS – 2019-10-14T09:53:46.643

and TotalCommander would be a good freeware third party tool for this task too – SimonS – 2019-10-14T10:30:31.710

2 Windows files have an "archive" bit that is set on write. Many backup tools will use this. – OrangeDog – 2019-10-14T13:21:13.753

Perhaps look into a "real" backup solution, to avoid even having to think about such things. – Adam Barnes – 2019-10-15T18:32:58.977

1 GoodSync Pro is my go-to Windows equivalent to rsync. This does a bitwise diff, which is what you're referring to. – Blairg23 – 2019-10-15T23:03:55.173

Answers

62

Robocopy.

Windows cannot differentiate between identical and modified files if you copy using Windows Explorer.

Windows can differentiate between identical and modified files if you copy using Robocopy, which is a file-copy utility included in Windows (Vista, 7, 8.1 and 10).

There's no need to use third-party tools.

You can save this script as a batch file and re-run it whenever you want to perform a backup:

robocopy /e c:\PDFs p:\PDFs

  • Whenever a PDF file is annotated and the changes are saved, both its Last Modified and Size attributes will change in the source folder, triggering an overwrite of the corresponding file in the destination folder the next time Robocopy is invoked.
  • Robocopy compares the Size and Last Modified attributes of all files that exist in both the source and destination. If either attribute is different, then the destination file will be overwritten. Note: these values only need to be different, not necessarily newer or larger; even if a source file has an older Last Modified or smaller Size attribute, the source will still overwrite the destination.
  • The /e (or /s) switch isn't necessary if all of your PDFs are in the root of the folder referenced in the script. If you want to include Subfolders, use /s. If you want to include subfolders and also Empty subfolders, use /e.
  • I would assign the backup drive a letter further along in the alphabet, so there's no risk of another drive being inadvertently assigned the drive letter used in the script, causing the backup to fail at some point in the future. I used P here for PDF.

That simple script is all you need.
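For readers who want to see the comparison rule from the bullets above spelled out, here is a minimal Python sketch. It is not Robocopy's implementation, only an illustration of "copy when size or Last Modified differs, in either direction". The function names and the `mtime_tolerance` parameter are assumptions made for this example.

```python
import os
import shutil


def needs_copy(src: str, dst: str, mtime_tolerance: float = 2.0) -> bool:
    """Return True if dst is missing or differs from src in size or mtime.

    mtime_tolerance (seconds) is an assumption of this sketch; it absorbs
    coarse timestamp resolution on some filesystems.
    """
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    # "Different" is enough: an older or smaller source still overwrites.
    return (s.st_size != d.st_size
            or abs(s.st_mtime - d.st_mtime) > mtime_tolerance)


def mirror_changed(src_dir: str, dst_dir: str) -> list[str]:
    """Copy new or changed files from src_dir to dst_dir.

    Returns the relative paths that were copied. Never deletes anything.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        rel_root = os.path.relpath(root, src_dir)
        os.makedirs(os.path.normpath(os.path.join(dst_dir, rel_root)),
                    exist_ok=True)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            src = os.path.join(root, name)
            dst = os.path.join(dst_dir, rel)
            if needs_copy(src, dst):
                # copy2 also copies the modification time, so the next run
                # sees matching metadata and skips the file.
                shutil.copy2(src, dst)
                copied.append(rel)
    return copied
```

Calling `mirror_changed(r"c:\PDFs", r"p:\PDFs")` twice in a row should copy nothing the second time, because only metadata is inspected and it now matches.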

Mr Ethernet

Posted 2019-10-13T21:37:34.937

Reputation: 3 563

9 Is it really wise to "skip" failed copies when making backups? – Jacob Raihle – 2019-10-14T14:19:24.010

7 Just a side note: Robocopy does not check the content, it only checks metadata. But yeah, in this case I suppose Robocopy should suffice. – Albin – 2019-10-14T16:10:46.547

4 @Jacob Raihle Good question. The retry and wait parameters have default values of 1 million attempts and 30 seconds respectively. You can always use smaller values for both, but locked files often require user intervention to unlock, and reattempts in the short term generally fail regardless. So this is personal preference, but I like to let RC race through all the files without getting hung up for a few minutes on a single one. If a file does fail, this will be reported in the job summary report anyway. In general, I find that tomorrow's backup will scoop up any file that was locked today. – Mr Ethernet – 2019-10-14T19:21:41.733

1 "overwrite every PDF on your backup drive that doesn't have a matching Last Modified timestamp": does that take into account whether the destination timestamp is newer than the source? It doesn't seem relevant in this question, but still worth knowing. – Gary – 2019-10-15T15:11:49.090

@Gary Robocopy doesn't distinguish between older vs. newer or smaller vs. bigger files, it only cares that they are different. So even if a file in the destination folder is newer and/or bigger, Robocopy will still happily overwrite it with the older and/or smaller file from the source folder. It senses that the files are different and resolves the difference by pulling data across in the same direction every time: source to destination. – Mr Ethernet – 2019-10-16T00:04:44.143

@JacobRaihle I've removed the /r:0. Specifying the number of retries is a matter of personal preference so I've left it at the default. I've never actually seen an automatic retry succeed where the initial attempt failed, particularly with scheduled overnight backups where there is no possibility of users closing any applications that may be locking files. I used /r:5 for years before eventually moving to /r:0 as my go-to. – Mr Ethernet – 2019-10-16T01:15:54.477

@wrecclesham Doing a /r:1 /w:1 is my preferred method. An overly paranoid antivirus can lock the source or destination just long enough for robocopy to fail initially. A retry a second later will usually come through just fine in that case. I agree that more retries and longer timeouts are usually pretty useless. Just slows things down. Don't forget the /MT switch which can greatly improve throughput when dealing with lots of small files. – Tonny – 2019-10-16T10:45:36.180

@Tonny good thinking. If I used antivirus software then I would use /r:1 /w:5 just to give the AV ample time to release its jaws from the file. – Mr Ethernet – 2019-10-16T10:55:18.097

6

Windows does not do this. It will however prompt you to overwrite files with the same name and you can select manually if you want to do it.

For an easier solution, use FreeFileSync to compare the folders and overwrite only changed/updated files (choose the Mirror option and select File time and size in Comparison Settings).

xypha

Posted 2019-10-13T21:37:34.937

Reputation: 2 434

4

Yes and no! Windows Explorer only checks for metadata (file size, dates etc.).

But you could use a script, e.g. PowerShell (see here), which comes with (most) Windows versions, or 3rd-party tools that let you compare/copy files using file checksums, e.g. MD5 or SHA1 hashing (see here and/or use a search engine).

I myself like to use the software Checksum Compare (see here). It lets you compare files and directories, including file checksums, and it works from a USB pen drive.

If you don't need to compare the files' content and you just want to copy "newer" files, you can use any advanced copy method like xcopy, robocopy, etc.

Note: the different hashing methods have upsides and downsides (mainly reliability vs. speed). For me MD5 is more than enough for this type of file comparison, but that's a personal preference. See here for further info on that topic.
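As a rough illustration of the checksum approach, here is a minimal Python sketch that compares two files by content. It is not the PowerShell script or the tool linked above. The function names are made up for this example, and the default of SHA-256 is an assumption; `"md5"` or `"sha1"` can be passed instead where the local Python build allows them.

```python
import hashlib
import os


def file_digest(path: str, algorithm: str = "sha256",
                chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are never loaded into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(path_a: str, path_b: str, algorithm: str = "sha256") -> bool:
    """Return True if both files have identical content according to the hash."""
    # Cheap shortcut: files of different size cannot be identical,
    # so there is no need to read them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
```

Note that both files have to be read in full whenever the sizes match, which is the cost the metadata-only tools avoid.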

Albin

Posted 2019-10-13T21:37:34.937

Reputation: 3 983

MD5 hashing is just asking for it. – user541686 – 2019-10-14T07:51:59.750

1 @Mehrdad what do you mean? – Albin – 2019-10-14T07:54:44.770

It's been broken for 15 years. – user541686 – 2019-10-14T08:02:23.857

24 @Mehrdad MD5 is broken in a security sense. In a deduplicating sense it's ideal, being fast. What's your threat model, when the OP is using it to check the uniqueness of their own files, when they have the originals right there? – Chris H – 2019-10-14T08:12:42.727

1 @Mehrdad more like 25 years. Yeah, I wouldn't use it for password hashing or for checking the integrity of a file, but for simple file comparison it's a fast and easy algorithm. And it was just an example! But technically you are right, although using e.g. SHA1 will slow down the comparison process significantly (I think around 1/5, but don't quote me on that - it all depends on the implementation and the use case). For me it wouldn't be worth it, given the relatively small chance that a collision would occur. Anyway, I included your comment in my answer, thanks. – Albin – 2019-10-14T08:33:30.873

1 @Mehrdad "Yes and no! Windows Explorer only checks for metadata (file size, dates etc.)." Are you sure Windows Explorer checks dates when copying? I just tested it and it seems to only check the file name. I've never seen it check dates before. – Mr Ethernet – 2019-10-14T09:16:06.973

SHA1 is still going to be considerably faster than reading the files from even a fast SSD. – pjc50 – 2019-10-14T09:56:07.987

10 "SHA1 is still going to be considerably faster than reading the files from even a fast SSD." - In order to SHA hash the files, you still need to read them all... – Milney – 2019-10-14T11:50:34.483

2 @wrecclesham It sort of does: when it finds conflicts in the file name, it shows you the changes in size/date and asks you to resolve them manually, either for a single case or for all similar cases. – Albin – 2019-10-14T18:11:29.497

3

In short: No

Windows doesn't do that in a straightforward way.

Well, it does, but like everything in Windows it's ambiguous at best. You will be prompted for name conflicts, and depending on your Windows version, you get a more or less understandable dialog with several options to choose from, with an additional note ("Blah blah, different size, newer"). You can then, one by one, choose whether or not to keep the modified file, and you have the option of applying this to all "identical" matches.
Now of course it's Windows, so you have no guarantee that "newer" actually means newer, and you do not know what is "identical" (is it just the name collision, is it the size change, is it the modification date, or is it everything?).

Alternatives

There exists a huge variety of file sync programs, both free and commercial, which are somewhat better insofar as they check whether a file has been modified before overwriting it. rsync is the traditional mother-of-all-tools, but it is also a tad less user-friendly than some people may wish.
However, I do not recommend any of these because they are not fundamentally making things better.

Personally, if you are not afraid of a little command line (you could always make a batch file!), I'd recommend Matt Mahoney's excellent zpaq. This is basically ZIP, except it compresses much better, and it does deduplication on the fly.

How is that better?

Well, checksum-comparing tools are all nice and that. Especially when you go over the network, nothing can beat rsync running on both ends, it's just awesome. But while a typical sync tool will do the job just fine (and better than Explorer) this is not what it's best at.

Writing to an external drive, whether or not you compare checksums, has a couple of things you need to cope with:

  • Access time on the drive (abysmal)
  • Latency over USB or what you use (getting better but still kinda abysmal)
  • Bandwidth (actually pretty good nowadays)
  • Drive writes (and amplifications)
  • Drive reads

In order to compare checksums, you first have to read in the files. Full stop. This means that for a couple of thousand files, you pay for the latency of traversing the directory structure, opening files over a high-latency link, and issuing several thousand separate reads, plus transferring the data in small units over a high-latency wire. Well, that sucks big time; it is a very expensive process.

Then, you must write the files that have changed, again with several high-latency operations such as opening files and overwriting data, and again one by one. This sucks twice: not only is it inherently unsafe (you lose the file being overwritten if your cat stumbles over the USB cable), but with modern shingled (SMR) hard drives, which many external drives are, it can also be excruciatingly slow, down to single megabytes per second if you are unlucky. That, and the latency of thousands of small transfers adding up.
A well-written file copy tool may be able to deal with the safety issue by copying a temporary file, and atomically renaming it afterwards (but this adds even more overhead!).
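The copy-to-temporary-then-rename idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `safe_copy` is a hypothetical helper name, the `.part` suffix is arbitrary, and how atomic the final rename is depends on the operating system and filesystem (Python documents `os.replace` as atomic on POSIX).

```python
import os
import shutil
import tempfile


def safe_copy(src: str, dst: str) -> None:
    """Copy src over dst without ever leaving a half-written dst behind."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # The temporary file must live on the same volume as dst,
    # otherwise the final rename would itself be a copy.
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".part")
    os.close(fd)
    try:
        shutil.copy2(src, tmp_path)   # slow part: the old dst is still intact
        os.replace(tmp_path, dst)     # fast part: swap the new file into place
    except BaseException:
        # Interrupted or failed: drop the partial file, keep the old backup.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the copy is interrupted, the previous backup copy of the file stays untouched. The price is the extra create and rename per file mentioned above.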

Now, an archive format like zpaq creates an archive that already contains the checksums of the files, so they can be read in quickly and sequentially from one location. It then compares checksums locally (locally means "on your side of the cable", where you presumably have a reasonably fast disk connected via SATA or M.2 or something), compresses the differences, and writes the compressed data sequentially, append-only, to the existing archive. Yes, this means that the archive will grow a little over time because you carry a whole history around. Alas, get over it; the cost is very moderate thanks to diffing and compression.

This method is faster, and safer at the same time. If you pull the cable mid-operation, your current backup is interrupted (obviously!). But you do not lose your previous data. All transactions that go over "slow" links are strictly sequential, large transfers, which maximizes throughput.
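To make the "checksums stored in one place, compared locally" idea concrete, here is a small Python sketch. It is emphatically not zpaq's archive format or algorithm (there is no compression, deduplication or history here). It only shows how a stored list of checksums lets you find changed files without reading the old copies back from the slow drive. The JSON manifest and the function names are assumptions made for this example.

```python
import hashlib
import json
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash one local file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_since(src_dir: str,
                  manifest_path: str) -> tuple[list[str], dict[str, str]]:
    """Compare local files against the checksums stored by the last backup.

    Returns (sorted list of new or changed relative paths, fresh manifest).
    Only the small manifest is read from the backup side.
    """
    try:
        with open(manifest_path, "r", encoding="utf-8") as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}  # first backup: everything counts as changed
    new = {}
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, src_dir).replace(os.sep, "/")
            new[rel] = sha256_of(path)
    changed = sorted(rel for rel, digest in new.items()
                     if old.get(rel) != digest)
    return changed, new


def save_manifest(manifest_path: str, manifest: dict[str, str]) -> None:
    """Store the checksums so the next run can compare against them."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
```

A backup run would call `changed_since`, write out only the files it reports, and then call `save_manifest` with the returned dictionary.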

Damon

Posted 2019-10-13T21:37:34.937

Reputation: 4 002

3 Robocopy is included in all versions of Windows since Vista and will do exactly what the OP wants. So I would say, "In short: yes... if you ask it nicely!" – Mr Ethernet – 2019-10-14T09:25:28.263

@wrecclesham: While it is true that robocopy will do the job, it is nowhere near something that someone with "normal" skill can easily grok, nor is it well-suited for the task because it is susceptible to exactly the high-latency-link problem that I pointed out (in particular because robocopy does extra work in order to be reliable, i.e. copy-then-rename). – Damon – 2019-10-14T09:31:02.470

2 1) Super User describes itself as a "Q&A for computer enthusiasts and power users". A simple Robocopy script isn't going to confuse the target audience of this site. I don't think you're giving Super User's userbase enough credit! 2) Your "high-latency link" theory doesn't apply here. The OP is backing up a tiny amount of data via USB, not via some high-latency WAN link. Only a relatively small number of modified PDFs will be included. I use Robocopy to back up ~1 TB of similar files to a USB drive and subsequent nightly runs, including only modified files, take less than 1 minute. – Mr Ethernet – 2019-10-14T10:25:42.647

@wrecclesham: I consider myself a computer enthusiast (with ~34 years of practice) but I wouldn't want to use robocopy. Alas, different people feel comfortable with different tools. But my point remains: While it is true that RC will only write a tiny amount, it must still read a lot, or rely on file modification times alone, which isn't safe (or USN journals, which aren't guaranteed to be present at all, or pristine). Its work is strictly non-sequential, which means that latency adds up. USB latency is very noticeable (millisecond range). "Milli" multiplied with "many" is significant. – Damon – 2019-10-14T10:39:25.020

"it must still read a lot... or rely on file modification times alone". Robocopy only checks metadata, which means that it hardly needs to read any data at all for skipped files, which allows it to handle identical files very efficiently. If the OP highlights a PDF, the file's timestamp will change. Unmodified PDFs will have identical Last Modified timestamps in both places. Robocopy can therefore differentiate between files to be skipped and those to be overwritten. You make an interesting point but I'm not sure I understand where the inherent risk is here. What could theoretically go wrong? – Mr Ethernet – 2019-10-14T10:52:08.620

@wrecclesham: Metadata isn't reliable. Some programs (notably "copy and archive" tools, but possibly others) deliberately tamper with them, and Explorer is very inconsistent. If I copy files around on my Win7 system on "real" disks (including iSCSI), copy gets create time "now", mod time of original, access time not changed. SMB-SMB gives identical modify time and access time changed. SMB-iSCSI is the same as HD-HD, but USB-stick is something else. Plus, some filesystems (e.g. FAT) don't even have proper resolution. So I don't know what could all go wrong, but... a lot! – Damon – 2019-10-14T11:26:04.207

Let us continue this discussion in chat.

– Mr Ethernet – 2019-10-14T11:48:39.417

@Damon Hey there, although I went with wrecclesham's answer as the accepted one (as RC will probably do the job just fine and seems to be accepted by the community), I wanted to say that your write-up is also very much appreciated, especially considering the effort you put into it :) – Doflaminhgo – 2019-10-14T13:08:38.273

1

XCOPY /D will only copy files if the source is newer than the destination. (XCOPY /S /D for a recursive copy.)

Daniel Klugh

Posted 2019-10-13T21:37:34.937

Reputation: 29

XCOPY does not take the contents of the files into consideration – Bert – 2019-10-16T13:24:49.590

0

Microsoft SyncToy

Microsoft makes a great Windows PC app called SyncToy where you specify pairs of "left folder and right folder", and choose to echo changes from left to right, contribute changes from left to right without deleting right, or synchronize between left and right. There is a user interface for previewing the changes before committing.

If a file is detected as identical, it will be skipped over, which is the functionality that you are looking for.

I have been using the Echo mode for about 10 years to incrementally mirror changes from my desktop PC to an external drive.

https://www.microsoft.com/en-us/download/details.aspx?id=15155

StalePhish

Posted 2019-10-13T21:37:34.937

Reputation: 101