What scale of data loss or corruption do I risk if I enable the write buffer on a file server?

Question

I have found plenty of articles online warning of risk of data loss or corruption for drives with write-buffer enabled in the event of power loss. However, I haven't found any that actually refer to the scale of the risk.

I'm looking to build a mirrored file server in Storage Spaces on Windows Server 2016 for the purposes of a small video-editing office. Performance is very important (hence the write-buffer consideration), and our server would handle mostly two types of important writes: Uploading footage, and saving a project or document file.

This leads me to wonder what the worst case scenario would be in the event of unexpected power loss.

For uploading footage, I would expect any interruption to the server to cause a visible network failure to any file transfer in progress. Therefore, unless the power failure occurred seconds after the network portion of the file transfer completed, they would be aware of the need to restart the file transfer once the server was back online. Since I would be aware of the server going down, I could advise the office to use a sync program to presumably overwrite any corrupted files with the local master copies.

As for saving documents and project files, most of them should be so tiny as to have minimal risk of even being in the buffer at the time of failure. And if that wasn't the case, having autosaves or an open version still on the user's computer would give them a second chance. The only risk I can really see is if the power failure occurred right as they saved and closed the file, and that program didn't store rolling autosaves.

Is my assessment accurate, or have I overlooked something? Can corruption in this situation affect more data than that which was being written?

Thanks

Edit: I should stress that I'm not particularly looking for conclusions about what I should do in this scenario. I merely want to properly understand the possibilities so I can make an informed decision on the reality of this risk.

The many web pages I've read on the issue so far have been frustratingly ambiguous, particularly in regards to differentiating between 'write caching' and 'write-cache buffer flushing'.

The obvious solution here is a UPS. Even without write cache you shouldn’t be running without one that is configured to shut the server down properly in the event of an extended power outage. — Appleoddity, Nov 20 '17 at 01:43
A UPS is certainly the first port of call, and in my plans. However there are multiple points of failure after that, including the PSU. I guess my question is in regards to informing my decision on how much preparation I should take. — Cyanara, Nov 20 '17 at 02:14
That's a tough call. Generally, a UPS is suitable for most cases. How many times have you known an OS or computer to crash after a power failure? Not many, but most of them do use write-caching. I'm not saying take chances, because there inevitably can be data loss, but in the rare event of an actual hardware failure or something it should be minimal. You'll also have a robust backup plan, of course. :) Some more advanced RAID cards have a battery on them to store the cache in memory until power is restored. — Appleoddity, Nov 20 '17 at 03:02
Yeah, that's a fair assessment in my mind. I'll be weighing it all up once I've had the chance to benchmark the new storage with and without this setting. — Cyanara, Nov 20 '17 at 03:24

John Mahowald · Answer 1 · 2017-11-28T05:10:25.153

3

Can corruption in this situation affect more data than that which was being written?

Yes. The writes could be updating the file system itself. The worst case is data loss on basically any file. The warning is unspecific because literally anything could be lost, and the impact varies depending on the application.

Don't hours spent recovering from a data loss event hamper user productivity? Take this advice and don't disable write-cache buffer flushing.

A better solution: get more and faster solid state storage until you have satisfactory performance.

Edit: to be clear, I am referring to the more aggressive option "turn off write-cache buffer flushing". "Enable write caching", on by default for many kinds of internal disks, is usually an acceptable compromise because Windows is attempting to flush the buffer and harden the writes.

edited Nov 28 '17 at 05:10

answered Nov 20 '17 at 01:27


Interesting, thank you. Do you know of any pages where I can read up on that? I had read that link already, btw. What I found unsatisfying about it is that it didn't seem to distinguish between the two write cache options. The first is "Enable write caching on the device". This is the one my question relates to. The second option is "Turn off Windows write-cache buffer flushing on the device", and I have no idea why I would ever even tick that in the first place. – Cyanara Nov 20 '17 at 02:18
Actually, the risk is to the entire volume's MFT, not just the file. If the journal says write 5 was complete (but it wasn't) and tries to roll back write 6, you're toast. – Jim B Nov 21 '17 at 03:04
These worst case scenarios are all going to take the network share offline straight away, I assume? In which case I should have the option of doing a recovery from one of the backups. Is there a middle ground where silent corruption could occur to unexpected data? – Cyanara Nov 22 '17 at 09:47
No, data corruption does not necessarily take the volume offline. There might be I/O errors, but it is quite possible the file system is intact enough that no one notices until they try to use a corrupt file. Data loss is such a nasty surprise that no one will recommend turning off write-cache buffer flushing unless the storage system is separately battery backed and assures writes will be committed to disk. – John Mahowald Nov 28 '17 at 04:43
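
One way to catch the silent corruption described in this thread is to compare checksums of the server copies against the local master copies once the server is back online. The following is only a minimal sketch under assumptions: the function names `sha256_of` and `find_mismatches` and the directory layout are hypothetical, and it presumes the masters on the workstation are still intact.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large footage files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(master_dir: Path, server_dir: Path) -> List[str]:
    """Return relative paths that are missing on the server or differ from the master copy."""
    bad = []
    for master in sorted(p for p in master_dir.rglob("*") if p.is_file()):
        rel = master.relative_to(master_dir)
        copy = server_dir / rel
        if not copy.is_file() or sha256_of(master) != sha256_of(copy):
            bad.append(rel.as_posix())
    return bad
```

Anything this returns would be a candidate for re-uploading. Running the check only after a reboot matters, because a read issued right after the write may be served from a cache rather than from the disk.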

score 1 · Accepted Answer · answered Nov 28 '17 at 15:49

You have to distinguish between enabling the write buffer and disabling buffer flushes. To fully understand the difference, let's start from the basics.

HDDs and SSDs almost universally have a private DRAM cache used to briefly store and coalesce incoming writes, greatly speeding up their write performance. As a reference, consider that a fast SATA SSD pumped >500 MB/s of sequential writes with its buffer enabled, and only ~5 MB/s with the buffer disabled. HDDs show less severe performance degradation, but still.

At the same time, if these private DRAM caches are not powerloss-protected, severe data corruption (up to losing the entire filesystem) can happen. To prevent this issue without totally destroying performance, several options exist:

- Use drives with powerloss-protected write caches (i.e., enterprise SSDs and some newer NV-enabled mechanical HDDs).
- Use a hardware RAID controller with a powerloss-protected cache, disabling the disks' private DRAM caches.
- Use cheap consumer hardware with the unprotected DRAM cache enabled, but issue periodic flushes to guarantee filesystem consistency (but not data consistency, as the performance impact would be very large).

When using software-RAID-like approaches (e.g., Linux MDRAID, ZFS, Storage Spaces, etc.), you should never disable disk caches unless you are ready to pay a very high performance cost. Rather, your best bet is to leave the write cache enabled and leave your OS/filesystem free to issue DRAM cache sync/flush commands whenever it wants. This way, you gain the performance speedup of the enabled cache without risking your entire filesystem. Please note that application data is not automatically protected: any application wanting to ensure data durability must issue periodic flushes itself (databases are a good example).
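
To illustrate what "issuing flushes itself" can look like from the application side, here is a minimal Python sketch. The helper name `durable_write` is hypothetical. It assumes the OS and drive honor flush requests, which is exactly what breaks if buffer flushing is turned off on unprotected hardware.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path, requesting a flush to stable storage before replacing the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # drain Python's userspace buffer to the OS
            os.fsync(f.fileno())  # ask the OS to flush this file's cached data to the device
        os.replace(tmp_path, path)  # swap the new file in place of the old one
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Writing to a temporary file and renaming it means a crash should leave either the old version or the new one, not a half-written mix. The `os.fsync` call is the part that costs performance, which is why applications tend to call it only for data they cannot afford to lose.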

On the other hand, you should NEVER disable DRAM cache flushing, unless you are 200% sure your drives/RAID card have a protected writeback cache. However, in this case, leaving flushes enabled would do no big harm, as almost any recent drive/card simply ignores flushes when its protected DRAM cache is in a healthy state.

So in regards to consumer hard drives, you're saying that as long as I don't disable buffer flushes, having the write buffer enabled will only be risking individual file corruption (as opposed to file system corruption)? — Cyanara, Dec 08 '17 at 01:17
Yes. More precisely, you risk losing **unsynchronized** writes. As filesystem metadata **are** synced **and** journaled, they remain consistent in spite of enabled disk buffer. — shodanshok, Dec 08 '17 at 10:57
