
Note: This is a follow-up question to Is there a way to protect SSD from corruption due to power loss?. I got good information there, but it basically centered on three areas: "get a UPS", "get better drives", or how to deal with Postgres reliability.

But what I really want to know is whether there is anything I can do to protect the SSD against metadata corruption, especially of old writes. To recap the problem: it's an ext4 filesystem on Kingston consumer-grade SSDs with the write cache enabled, and we're seeing these kinds of problems:

  • files with the wrong permissions
  • files that have become directories (for example, toggle.wav is now a directory with files in it)
  • directories that have become files (not sure of the content)
  • files with scrambled data

The problem is not so much with these things happening to data that's being written while the drive goes down, or shortly before. That's a problem, but it's expected and I can handle it in other ways.

The bigger surprise and problem is that there is metadata corruption happening on the disk in areas that were not recently written to (i.e., a week or more before).

I'm trying to understand how such a thing can happen at the disk/controller level. What's going on? Does the SSD periodically "rebalance" and move blocks around, so that old data gets rewritten even though I'm writing somewhere else? Like this:

[diagram: data blocks, including D, being moved from flash block 1 to flash block 2]

Then, if there is a power loss while D is being rewritten, there may be pieces left on block 1 and some on block 2. But I don't know if it actually works this way, or whether something else is happening.

In summary: I'd like to understand how this can happen, and whether there is anything I can do to mitigate the problem at the OS level.

Note: "get better SSDs" or "use a UPS" are not valid answers here. We are trying to move in that direction, but I have to live with the reality on the ground and find the best outcome with what we have now. If there is no solution with these disks and without a UPS, then I guess that's the answer.

References:

Is post-sudden-power-loss filesystem corruption on an SSD drive's ext3 partition "expected behavior"? This is similar, but it's not clear if he was experiencing the kinds of problems we are.

EDIT: I've also been reading about issues with ext4 that may make it vulnerable to power loss. Our filesystems are journaled, but I don't know about any other relevant settings.
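For reference, this is a sketch of how to check which ext4 journalling features and mount options are actually in effect (the device name /dev/sda1 is an assumption; adjust it to your setup):

```shell
# Assumption: the filesystem lives on /dev/sda1 -- adjust to your device.

# Show the superblock features; "has_journal" confirms journalling is on:
sudo dumpe2fs -h /dev/sda1 | grep -i 'features'

# Show the journalling mode and barrier setting of mounted ext4 filesystems.
# ext4 defaults to data=ordered; write barriers (barrier=1, the default)
# must stay enabled for any power-loss safety, so never mount with nobarrier:
grep ext4 /proc/mounts
```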

Prevent data corruption on ext4/Linux drive on power loss

http://www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-and-why-delayed-allocation-is-bad/

Yehosef
  • this [pdf document](https://cseweb.ucsd.edu/~swanson/papers/DAC2011PowerCut.pdf) might have some information; check chapter 4.1.2 – A.B Jul 30 '18 at 17:27
  • @A.B - very interesting - thanks! If you want to copy/summarize that section into an answer, I'd be happy to upvote it. – Yehosef Jul 30 '18 at 20:17
  • reading a bit more slowly, I see that the pages are the 1st and 2nd bit pages, not what I thought. So it's not 4.1.2 that matters, and I certainly won't write an answer; I don't have much knowledge on the subject – A.B Jul 30 '18 at 23:36
  • @A.B - can you explain more about why you think that's not what's happening? This is the closest hint I've found, and it sounds like it could explain what's happening. https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf doesn't talk about "retroactive data corruption" but does talk about metadata corruption, which might look the same. – Yehosef Aug 01 '18 at 10:17
  • I thought page 1 and page 2 were two separate blocks, while instead they are bit planes, both logically and physically, of the same data in MLC (each 2-bit binary value xy takes its x from plane 1 and its y from plane 2, and writing these two bits requires writing plane 1 before plane 2) – A.B Aug 01 '18 at 10:40
  • ok - well, you understand it much better than me. Do you think the metadata corruption in the other article I referenced would explain the issues I'm seeing? – Yehosef Aug 01 '18 at 10:43
  • that's the thing I would think of too, but there's not much information. – A.B Aug 01 '18 at 11:00
  • you most likely need the internal procedure for what happens when data actually reaches the controller of the SSD. Getting that might be possible under some sort of NDA with a high enough price sticker, or depending on the manufacturer it might even be publicly available, though from my understanding the SSD controller logic is exactly the "IP" in that market and is mostly hidden as much as possible. – Dennis Nolte Aug 02 '18 at 09:47
  • Yes, the SSD periodically moves data around. It's called garbage collection. – Michael Hampton Aug 02 '18 at 11:40
  • @MichaelHampton - Thanks so much - I didn't realize that happens. That makes the corruption issues much easier to understand. If you want to write that into an answer I'll be happy to upvote it. – Yehosef Aug 02 '18 at 11:49

2 Answers


Your best bet is to disable write caching on the disk, both by telling the disk itself not to cache writes (look at the hdparm and smartctl options, and hope the drive honors them) and by making the OS not buffer writes, using mount options like sync and dirsync.
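A minimal sketch of both steps (assuming the SSD is /dev/sda, the data partition is /dev/sda1, and it is mounted at /mnt/data; adjust all three to your setup):

```shell
# Assumption: SSD at /dev/sda, ext4 data partition at /dev/sda1 on /mnt/data.

# Ask the drive to disable its volatile write cache. Some firmwares
# silently ignore this, so verify afterwards:
sudo hdparm -W0 /dev/sda

# Report whether the drive now claims write caching is off:
sudo hdparm -W /dev/sda

# Remount with synchronous file and directory writes so the OS does not
# buffer data or metadata updates (expect a large performance hit):
sudo mount -o remount,sync,dirsync /dev/sda1 /mnt/data

# To make this persistent, use the same options in /etc/fstab, e.g.:
# /dev/sda1  /mnt/data  ext4  defaults,sync,dirsync  0  2
```

Note that `hdparm -W0` only affects the drive's volatile cache setting until the next power cycle on some models, so it is worth re-applying it at boot.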

Baruch Even
  • your option is valid to make certain that the writes reach the SSD, but it won't help if the SSD controller itself is caching or doing some other $magic before writing, which is how I read the question and what A.B. and Yehosef were discussing in the comments. – Dennis Nolte Aug 02 '18 at 09:45
  • That's why I recommend also telling the disk not to write-cache internally, and praying that it actually listens. I know some SSD firmwares will ignore this request, and then you are left with no guarantee. At that stage there is nothing you can do to protect yourself besides replacing the SSD, which the OP says he can't. – Baruch Even Aug 02 '18 at 10:19
  • To make it ultimately clear: if the disk doesn't honor the request to avoid internal write caching, there is nothing anyone can do to make things work in the face of power outages. The disk is simply not suitable for the task at hand. – Baruch Even Aug 02 '18 at 10:21
  • Thanks @BaruchEven. Does "your best bet" mean "with write caching enabled you'll see these kinds of errors, so you should disable it," or "we don't know what's going on, so you may as well try that"? I'm fine with either; I just wasn't sure. – Yehosef Aug 02 '18 at 11:35
  • As per the above comments, if you do not disable it, nothing will help you fully; it will only reduce the impact slightly. Even if you do disable the write cache on the disk, the disk may not honor that, and then you have no solution to your problem with your current set of constraints (i.e. not replacing the disk). – Baruch Even Aug 02 '18 at 12:33

For how metadata corruption can happen after an unexpected power failure, have a look at my other answer here.

Disabling the cache can significantly reduce the likelihood of in-flight data loss; however, with SSDs like yours, data at rest remains at risk of being corrupted. Moreover, it carries a massive performance penalty (I have seen 500+ MB/s SSDs write at a mere 5 MB/s after disabling the private DRAM cache).

If you can't trust your SSDs, the only "solution" (or, rather, workaround) is to use an end-to-end checksumming filesystem such as ZFS or BTRFS in a RAID1/mirror setup: that way, any single-device (meta)data corruption can be recovered from the other side of the mirror by running a check/scrub.
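A minimal sketch of such a setup with ZFS, assuming two SSDs at /dev/sda and /dev/sdb (the pool name "tank" is arbitrary):

```shell
# Assumption: two whole-disk SSDs at /dev/sda and /dev/sdb, ZFS installed.

# Create a two-way mirror; every block is checksummed end to end, so a
# corrupted copy on one drive can be detected and repaired from the other:
sudo zpool create tank mirror /dev/sda /dev/sdb

# Periodically walk all data, verify checksums, and repair silent
# corruption from the healthy mirror side (e.g. from a cron job):
sudo zpool scrub tank

# Inspect detected/repaired errors once the scrub finishes:
sudo zpool status -v tank
```

BTRFS offers the equivalent with `mkfs.btrfs -d raid1 -m raid1` and `btrfs scrub start`.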

shodanshok
  • Thanks! This may be an ignorant question, but our machines have an SD slot; would it be possible/better to put the OS and application files (not the database/logs) on a fast, high-end SD card? – Yehosef Aug 09 '18 at 21:21
  • Would a read-only partition of an SSD also be at risk? – Yehosef Aug 09 '18 at 21:22
  • @Yehosef yes, the corruption can affect even a read-only partition, at least theoretically. Using an SD card for the OS can be a good idea, especially if you have two slots to combine into a RAID1 array, but you should test it thoroughly before deploying it. – shodanshok Aug 09 '18 at 21:49