
We have a group of consumer terminals that have Linux, a local web server, and PostgreSQL installed. We are getting field reports of machines with problems, and upon investigation it seems there was a power outage and now something is wrong with the disk.

I had assumed the problem would just be with the database getting corrupted, or files with recent changes getting scrambled, but there are other odd reports.

  • files with the wrong permissions
  • files that have become directories (for example, index.php is now a directory)
  • directories that have become files
  • files with scrambled data

There are problems with the database getting corrupted, but that's something I could expect. What I'm more surprised about are the more basic file system problems - for example, permissions changing or a file turning into a directory. The problems are also happening in files that did not change recently (for example, the software code and configuration).

Is this "normal" for SSD corruption? Originally we thought it was happening on some cheap SSDs, but we have this happening on a name-brand (consumer grade.)

FWIW, we are not doing an automatic fsck on unclean boot (I don't know why; I'm new here). We have UPSs installed in some locations, but sometimes the installation isn't done properly. This should be fixed, but even then people can power down the terminal uncleanly, so it's not foolproof. The filesystem is ext4.
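For reference, this is roughly what I think we'd need to change to get an automatic fsck on boot - a sketch only, since the exact mechanism depends on the distro and init system (the file names and options below are assumptions to verify):

```
# /etc/fstab - the last field (fs_passno) must be non-zero for fsck to run at boot:
# 1 for the root filesystem, 2 for other local filesystems
UUID=<root-uuid>  /      ext4  defaults  0  1
UUID=<data-uuid>  /data  ext4  defaults  0  2

# On systemd-based systems, boot-time fsck behaviour can be forced from the kernel
# command line (handled by systemd-fsck):
#   fsck.mode=force     always run fsck
#   fsck.repair=preen   automatically fix safe problems ("yes" answers yes to everything)
# On older sysvinit Debian/Ubuntu, the rough equivalent is FSCKFIX=yes in /etc/default/rcS.
```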

The question: is there anything we can do to mitigate the problem at the system level?

I found some articles referring to turning off the hardware cache or mounting the drive in sync mode, but I'm not sure whether that would help in this case (metadata corruption and files that haven't changed recently). I also read a reference about mounting the filesystem in read-only mode. We can't do that because we need to write, but we could make a read-only partition for the code and configuration if that would help.
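If it matters, the read-only-partition idea would be something like this in /etc/fstab (a sketch only; the device name and mount point are made up):

```
# hypothetical layout: application code and configuration on their own partition, mounted read-only
/dev/sda3  /opt/app  ext4  ro,noatime  0  2

# for deployments it could be remounted temporarily, e.g.:
#   mount -o remount,rw /opt/app  &&  <run update>  &&  mount -o remount,ro /opt/app
```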

This is example output from `sudo hdparm -i /dev/sda1` on one of the affected drives:

Model=KINGSTON RBU-SMS151S364GG, FwRev=S9FM02.5, SerialNo=<deleted>
Config={ Fixed }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=125045424
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes:  pio0 pio3 pio4
DMA modes:  mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=yes: disabled (255) WriteCache=enabled
Drive conforms to: Unspecified:  ATA/ATAPI-3,4,5,6,7
Yehosef
  • You can buy better SSDs. Typical enterprise SSDs have built in capacitors to provide enough power to the device to finish writing out in-flight data in the event of a power failure. The money you save by not having to recover from a totally scrambled filesystem will easily justify the modest additional cost. – Michael Hampton Jul 29 '18 at 12:50
  • @MichaelHampton - we have >6k installations already. It's not simple to replace them all, especially when most are not failing (either the UPS is working or they are not having power issues.) – Yehosef Jul 29 '18 at 13:46
  • Well, nobody said you had to replace _all_ of them. But you could use the better SSDs for replacements and/or new installations. – Michael Hampton Jul 29 '18 at 13:49
  • Many of the drives that are failing are https://www.kingston.com/en/ssd/system-builder - I'm not sure if these are considered "cheap". But almost all SSDs can scramble data (see https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf). My question is: for the ones we have in the field that we are not going to replace, is there anything we can do to minimize the damage? – Yehosef Jul 29 '18 at 13:56
  • "It's not simple to replace them all" - it totally is. Start by telling the guy making the purchase decision that he is liable for the cost due to gross neglect and incompetence. Someone made quite a substantial mistake by not being even borderline competent. – TomTom Jul 29 '18 at 14:09
  • `WriteCache=enabled`. This is a huge problem. Write cache should *never* be enabled on hard drives that have a database. Some vendors, HP for example, actually prevent enabling hard drive write caching for this very reason. – Greg Askew Jul 29 '18 at 14:24
  • @GregAskew - I hear the point and I agree. The issue is that a large fraction of the failure cases are not connected to database corruption but to the weird FS issues. Would disabling the write-cache help in those cases (meaning, not in-flight writes)? – Yehosef Jul 29 '18 at 14:53
  • @TomTom - If I could change the reality, I would. But I'm trying to mitigate the problems with the current setup. Telling me to change it doesn't answer my question. Unless your answer is "it's hopeless - there is nothing you can do." – Yehosef Jul 29 '18 at 14:56
  • @Yehosef: Yes, it would. Hard drive write caching problems are not limited to databases; it's just that databases are more susceptible because a single byte of corruption can trash the entire database, and databases frequently have open transactions/uncommitted data. I recommend you email the customers of your 6,000 terminals, informing them that the database was misconfigured, and recommend that they disable hard drive write caching. Before doing that, you may want to consult with your company's legal counsel on the best strategy for handling this, but in my experience it's best to be up front. – Greg Askew Jul 29 '18 at 15:06
  • @Yehosef note that disabling write caching in the OS will not fix the fact that your drive corrupts data on power loss. For the sake of higher speed and durability consumer grade SSDs may not write data to non-volatile memory when you write to a file, and unfortunately there's no *hardware* mechanism for the drive to take the data from the volatile cache to non-volatile storage on power failure, only enterprise SSDs can do that. Believe it or not I was in a similar situation where somebody bought a lot of consumer SSDs, our supplier who quoted this hardware had no idea this would happen. – jrh Jul 29 '18 at 17:30
  • @GregAskew - I understand that the write cache isn't connected only to the database, but could I get corruption on some static file I changed last week? Or would disabling write-caching help with meta-data corruption? – Yehosef Jul 30 '18 at 07:34
  • FYI - I added a follow-up question to this focusing more about the meta-data loss and how that happens. https://serverfault.com/questions/924054/how-does-ssd-meta-data-corruption-on-power-loss-happen-and-can-i-minimize-it – Yehosef Jul 30 '18 at 08:40

3 Answers


When suddenly losing power, MLC/TLC/QLC SSDs have two failure modes:

  • they lose the in-flight and in-DRAM-only writes;
  • they can corrupt any data-at-rest stored in the lower page of the NAND cell being programmed.

The first failure condition is obvious: without power protection, any data which is not on stable storage (ie: the NAND itself) but only in the volatile cache (DRAM) will be lost. The same happens with classical mechanical disks (and that alone can wreak havoc on a filesystem which does not properly issue fsyncs).

The second failure condition is specific to MLC+ SSDs: when reprogramming the high page bit to store new data, an unexpected power loss can also destroy/alter the lower bit (ie: previously committed data).

The only true, and most obvious, solution is to integrate a power-loss-protected DRAM cache (generally using batteries/supercaps), as has been done since forever by high-end RAID controllers; this, however, increases drive cost/price. Consumer drives typically have no power-loss-protected caches; rather, they use an array of more economical solutions such as:

  • a partially protected write cache (ie: Crucial M500/M550/M600+);
  • a NAND changes journal (ie: Samsung drives, see the SMART PoR attribute; a quick smartctl check is sketched just after this list);
  • special SLC/pseudo-SLC NAND regions to absorb new writes without putting previous data at risk (ie: Sandisk, Samsung, etc).
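As a quick health check on units already in the field, most SSDs expose SMART counters for unsafe shutdowns; the attribute names and IDs vary by vendor, so treat the ones below as examples to verify for your specific model:

```
# dump all SMART attributes for the drive
smartctl -A /dev/sda

# attributes commonly related to power loss (vendor-dependent):
#   174  Unexpected_Power_Loss_Count / Unsafe_Shutdown_Count
#   192  Power-Off_Retract_Count
#   235  POR_Recovery_Count (Samsung)
smartctl -A /dev/sda | grep -Ei 'power|unsafe|por'
```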

Back to your question: your Kingston drives are ultra-cheap ones, using an unspecified controller and with basically no public specs. It does not surprise me that a sudden power loss corrupted previous data. Unfortunately, even disabling the disk's DRAM cache (with the massive performance loss that entails) will not solve your problem, as previous data (ie: data-at-rest) can, and will, be corrupted by unexpected power losses. If they are based on the old SandForce controller, even a total drive brick can be expected under the "right" circumstances.

I strongly suggest reviewing your UPS setup and, in the mid-term, replacing these aging drives.

A last note about PostgreSQL and other Linux databases: they will not disable the disk's cache and should not be expected to do so. Rather, they issue periodic/required fsyncs/FUAs to commit key data to stable storage. This is the way things should be done unless a very compelling reason exists (ie: a drive which lies about ATA FLUSHes/FUAs).
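For reference, these are the PostgreSQL settings involved in that behavior (shown with their defaults; they should normally be left alone):

```
# postgresql.conf - durability-related settings (values shown are the defaults)
fsync = on                    # issue fsync so committed WAL reaches stable storage
synchronous_commit = on       # wait for the WAL flush before reporting a commit as done
full_page_writes = on         # protects against torn pages from partial writes
wal_sync_method = fdatasync   # the default on Linux; check pg_test_fsync before changing
```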

EDIT: if possible, consider migrating to a checksumming filesystem such as ZFS or BTRFS. At the very least consider XFS, which has journal checksumming and, lately, even metadata checksumming. If you are forced to stay on EXT4, consider enabling auto-fsck at startup (fsck.ext4 is very good at repairing corruption).
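A rough sketch of what that could look like on the existing ext4 installations (offline operations; verify against your e2fsprogs version, as metadata_csum needs a reasonably recent one):

```
# see which features the filesystem already has
tune2fs -l /dev/sda1 | grep -i features

# enable metadata checksums on an existing ext4 filesystem
# (filesystem must be unmounted; run a full fsck before and after)
e2fsck -f /dev/sda1
tune2fs -O metadata_csum /dev/sda1
e2fsck -f /dev/sda1

# force a full check at every mount (heavy-handed, but catches corruption early)
tune2fs -c 1 /dev/sda1
```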

shodanshok
  • Excellent answer. Please see my related question https://serverfault.com/questions/924054/how-does-ssd-meta-data-corruption-on-power-loss-happen-and-can-i-minimize-it - if you want to copy/adapt this answer there I'd be happy to upvote/select it. It sounds like disabling the write-cache would help only for the first case. Do you have more details on the second failure mode? Is it connected to rebalancing/garbage collection or just proximity? – Yehosef Aug 09 '18 at 11:49
  • @Yehosef Give a look here, in the "power loss" section: https://www.anandtech.com/show/8528/micron-m600-128gb-256gb-1tb-ssd-review-nda-placeholder – shodanshok Aug 09 '18 at 13:15
  • The problem with any software solution is that many SSDs outright lie to the operating system about whether or not the data is safely stored, including in response to fsync/FUA commands. For enterprise drives that have sufficient energy storage to complete the flush of their cache when power is cut, this isn't a problem. – BeowulfNode42 Sep 16 '18 at 12:00
  • @BeowulfNode42 ATA barriers and FUAs are *required* to be honored. While in the IDE/PATA days some drives faked flushes, nowadays any such "liar" drive is not SATA/SAS compliant, and should immediately be tossed away. – shodanshok Sep 16 '18 at 13:17
  • and yet those non-compliant drives are sold anyway, particularly in the consumer market segment. – BeowulfNode42 Sep 17 '18 at 00:36
  • @BeowulfNode42 Can you point at some evidence? *All* the drives I have tested (both HDDs and SSDs) in the last 10+ years correctly honored ATA FLUSHes / FUAs – shodanshok Sep 17 '18 at 07:36
  • Try the Intel 520 vs Crucial M500 listed on https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/ - similar-era tech with similar general performance, but the Intel is able to fsync 19 times faster... – BeowulfNode42 Sep 17 '18 at 11:35
  • Thanks for the link. Intel 330/335/520 were universally recognized as *bad* disks from a data corruption standpoint. They even fail with the DRAM cache **disabled**, as you can read [here](https://nordeus.com/blog/engineering/power-failure-testing-ssds/). In other words, they had to be avoided at all costs. In retrospect, it is really incredible how Intel sold drives so broken and unreliable. As a comparison, an old Samsung 840 behaves correctly in the linked test. – shodanshok Sep 17 '18 at 17:58
  • @shodanshok I am interested in case 2 - Is there any further reading about this? How does having a power-protected SSD solve the second point? – jj172 Apr 26 '19 at 04:14
  • I think I found some info about it here: https://www.micron.com/~/media/documents/products/white-paper/ssd_power_loss_protection_white_paper_lo.pdf – jj172 Apr 26 '19 at 04:21

Yeah. Don't get super cheap SSDs - anything outside the low-end consumer market has capacitors and full protection against power loss. And it really does not cost that much more.

TomTom
  • They are Kingston - so I don't know if those are considered cheap or it's a defective lot. The bigger problem is that the units (~6k) are already in the field and most are not failing (perhaps just because they haven't had a power loss). So replacing them is an expensive last resort which we haven't hit yet. – Yehosef Jul 29 '18 at 13:52
  • added drive info to question. – Yehosef Jul 29 '18 at 13:59
  • They are super cheap. They are price oriented end user drives. Look for small enterprise drives. READ THE SPECS. Generally Power Failure protection is something that is in the spec. – TomTom Jul 29 '18 at 14:07
  • To add to @TomTom - sometimes it isn't actually called Power Failure protection - and sometimes Power Failure protection isn't really truly power failure protection! You have to do some reading for each manufacturer and find out what they call it for their particular brand of enterprise SSDs. (Look, for each mfr, for white papers they've written on how truly superior their own enterprise SSDs are.) And, I have found that, at least for single purchases, it _does_ cost quite a bit more. But I do not do bulk purchases and it could be different for quantities of 100 or more, I suppose. – davidbak Jul 30 '18 at 01:55
  • From what I've read so far, these manufacturers name this feature as follows: Kingston = "Pfail" as on the DC400 series; Samsung = "Power Loss Protection"; Intel = "Enhanced Power Loss Data Protection"; Sandisk = "Data loss protection with power fail protection". I don't know what other manufacturers call it, but in-depth reading of spec sheets is required. Note it can also be achieved with firmware if the manufacturer provides it. If you really have >6000 of them I would contact Kingston, explain the situation and offer to pay for the firmware per drive. – BeowulfNode42 Jul 30 '18 at 06:37
  • @BeowulfNode42 - I'm surprised it can be done with firmware, especially without loss of performance. My understanding was that this kind of power loss protection requires supercaps (at a minimum, early enterprise SSDs actually had batteries), thus, actual hardware. – davidbak Aug 01 '18 at 22:31
  • @davidbak If using just firmware to protect data it WILL sacrifice some performance. However, it will still be faster than a mechanical drive, so the obvious choice is to protect your data and still have something faster than a mechanical HDD. Alternatively, you can pay the small amount extra when making the original purchase to buy one with caps (they don't have to be "super", just enough of them to flush the write cache). However, replacing a lot of SSDs that have already been purchased and installed is a significant cost. – BeowulfNode42 Aug 01 '18 at 23:06
  • Some manufacturers use the term "data at rest", as in "Power-Loss Protection (data at rest)", which is not true PLP. "Data at rest" is data that's already stored; "data in flight" is what matters - that's the data in the middle of being processed by the drive, and cheap drives don't protect it. So be aware of marketing in the specs. – arekm Dec 14 '18 at 08:59

The first thing to do is to define recovery time and recovery point objectives. How long do you have to recover one of these terminals, and what point in time for the data is acceptable? Perhaps within a couple of hours you need to be able to recover to last week's backup.

All sorts of strange things can happen to files if in-flight writes are lost. A file system's priority is maintaining its own metadata consistency; it may not provide the same guarantees for your data. In other words, fsck isn't guaranteed to recover your data. Its job is to get you a file system that will mount.

So, power: install, configure, and test a UPS so that it shuts the system down gracefully. This gives the file system caches and the drives themselves a chance to write.

And, durability of the writes to the disks. Read PostgreSQL's reliability chapter. Use the diskchecker.pl script linked there to do a crash test and determine whether the SSDs are lying about writes reaching non-volatile storage. If there is loss, consider replacing them with SSDs known to have power loss protection.
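From memory, the crash test with diskchecker.pl looks roughly like the following; check the script's own usage text, and note it needs a second always-on machine to act as the listener:

```
# on a second machine that will keep running, start the listener
./diskchecker.pl -l

# on the terminal under test, write test data to the SSD being checked
./diskchecker.pl -s <listener-host> create /path/on/ssd/test_file 500

# ...pull the power plug on the terminal while the create step is running...

# after the terminal boots again, verify which acknowledged writes actually survived
./diskchecker.pl -s <listener-host> verify /path/on/ssd/test_file
```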

Edit: you added details that write cache was enabled. You can attempt to disable that: `hdparm -W0 /dev/sda` or the appropriate command for a hardware array. Reference: RHEL storage administration guide.
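To make that persist across reboots, something along these lines should work; the exact file depends on the distribution, so treat this as a sketch:

```
# check the current write cache state
hdparm -W /dev/sda

# Debian/Ubuntu: /etc/hdparm.conf
/dev/sda {
    write_cache = off
}

# or, more generically, a udev rule (e.g. /etc/udev/rules.d/69-disable-write-cache.rules):
# ACTION=="add", KERNEL=="sd[a-z]", RUN+="/sbin/hdparm -W0 /dev/%k"
```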

File system write barriers enforce the ordering of journal commits. It's not a guarantee that the data will be intact, but it's safer for a file system on a volatile cache. Although it is the default, adding the "barrier" mount option clearly documents that you value consistency over performance.
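For example, a hypothetical fstab entry that makes the choice explicit (barrier=1 is already the ext4 default):

```
# hypothetical /etc/fstab entry for the data partition, documenting that barriers stay on
UUID=<data-uuid>  /var/lib/postgresql  ext4  defaults,barrier=1  0  2
```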

Finally, the last line of defense: do a restore test to ensure you can get your application and database back to the desired point in time. This is useful for all kinds of data loss, not just power failure.

John Mahowald
  • This disk write caching is the likely answer. For some unknown reason, it seems Postgres does not disable disk write caching, which is a terrible default setting. – Greg Askew Jul 29 '18 at 13:09
  • To clarify - we have daily backups and we are syncing the data to the cloud, so the problem is less about losing Postgres data (it is a concern, but I think there are PG config options that can help). The more concerning problem is the machine becoming unusable because of the metadata weirdness. FWIW, usually the machine boots and we can connect to it, but the application fails because its files have been scrambled. – Yehosef Jul 29 '18 at 13:50
  • I'm not sure if this goes for all kinds of SSDs, but a Samsung representative told me that a UPS won't help, and only an enterprise grade drive will fix the problem. – jrh Jul 29 '18 at 17:27
  • "it seems Postgres does not disable disk write caching, which is a terrible default setting." @GregAskew Please demonstrate how to disable the DRAM cache on a consumer SSD. It can not be disabled. – TomTom Jul 29 '18 at 18:15
  • I added a mention of disabling write cache on the device. That's going to have a performance penalty. SSD power loss protection and a UPS for graceful shutdown is sufficient. – John Mahowald Jul 29 '18 at 18:29
  • @TomTom: I'm not disagreeing, but I also don't understand why write caching would even be useful on an SSD? – Greg Askew Jul 29 '18 at 20:59
  • Because of the way SSDs work. Without a write cache you would burn out the SSD a lot faster. SSD cells are large and always need to be completely written - so the ability to combine multiple small writes is crucial for SSD lifetime. Which is why you CAN NOT disable it on consumer drives (the drives lie or do not allow it) AND can not do it on enterprise drives (the drives basically can lie as they are non-volatile - they have enough energy reserves to write the DRAM out to flash). – TomTom Jul 29 '18 at 21:10
  • @TomTom Ironically, the *really* cheap consumer SSDs are now often DRAMless. Worse performance and durability ... but less cached data at risk of corruption on power loss. – Bob Jul 30 '18 at 00:31
  • @Yehosef No, not even reliable Postgres has the power of magic to recover if it sent data to the drive, the drive says “Good, got your data”, and then the drive never got around to writing that data from its internal temporary volatile cache to the actual nonvolatile storage. It is crucial to use only enterprise-quality storage where the drive or raid unit has its internal cache backed by battery or capacitor. Postgres has features (WAL file, etc.) to protect you from losing data *not yet sent* to the drive, but Postgres cannot recover data lost *inside* the drive. – Basil Bourque Jul 30 '18 at 04:40