I am helping with an interesting data recovery issue. The laptop had been running Debian 11 Linux with LUKS full disk encryption on a Samsung 970 Evo Plus 2TB SSD for many months. Suddenly, entering the LUKS password at boot stopped working.

I connected the disk to another machine and noticed that attempts to mount it caused `blk_update_request: critical medium error` messages in journalctl and an increasing `Media and Data Integrity Errors` count in the SMART attributes.

This looked like a bad sign, so I started running ddrescue to make a full clone of the disk. ddrescue failed to read only 96 sectors (~50KB) out of the entire 2TB SSD. The interesting part is that the problematic sectors appeared exactly in the 16MB area where the LUKS2 header is located (and nowhere else). Note that the LUKS header on this disk starts at the ~1GB position, and the first 2 unencrypted boot partitions are not affected (they can be mounted and the data looks good). But this failure in the LUKS header area makes all other data unrecoverable (without a LUKS header backup).
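To double-check where the unreadable sectors sit, the ddrescue mapfile can be parsed and compared against the header range. The following is only a minimal sketch: it assumes the usual mapfile layout (comment lines starting with `#`, one current-position line, then `pos size status` lines where `-` marks bad sectors), and the sample offsets, `LUKS_START`, and the helper names are hypothetical, not values from the actual disk.

```python
def bad_ranges(mapfile_text):
    """Return (pos, size) byte ranges that ddrescue marked '-' (bad sector)."""
    ranges = []
    seen_status_line = False
    for line in mapfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not seen_status_line:
            seen_status_line = True  # first data line is the current-position line
            continue
        pos, size, status = line.split()[:3]
        if status == "-":
            # base 0 lets int() accept the 0x-prefixed hex that ddrescue writes
            ranges.append((int(pos, 0), int(size, 0)))
    return ranges


def all_inside(ranges, start, length):
    """True if every range lies within [start, start + length)."""
    return all(start <= pos and pos + size <= start + length
               for pos, size in ranges)


# Hypothetical mapfile excerpt; the offsets are made up for illustration.
SAMPLE = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status  current_pass
0x41000000     +               1
#      pos        size  status
0x00000000  0x40008000  +
0x40008000  0x00000200  -
0x40008200  0x00FF7E00  +
"""

LUKS_START = 0x40000000          # assumed: LUKS partition begins at 1 GiB
HEADER_AREA = 16 * 1024 * 1024   # 16 MiB header area, as described above

bad = bad_ranges(SAMPLE)
print([(hex(pos), size // 512) for pos, size in bad])  # offset, 512-byte sectors
print(all_inside(bad, LUKS_START, HEADER_AREA))
```

With a real mapfile, the text would be read from the file and `LUKS_START` taken from the partition table of the failing disk.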

It seems suspicious that on a 2TB SSD the problematic sectors appeared exactly in the 16MB area where the LUKS header is located (and nowhere else). Is this just bad luck, or did something make it more likely for the failure to appear exactly in the LUKS header? I understand that evil malware could overwrite data in the LUKS header area, but could it cause "Data Integrity Errors"?

I am also open to any ideas on whether the failing sectors could be recovered with other methods. So far I have basically just tried running ddrescue with more retries from multiple machines.

UndercoverDog
  • I'd wager "bad luck" here, but that's really more intuition than anything. Sure, it *could* be malware, but it's very unlikely that it'd physically damage the SSD. And even if so... why? – The one who tests Aug 28 '22 at 22:25
  • I don't know how LUKS works, so possibly a silly question, but does the header get rewritten frequently? If it does, wear-levelling probably should make it not a problem, but... – TripeHound Aug 31 '22 at 15:13
  • Did you check the disk with SMART? It is unusual for an SSD to fail like that. – kelalaka Aug 31 '22 at 20:40
  • @Theonewhotests, yes, malware specifically is very unlikely, but I guess there could be some data access pattern / bug that made the failure in this area more likely. I can update the question with more details, but basically the first problematic sectors appeared in the middle of LUKS "keyslot 0", which is 258048 bytes in size, and all other bad sectors appeared within the rest of the LUKS keyslot area, which is ~16MB in size. If the failure were random, the odds of this happening look astronomical. – stackrunner Aug 31 '22 at 22:45
  • @TripeHound all problematic sectors are located in the "Keyslot area" of the LUKS header. I have not looked into it in much detail, but logically I think this area should never get rewritten during normal use (only when modifying encryption passwords). https://vhs.codeberg.page/post/external-backup-drive-encryption/assets/luks2_doc_wip.pdf – stackrunner Aug 31 '22 at 22:56
  • @kelalaka, yes, I was looking at the SMART attributes, and basically the `Media and Data Integrity Errors` attribute predictably increments when reading the problematic sectors; everything else looked quite normal. Searching for "970 evo data integrity errors" points to some similar cases, for example, https://apple.stackexchange.com/questions/401798/investigation-about-read-errors-on-my-ssd. – stackrunner Aug 31 '22 at 23:03
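
To put a rough number on the "odds" mentioned in the comments, here is a back-of-the-envelope sketch. It assumes "2TB" means 2e12 bytes and treats each failure as landing uniformly at random on the disk, which is a simplification: bad sectors on flash tend to cluster, so the 96 sectors are probably not 96 independent events. The function names and the choice of three independent failures are illustrative only.

```python
DISK_BYTES = 2 * 10**12            # assumed: "2TB" taken as 2e12 bytes
HEADER_BYTES = 16 * 1024 * 1024    # ~16MB LUKS header/keyslot area


def hit_probability(area_bytes, disk_bytes):
    """Chance that one uniformly random failure lands inside the area."""
    return area_bytes / disk_bytes


def all_hits_probability(p, independent_failures):
    """Chance that several independent failures all land inside the area."""
    return p ** independent_failures


p_one = hit_probability(HEADER_BYTES, DISK_BYTES)
print(f"one random failure inside the header area: {p_one:.2e}")
print(f"three independent failures all inside: {all_hits_probability(p_one, 3):.2e}")
```

Under these assumptions a single random failure would hit the header area with a probability of roughly 8 in a million, so even one clustered failure there would be a notable coincidence.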
