
I've recently set up a couple of new servers. This time I'm encrypting most of my partitions using dmcrypt+LUKS. However, these new servers crash very often, every few days. They are full lockups: the kernel does not respond to the keyboard and the system does not respond to ping.

According to Munin graphs and atop records, there has been no increase in resource usage. There are no relevant records in the local syslog logs, nor on our remote log host (which the new servers forward syslog to). There are no relevant netconsole messages either (the new servers forward all kernel messages to a log host using netconsole). The kernel didn't even print anything to the TTY. I asked the hosting company to perform a full hardware test, and they found nothing.

I suspect LUKS. Does anybody else experience full lockups with LUKS? The only reference I could find is http://ubuntuforums.org/showthread.php?t=2125287.

Hongli Lai
  • A full lockup, meaning it happens once the system has booted up. So LUKS has already encrypted the disk. Now, which filesystem? I recently saw an issue where wrong partition alignment on an HDD had dmcrypt struggling to write data at the edge of the disk, resulting in a lockup. Is there nothing in the logs? Any soft lockup messages, hung tasks, or stack traces? Does the server respond to the SysRq key combo? – Soham Chakraborty Jul 17 '13 at 19:11
  • Yes, the lockup happens hours or days after booting. The filesystem depends on the partition: some are ext4, some are xfs. The latest crash resulted in absolutely no logs. It's as if the server suddenly powered off. An earlier crash resulted in multiple soft lockup messages, all showing dmcrypt in the stack trace. But during most crashes, there are no messages. The system doesn't respond to SysRq. – Hongli Lai Jul 17 '13 at 23:18
  • Could you pastebin the soft lockup messages, please? Don't leave anything out; include the entire traces, from bottom to top. – Soham Chakraborty Jul 18 '13 at 01:08
  • Here are all the messages for the crash that did result in messages. This crash happened on server C. The newlines are a bit strange because these came from netconsole, but the messages are otherwise readable. At the time, we were running kernel 3.2.0. We've since upgraded server A to kernel 3.8.0, and servers B and C to 3.5.0. Despite the kernel upgrades, servers A and B crashed yesterday, but with no messages. https://gist.github.com/FooBarWidget/6028587 – Hongli Lai Jul 18 '13 at 11:28

1 Answer


I had similar problems when trying to set up Arch and Debian systems on a dmcrypt+LUKS partition. The issue always surfaced while secure-erasing the LUKS partition using the dd if=/dev/zero of=/dev/mapper/crypt1 command, after overwriting around 6-7 GB of data. It turned out to be a faulty memory module, one out of 4x4 GB.

Point 4.3 on the cryptsetup FAQ page describes how faulty memory can cause drastic corruption while writing to encrypted devices, along with related symptoms like freezing and lockups, which led me to suspect faulty memory.

If I were you I would be suspicious about how that hosting company checked their systems. Tell them to forward you the results of at least one cycle of Memtest86+ and Memtester.
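While waiting for those results, a rough userspace check along the same lines can be sketched as follows. This is only an illustration, not a replacement for Memtest86+ or Memtester: it writes random data to a scratch file, copies it, and compares checksums over several passes. The function name verify_copy and its arguments are hypothetical, and it assumes a Linux system with GNU dd and sha256sum. A mismatch hints at corruption somewhere in the RAM/CPU/disk path; a clean run proves nothing.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): write random data,
# copy it, and compare checksums of the source and the copy.
# Assumes GNU coreutils (dd with bs=1M, sha256sum).
verify_copy() {
    dir=$1       # scratch directory on the filesystem under test
    size_mib=$2  # amount of random data per pass, in MiB
    passes=$3    # number of write/copy/compare passes

    i=1
    while [ "$i" -le "$passes" ]; do
        src="$dir/stress.$$.src"
        dst="$dir/stress.$$.dst"

        # Generate the test data and duplicate it.
        dd if=/dev/urandom of="$src" bs=1M count="$size_mib" 2>/dev/null || return 2
        cp "$src" "$dst" || return 2

        # Reading via stdin keeps file names out of the checksum output,
        # so the two strings are directly comparable.
        sum_src=$(sha256sum < "$src")
        sum_dst=$(sha256sum < "$dst")
        rm -f "$src" "$dst"

        if [ "$sum_src" != "$sum_dst" ]; then
            echo "pass $i: MISMATCH"
            return 1
        fi
        echo "pass $i: ok"
        i=$((i + 1))
    done
    return 0
}
```

For example, verify_copy /mnt/encrypted 1024 10 would run ten 1 GiB passes, where /mnt/encrypted stands in for a directory on one of the LUKS-backed filesystems. The pass size is best kept large relative to free memory so that more of the RAM gets exercised.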

NOTES

Just for reference, here are some of the posts and discussions describing similar issues that I went through while searching for hints and a solution:

  • This guy had some CPU lockups reported by the watchdog processes. His issue does not seem to be related to encryption or faulty memory, but rather to a faulty CPU fan. Still, this was when I started to suspect hardware problems.
  • These guys seem to have similar symptoms, and the last sentence in the thread mentions "large amount of RAM".
  • This thread (also here) describes a soft lockup issue with kernel version 2.6.24, a long time ago, for which a patch was submitted back then. The symptoms seem similar, but the root cause for me was different. This post seems to describe the same issue too.