
At our startup, while moving our applications to a new server, the new machine (Ubuntu 16.04) hung. It still responded to pings, but ssh hung at

debug1: Entering interactive session.

and no login prompt appeared when we tried to log in directly with a screen and keyboard.

After power cycling, the server booted without any errors, but some (or all?) of the files that applications had open at the time, such as log files, now have blocks of zeros at their end, so the files seem to be corrupted.

Our disk setup is as follows:

3 x SSD, configured as Software Raid 5 (mdadm) with LVM on top:
    - 1x ext4 Logical Volume for Host OS(Ubuntu 16.04)
    - 1x ext4 Logical Volume holding mysql datadir used from a Virtual Machine

3 x HDD, configured as Software Raid 5 (mdadm) with LVM on top:
    - Raid is configured for 4 disks, with one missing that we'll add later
    - 1x ext4 Logical Volume for data storage

Server Configuration:

384 GB Ram
2x Xeon E5-2620 v4

My questions are:

  • Are the files corrupted in a way that means we need to restore from a backup, or can we continue operation?
  • How could a freshly installed system hang after such a short period of uptime?

My guesses are:

  • the files may be corrupted, so we should restore from a backup
  • the crash may have occurred because the OS file system cache filled up rapidly while a possible misalignment of the RAID + LVM + virtual machine setup made the SSDs too slow to keep up with I/O, resulting in a frozen system
stambata
DaPsul

1 Answer


What was this system doing when it locked up? More info is needed to speculate on causes...

I'd be concerned about the mysql database or anything else important that was being written to. Check your database! Run a data scrub on each array and fsck on each filesystem; maybe this is repairable. If there is any concern about data integrity, restore from backup.

https://wiki.archlinux.org/index.php/Software_RAID_and_LVM#Scrubbing
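As a rough sketch of what a scrub involves: the md driver exposes `sync_action` and `mismatch_cnt` under `/sys/block/<array>/md/`, and writing `check` to `sync_action` starts a read-only consistency check. The helper names below and the `sysfs_root` parameter are my own, added only so the functions can be tried against a scratch directory; array names such as `md0` are placeholders, so take the real ones from `/proc/mdstat`. Writing to the real files needs root.

```python
from pathlib import Path


def start_scrub(array: str, sysfs_root: str = "/sys") -> None:
    """Ask the md driver to start a read-only consistency check of one array.

    `array` is a placeholder name such as "md0"; `sysfs_root` is an
    illustrative parameter so the function can be pointed at a test directory.
    """
    action = Path(sysfs_root) / "block" / array / "md" / "sync_action"
    action.write_text("check\n")


def scrub_status(array: str, sysfs_root: str = "/sys") -> dict:
    """Return the current sync action and the mismatch counter of one array."""
    md = Path(sysfs_root) / "block" / array / "md"
    return {
        # "idle" once the check has finished, "check" while it is running
        "sync_action": (md / "sync_action").read_text().strip(),
        # a non-zero value after a check means inconsistencies were found
        "mismatch_cnt": int((md / "mismatch_cnt").read_text().strip()),
    }
```

Call `start_scrub("md0")` for each array, wait until `scrub_status("md0")["sync_action"]` reads `idle` again, then look at `mismatch_cnt`. The fsck step is separate and should be run on an unmounted filesystem.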

I see no reason why software RAID or LVM, slow SSDs, FS cache, etc. should be considered the primary culprits here. There could be many other reasons. My first concern would be hardware problems (like RAM). You can check that too with tools such as memtest86+.

You don't mention - is the host experiencing corruption or is it the virtual machine?
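One way to find out is to run the same check on the host and inside the guest. The sketch below looks for the symptom described in the question, files that end in a run of zero bytes. The function names and the 512-byte threshold are arbitrary choices for illustration, and trailing zeros are only a hint: some files legitimately end in zeros, so a hit needs a manual look.

```python
import os


def trailing_zero_bytes(path: str, chunk_size: int = 4096) -> int:
    """Count the consecutive NUL bytes at the end of a file."""
    pos = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        # Walk backwards from the end of the file, one chunk at a time.
        while pos > 0:
            read_len = min(chunk_size, pos)
            pos -= read_len
            f.seek(pos)
            chunk = f.read(read_len)
            stripped = chunk.rstrip(b"\x00")
            count += len(chunk) - len(stripped)
            if stripped:
                # Found a non-zero byte, so the trailing run ends here.
                break
    return count


def find_suspect_files(root: str, min_zeros: int = 512) -> list:
    """List (path, zero_count) for regular files ending in >= min_zeros NULs."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            try:
                zeros = trailing_zero_bytes(path)
            except OSError:
                continue  # unreadable file, skip it
            if zeros >= min_zeros:
                suspects.append((path, zeros))
    return suspects
```

For example, `find_suspect_files("/var/log")` on the host and on the VM would show which side has the damaged log files.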

Ryan Babchishin