
A development server I'm responsible for (ext3 on RAID 5, running Debian Squeeze) froze up over the weekend and I was forced to reset it: it was unresponsive to KVM and physical keyboard access, no eth devices were responding, etc. Not even the backup process ran (figures, the one time I don't check for confirmation).

So after the reset, it turns out that every trace of disk IO activity that should have happened over a period of ~24 hours is completely gone. The log files have a big gap in their timestamps, as if the writes were never committed to disk, and no processes seem to have run.

Luckily it was a weekend, so nothing of value was lost, and I don't suspect a hack.

What can I do in a post mortem of this event to prevent it from ever happening again? I've seen this happen before on a completely different machine running FreeBSD.

I am rounding up the disk-checking tools right now, but there must be more going on!
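For reference, this is roughly what I'm checking so far (a sketch, not a definitive list; the device names are examples for my setup):

    # Overall SMART health and self-test results for each drive
    # (drives behind a hardware RAID controller may need a passthrough
    # flag, e.g. smartctl's -d megaraid,N, if smartmontools supports it)
    smartctl -a /dev/sda

    # Filesystem and journal state, read-only and safe while mounted
    dumpe2fs -h /dev/sda1 | grep -i -e state -e journal

    # Schedule a full fsck of the root filesystem on the next reboot
    touch /forcefsck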

  • Mount options: /dev/sda1 on / type ext3 (rw,errors=remount-ro)
  • Kernel: Linux dev 2.6.32-5-686-bigmem
  • Disk/inode usage: 13%/3%
thinice
  • Do you have any hardware details? It sounds like a disk subsystem issue. Did you look at the kernel messages? dmesg? – ewwhite Jun 20 '11 at 19:55
  • messages/dmesg have nothing for the duration of the freeze. Hardware is a Dell R710 with 15k SAS drives in RAID 5 on a PERC H700; the drives are SEAGATE ST3146356SS. All 3 drives have "OK" SMART codes and pass their self-tests. – thinice Jun 20 '11 at 20:15
  • Maybe the server froze at the time the disks stopped persisting, not the other way around. When exactly was the server marked down in your monitoring environment? – Martin M. Jun 20 '11 at 21:17
  • About 01:12am local Sunday (06/19). Most accurate timestamp in a log file I can find is 01:12:12AM. I don't see any environmental factors (power blips, etc) that would correlate either. – thinice Jun 20 '11 at 21:26
  • 1
    How do you know that the "disk IO that should have happened for a period of ~24H is completely gone" versus "no disk I/O happened because the server locked up and no processes ran"? – Mark Wagner Jun 20 '11 at 23:25
  • @embobo You're right - I've updated my question to make the issue more broad than IO; processes that should have run did not (e.g.: backup). – thinice Jun 21 '11 at 18:33
  • 1
    Something like this happened to me because of a bug in the firmware of my raid controller. Did you check the servers logs (in the bios) or the controllers logs? – Jure1873 Jun 23 '11 at 16:51

2 Answers


This sounds familiar to me. Do you have an Intel CPU? If so, what are your green-mode settings in the BIOS? Is your BIOS up to date?

Which Intel microcode patch does your Debian system apply during boot?

I have had similar situations where an R310 froze up (on weekends, during periods when nothing was happening). That was fixed by an Intel microcode update (on CentOS 5 in my case).

Dell recommended a BIOS upgrade, which in turn applied the same microcode update.

In other cases I have seen Intel C sleep states be responsible.
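A quick way to check both (a rough sketch; exact messages and package names vary by kernel and distribution, and the kernel parameter below is one common way to limit C-states):

    # Did the kernel apply a microcode update at boot?
    dmesg | grep -i microcode

    # Is a microcode update package/service installed?
    dpkg -l | grep -i microcode

    # To rule out deep C-states without touching the BIOS, append this
    # to the kernel command line in your boot loader config and reboot:
    #     processor.max_cstate=1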

Nils
  • 7,657
  • 3
  • 31
  • 71
  • Intel E5620, and the BIOS says the C-states are enabled. BIOS is 2.1.9 (2010-08-13). I don't see anything about microcode for Intel – thinice Jun 23 '11 at 23:15
  • I am not sure about Debian, but normally the service for the Intel microcode is named "microcode_ctl". Anyway, in your case: disable the C-states in the BIOS. – Nils Jun 24 '11 at 08:35
  • I wouldn't expect the microcode to be updated from userspace after the system is already running; it's better to update the system firmware (BIOS, ESM, RAID) so that the current microcode is loaded before the kernel boots. – mtinberg Jun 24 '11 at 22:55
  • @mtinberg Many hardware vendors and system admins don't expect this, but on some systems microcode_ctl is on by default, so you can end up with an old microcode even though your BIOS provides a newer one. I first stumbled across that on a SLES9 system, which used a rather old microcode. – Nils Jun 25 '11 at 20:18

If you don't have an OOPS message from the kernel explaining why it locked up, then you aren't going to be able to troubleshoot much further. You might be able to set up kdump to save some debug output should it happen again, and you could run memtest86 or some other hardware diagnostics, but without further information you can't move forward.
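On Debian, a minimal kdump setup looks roughly like this (a sketch; it assumes the kdump-tools package available in Squeeze, and the memory reservation size is an example):

    # Install the kdump userspace tools
    apt-get install kdump-tools

    # Reserve memory for the crash kernel: append to the kernel command
    # line in your GRUB config, then reboot
    #     crashkernel=128M

    # Enable kdump in /etc/default/kdump-tools
    #     USE_KDUMP=1

    # After rebooting, verify that a crash kernel is loaded
    cat /sys/kernel/kexec_crash_loaded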

mtinberg