Linux media box/mail server suffering filesystem/RAID errors, kernel panics

3

1

In the past week, the mini-ITX machine I built myself to serve mail and Samba shares has kernel panicked twice with filesystem-related errors. Last night I noticed data integrity errors (video artifacts) when streaming a movie to my set-top client, so I started poking around.

Both the internal hard drive and the external hard drive use Linux software RAID, and on either mirror, if I run md5sum repeatedly on a fairly large file like a video, I get a different checksum each time. (I should note that one is ext4, the other is JFS.) I booted off a USB stick into recovery mode and the same thing happens. I haven't tried reading the external mirror on another computer, but I did mount one of the constituent disks and it seemed fine; at least it was giving consistent md5sums there.
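For anyone wanting to reproduce the repeated-checksum test without retyping md5sum, here is a minimal Python sketch of the same idea. The function names (`file_md5`, `repeated_digests`, `is_consistent`) and the chunk size are my own choices for illustration, not anything from the tools mentioned above. Repeated reads may also be served from the page cache rather than the disk, so this shows that reads are inconsistent, not where the corruption happens.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, runs=5):
    """Hash the same file `runs` times and return the list of digests."""
    return [file_md5(path) for _ in range(runs)]


def is_consistent(digests):
    """True when every digest in the list is identical."""
    return len(set(digests)) <= 1
```

Calling `is_consistent(repeated_digests("/path/to/large/file"))` on a healthy machine should return `True`; the path here is a placeholder.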

So, the filesystem has been ruled out (it's happening on both ext4 and JFS), the hard drives are probably out (it would be an incredible coincidence), the SATA controllers are probably out since it's happening on two completely independent controllers, and a corrupted kernel module or similar is out since it happens even when booting off the rescue disk.

The fact this is happening to two separate sets of drives, controlled by two separate SATA controllers, running two different filesystems, and the behaviour is preserved when booting two different kernels makes me think the only plausible option is that there is something horribly wrong with the motherboard. This motherboard was already an RMA replacement from a company I don't particularly trust (Zotac), so it would be less surprising than usual.

This is Ubuntu Server 10.04, by the way, 64-bit, on a Zotac IONITX-C (I think) motherboard with an Atom N230.

Does anyone have any other ideas, diagnostics I should perform, etc.?

EDIT: Two things I forgot to mention: when I booted off the USB key I did run fsck on both md devices quite a bit.

This is what the panics look like:

[screenshot of the kernel panic output]

I've tried searching Google for a few of these without much success, but I think the hardware is more likely to blame anyway; I just don't know which specific piece of hardware.

EDIT 2: Just ran memtest86, and not a single test is passing. The least significant 2 bytes of the test pattern always seem to be read back wrong. Still not sure whether it's RAM or chipset, and I don't have an extra stick of RAM to test with.
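To make "the least significant 2 bytes are wrong" concrete: if an error report gives the expected and the actual word, XOR-ing the two shows exactly which bytes differ. The sketch below is only an illustration; the function name `differing_bytes` and the sample values in the usage note are hypothetical, not numbers from my actual memtest86 run.

```python
def differing_bytes(expected, actual, width_bytes=4):
    """Return the indices of bytes that differ between two words.

    Index 0 is the least significant byte. `width_bytes` is the word
    size in bytes (4 assumes a 32-bit test pattern).
    """
    diff = expected ^ actual  # set bits mark positions where the words disagree
    return [i for i in range(width_bytes) if (diff >> (8 * i)) & 0xFF]
```

With the made-up values `differing_bytes(0xFFFFFFFF, 0xFFFF0000)`, the result is `[0, 1]`, i.e. the two least significant bytes.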

dwf

Posted 2010-07-26T19:55:32.403

Reputation: 133

Did you fsck the partitions when you booted off the USB device? – matthias krull – 2010-07-26T20:42:46.027

Yes, multiple times. They were fine. – dwf – 2010-07-26T21:14:36.883

Answers

1

My vote is RAM going bad, or possibly something on the chipset. Can you swap the RAM with known good RAM and see how it goes? (Most modern Linux distributions also have a "memtest" option on the install disc that you can try if you don't have known good RAM lying about, although I'd suggest swapping in good RAM as the better test.)

gkrash

Posted 2010-07-26T19:55:32.403

Reputation: 26

Thanks for reminding me, I'll definitely run memtest86 when I get home. – dwf – 2010-07-26T23:26:54.550

It turns out that one of the sticks of RAM was defective; the other one is fine. I've submitted an RMA request. Thanks! – dwf – 2010-07-27T03:02:25.120