Since it's probably a hardware fault, I'd look at some hardware diagnostics.
If you have a hardware RAID controller, I'd find out whether you can read its log (for 3Ware, use tw_cli). And whether you have hardware or software RAID, you can look at the SMART parameters of the disks. If the disks are connected to a RAID controller, you may need special commands to access them; see the smartctl manpage.
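As an illustration of those special commands: for 3ware controllers the smartctl manpage describes a -d 3ware,N device type, where N is the port number on the controller. The sketch below only prints one command per port, so you can check them before running anything. The helper name, /dev/twa0 and the port count of 4 are assumptions on my part; adjust them to your controller. Other controllers use other -d types, so again check the manpage.

```shell
#!/bin/sh
# Print (do not run) one smartctl command per port of a 3ware controller.
# $1 = controller device node (assumed here: /dev/twa0)
# $2 = number of ports on the controller (assumed here: 4)
print_3ware_smart_cmds() {
    dev=$1
    ports=$2
    port=0
    while [ "$port" -lt "$ports" ]; do
        echo "smartctl -a -d 3ware,$port $dev"
        port=$((port + 1))
    done
}

# Example call with assumed values; run the printed commands as root.
print_3ware_smart_cmds /dev/twa0 4
```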
If you run:
smartctl -a /dev/sdX
the things I look at first are:
- Reallocated sector count. It's especially bad when it's increasing over time, and I don't fully trust a disk that has any reallocated sectors.
- The SMART error log. It's tricky to read at first, but the main thing is to see whether there are events, and at what time (expressed in disk age in hours) they occurred. You can see the current disk age as one of the SMART parameters (Power_On_Hours). If an event is recent, it may be related to your problem.
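To keep track of those numbers over time, something like the following could work. It's only a sketch: it assumes the classic ATA attribute table printed by smartctl -A, where the raw value is the 10th column, and smart_attr is a helper name I made up. Some drives report Power_On_Hours in a different raw format and NVMe devices print a different layout altogether, so compare with your own output first.

```shell
#!/bin/sh
# Print the raw value (10th column) of one SMART attribute.
# Reads `smartctl -A` output on stdin; $1 is the attribute name.
smart_attr() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Usage (needs root; /dev/sdX is a placeholder for your disk):
#   smartctl -A /dev/sdX | smart_attr Reallocated_Sector_Ct
#   smartctl -A /dev/sdX | smart_attr Power_On_Hours
#   smartctl -l error /dev/sdX    # prints only the SMART error log
```

Logging the reallocated sector count once a day makes it easy to see whether it is growing.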
Also, keep an eye on dmesg and syslog to see if you get errors over time. For example, disk errors often show up as ATA exceptions long before they become a fatal problem. We have a central logging server (using rsyslog) that notifies me about ATA exceptions. A quick example of how to set that up:
/etc/rsyslog.d/60-smtp.conf:
$ModLoad ommail
$ActionMailSMTPServer localhost
$ActionMailFrom noreply@example.com
/etc/rsyslog.d/70-mail-ata-errors.conf:
$ActionMailTo you@example.com
$template mailSubjectATA,"ATA error on %hostname%"
$template mailBodyATA,"You have ATA errors. Mostly it's the disk and you get these errors before a possible mdraid setup kicks the drive.\r\nBEWARE: ata1.00 is first ata, first disk. Ata1.01 is first ata, second disk. Use the ata-to-device-names.sh script to identify devices.\r\n msg='%msg%'"
$ActionMailSubject mailSubjectATA
$ActionExecOnlyOnceEveryInterval 3600
:msg, regex, "ata.*exception" :ommail:;mailBodyATA
See here for the ata-to-devicenames script.
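This is not the script referenced above, just a rough sketch of how such a mapping can be done. It assumes a reasonably recent Linux kernel where the resolved sysfs path of a disk contains an ataN component, and the function names are mine. It only resolves the port (ata1), not the device number after the dot (.00 or .01).

```shell
#!/bin/sh
# Extract the "ataN" component from a resolved sysfs path.
# Prints nothing if the path has no such component (e.g. USB disks).
ata_port_of_path() {
    printf '%s\n' "$1" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p'
}

# Print "ataN sdX" for every sd* disk that sits behind an ATA port.
list_ata_disks() {
    for blk in /sys/block/sd*; do
        [ -e "$blk" ] || continue
        port=$(ata_port_of_path "$(readlink -f "$blk")")
        if [ -n "$port" ]; then
            echo "$port ${blk##*/}"
        fi
    done
}

list_ata_disks
```

If the mail mentions ata3.00, look for the ata3 line in the output to find the matching sdX name.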
Another thing you can do is run a memtest. Ubuntu installation DVDs/CDs have one in the boot menu, and I believe any Ubuntu server has one in its regular boot menu as well. Let it make at least one pass, more if possible.
Do you have ECC RAM, BTW? ECC RAM is important for long-term stability and data integrity.