Two systems freezing up: probably a RAID / motherboard SATA controller failure?

1

I have two systems of approximately the same age with similar trouble:

First system:

  • ASUS P8H67-M LE motherboard (rev. 3)
  • Intel Core i7-2600 (3.4 GHz)
  • 8 GB DDR3 RAM (2 x 4 GB, dual channel)
  • RAID 1 via Intel RST with two 1 TB WD Green HDDs
  • Cooler Master 600 W PSU
  • Windows 7 Professional 64-bit (original license)
  • System protected from power surges by a 10 kVA UPS

Symptoms:

The system worked fine for almost three years. Last month the RAID degraded and rebuilt after a bad power-off caused by an application hang. After that it degraded and rebuilt several times even after normal power-offs. In the past two weeks the system has started hanging (freezing completely; sometimes the mouse pointer also freezes while mice on other machines still move), and the freezes seem to be getting more frequent.
After each freeze I have to reset the system, and every time it starts regenerating the RAID 1 array (a rebuild takes four hours). It is now freezing about once per day.

Things I have tested:

  • New RAM and a new PSU give the same problem.
  • Running without RAID (removing one HDD) apparently solves the problem.
  • The HDDs are fine (tested in another system with a stress test, short self-test and long self-test). The SMART logs also look OK.
  • The processor passes a stress test.
  • Temperatures are OK; the system is not overheating.
  • Moving one HDD to another system with Intel RST, I can't access it (the BIOS sees the drive, the RST controller does not show it, but Windows Device Manager does show it). Moving that same drive to another system without Intel RST, I CAN access it???
  • Moving the server app to another similar system solves the problem, so it is not an app issue; it has to be hardware related.

Problem: when the system freezes, I get nothing in the Windows event log. No app hang, no RAID trouble, nothing. The RST log on Windows is useless: no detail about which HDD went out of sync, just a degraded status (at least on my system).

Strange thing I noticed: adding another internal HDD to the system (outside the RAID, for backups) seems to trigger a RAID degraded state and start the RAID 1 regeneration.

My guess is that the motherboard is failing.

Second system:

  • Intel Core i5 processor (can't recall the exact model now)
  • ASUS H81M-K motherboard
  • 8 GB RAM
  • 2 x 1 TB WD Caviar Blue HDDs
  • Software RAID (Windows)
  • Windows 7 64-bit

Symptoms: the server ran fine for approximately two years. A month ago the Windows software RAID went out of sync, and a resync attempt never finished (four days of waiting).
The server application started to hang frequently (no reboot required, just reopening the app) or to drop terminal connections. Moving the server app to another similar system DID solve the problem, so it is not an app issue.

Tests I have conducted:

  • Formatting one HDD in the system took over one day and never finished. I removed that HDD (the one that went out of sync) and tried it in another system: the format finished in a normal time, and SMART looks OK.
  • Moving that HDD back and trying to resync the software RAID again: the resync never finished.
  • Changing the PSU and RAM did not solve the problem.
  • Removing the HDD that went out of sync did not solve the problem; the app hangs anyway.


What both systems have in common:

  • approximately two years of usage
  • heavy HDD read/write activity
  • the server apps are different
  • same brand of HDDs
  • the HDDs seem fine in other systems
  • same OS, both legal copies
  • memory and PSU are not the cause
  • no sign of visible damage on the motherboards
  • no one has touched the systems' internals

My guess is that the SATA ports/controller somehow cannot handle the intense HDD activity and have degraded/broken over time, producing failures that look different on the two systems because of the different RAID types.

Ramiro85

Posted 2016-03-25T20:13:45.757

Reputation: 11

Answers

0

Don't just rebuild your RAID over and over! Figure out why the controller failed the drive and (most likely) replace the drive. You can use a command-line utility called smartctl to check the drive: if any attribute has more than around 500-1000 errors, or an error count that keeps incrementing, then it's probably time to replace the drive.
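
To make that check concrete, here is a minimal sketch, assuming smartmontools is installed and the drive shows up as /dev/sda (a placeholder device path; adjust it for your system, and a drive behind a hardware RAID controller may also need smartctl's -d option). It pulls a few error-related SMART attributes and flags large raw values:

```python
# Minimal sketch: list a few error-related SMART attributes via smartctl.
# Assumes smartmontools is installed and /dev/sda is the drive to inspect
# (placeholder path; adjust for your system).
import subprocess

DEVICE = "/dev/sda"  # placeholder device path
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable", "UDMA_CRC_Error_Count"}
THRESHOLD = 500  # rough rule of thumb from the answer above

output = subprocess.run(["smartctl", "-A", DEVICE],
                        capture_output=True, text=True, check=False).stdout

for line in output.splitlines():
    fields = line.split()
    # Attribute rows look like: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    if len(fields) >= 10 and fields[1] in WATCHED:
        raw = int(fields[9])  # raw error count is the tenth column
        status = "consider replacing" if raw > THRESHOLD else "ok"
        print(f"{fields[1]:<25} raw={raw:<10} {status}")
```

When a hardware RAID controller hides the physical disk, smartctl's -d option, or simply attaching the drive directly to another machine, gets you at the real data.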

If you have to rebuild a RAID drive after docking it to something else, it's because the RAID-specific metadata on it got mucked up a little (the Dell/LSI PERC cards keep this extra piece of data on the drive, but I've never triggered a rebuild just by manually mounting a drive).

Lastly, all sorts of hardware problems can cause system freezes. A bad RAID card can cause a freeze, as can electrical problems with your hard drives, or controller problems on the hard drive itself. Occasionally filesystem corruption will trigger a kernel crash dump, but that should be really visible and obvious if it's the cause of the problem. Something weird I saw once: the heat from worn-out bearings on a hard drive was causing temperature problems in the computer (laptops are prone to that), so a worn hard drive could cause temperature problems with your video card, which could totally freeze everything. It doesn't hurt to check the kernel messages from right before your machine froze (/var/log/kern.log on Debian/Ubuntu). An insufficiently powerful power supply can also cause crashes. In general, try disabling unneeded hardware until the system stops crashing :).
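
As a rough illustration of that last log check, here is a sketch, assuming a Debian/Ubuntu-style /var/log/kern.log (the path and the patterns are assumptions; on the Windows boxes in the question, the equivalent place to look is the System event log). It pulls out ATA/SATA-related error lines, since those are what usually show up right before a storage-related freeze:

```python
# Rough sketch: scan the kernel log for ATA/SATA errors that often precede
# a controller- or cable-related freeze. Assumes /var/log/kern.log exists
# (Debian/Ubuntu layout); adjust the path and patterns for other setups.
import re

LOG_PATH = "/var/log/kern.log"  # assumed location
PATTERN = re.compile(
    r"ata\d+(\.\d+)?:.*(error|failed|reset|timeout)"
    r"|I/O error"
    r"|SATA link (down|up)",
    re.IGNORECASE,
)

with open(LOG_PATH, errors="replace") as log:
    hits = [line.rstrip() for line in log if PATTERN.search(line)]

# The lines logged just before the freeze are the interesting ones.
for line in hits[-20:]:
    print(line)

if not hits:
    print("No ATA/SATA error lines found in", LOG_PATH)
```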

Some Linux Nerd

Posted 2016-03-25T20:13:45.757

Reputation: 126

Thanks for the help.
My drives only fail in RAID 1 mode; if I remove a drive and put it in another system, it works fine! Re "... probably time to replace the drive": I have looked at SMART on the drives and the number of errors is constant. **The problem is that Intel RST does not show which drive is failing, neither in the OS event viewer (nothing there, except when I remove an HDD and boot the system, which does leave a log entry) nor when the RAID ROM boots outside the OS; it just shows a Degraded status on both HDDs.** Re "... filesystem corruption ...": I forgot to mention that I checked the filesystem with chkdsk.
– Ramiro85 – 2016-03-26T04:08:45.973

The SMART data from a RAID controller usually isn't too helpful. If you can safely attach a drive to another machine without the card (if you haven't done so already), you can get the real SMART error info. That's weird, so the RAID card says the FS is corrupted and Windows says it's OK? ummmmmmm /me shrugs – Some Linux Nerd – 2016-03-28T18:54:04.387

Usually it marks drives as failed because it tries to write to a sector and the sector is bad or otherwise doesn't work. – Some Linux Nerd – 2016-03-28T19:44:35.770