
I purchased a used PowerEdge T610 and upgraded it to 2x hex-core Xeon X5675 processors and 96 GB of RAM. Initially I used three 2 TB WD Green drives in a RAID 5 array (PERC 6/i controller) and installed Ubuntu Server on the virtual disk. This setup served me well for about a year, and then the problems started:

I bought some new drives to expand with a second array: four 3 TB WD Red drives. In the meantime I had learned that WD Green, at least, is not a good choice, so I wanted to back up some data onto the new VD. It turns out the PERC 6/i does not like drives larger than 2 TB; it only recognised the first 2 TB of each 3 TB drive. I had not yet started setting up a VD with the new drives, but three weeks later my WD Green array started corrupting (first only strange glyphs in some software, then more severe issues up to a corrupted boot sequence). I ended up with a professional data recovery service, who luckily could help me. I exchanged the PERC 6/i for an H700 and set up a RAID 6 array of the four 3 TB WD Red drives (each of which I tested with the Dell hardware diagnostics extended test before setting up the array; no errors on any of them). Installed Ubuntu, all the software I need, X2Go etc... Up and running again.

Now I am getting the same problem as before: in X2Go it starts with the same software (the bioinformatics Artemis package) spitting out glyphs on the command line, and it seems I am going back to square one. All status LEDs on the caddies are constant green, i.e. online; no predicted failure that the system recognises, at least.
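(For when I am back at the machine: as far as I understand, the controller's own error counters can also be queried from Linux with LSI's MegaCli, which I believe works with the H700. This is just a sketch from memory; the binary may be called MegaCli64 or megacli depending on how it was packaged.)

    # Per-drive counters as seen by the controller (media errors, predictive failure etc.)
    sudo megacli -PDList -aAll | grep -E 'Slot|Firmware state|Media Error|Other Error|Predictive Failure'
    # State of the virtual disk (Optimal / Degraded / Partially Degraded)
    sudo megacli -LDInfo -Lall -aAll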

I am starting to wonder what the problem could be:

What I don't think is likely:

- primary disk failure (again!), since the drives were new, had no bad sectors upon extended testing and haven't had much power-on time at all
- the PERC 6/i controller, since it was exchanged for an H700 after the first disaster and should not be the problem

What I need help to evaluate:

- backplane/cable issues? (The H700 controller came with cables for another server type that did not fit my case, so I simply used another SATA 6 Gb/s cable to connect the controller to the backplane.) The drives are, by the way, sitting in the same bays as the previous, failing ones, with an original Dell SATA cable going there.
- motherboard issues?
- CPU or RAM issues? (see the log check sketched below)
- power supply (voltage peaks??)
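For the CPU/RAM/motherboard side, I assume the kernel would log machine-check events if it caught any; something like this is what I plan to look at once I have access again (package and log names are what I'd expect on my Ubuntu version, so treat this as a rough sketch):

    # machine-check exceptions and other hardware errors noticed by the kernel
    dmesg | grep -iE 'mce|machine check|hardware error'
    # if the mcelog package is installed, decoded errors end up in its log
    sudo apt-get install mcelog
    cat /var/log/mcelog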

Has anyone had a similar problem before? Any help here is much appreciated. Unfortunately I am away for another two weeks before I can get access to the server (both physically and over the network); the issue has been "reported" by my wife, who works with the server on our local network (but unfortunately won't be able to help with troubleshooting).


Yes, I did run the complete Dell hardware diagnostics procedure when the problems first arose. Only one of the drives was detected as having defective blocks, but I was unable to rebuild the RAID 5 array, hence the data recovery specialist. All of the other hardware checked out fine.

I just wonder if there could be intermittent problems, like glitchy contacts somewhere, that can pass the tests at one point and fail at another time. Or whether the tests simply don't cover all scenarios...

  • Did you test the rest of your hardware when the problems arose in the first place? – Spooler Mar 11 '18 at 19:18
  • Are all drivers up to date? Also be aware that some vendors do not recognize drives that are not their own. Even if a Dell-branded drive is made by, say, Western Digital, the firmware is different and the Dell controller may have issues with the non-Dell drive. – Dave M Mar 11 '18 at 19:32
  • Drivers? Or firmware? The Ubuntu system was newly set up, so the drivers should be OK. Firmware, I must admit, I don't know. I think I did updates a year ago when I bought the server used, but I may want to start off by checking and updating again when I'm back at home. – kruemelprinz Mar 11 '18 at 19:36
  • Is there any way to make the server swallow unbranded accessories and drives? – kruemelprinz Mar 11 '18 at 19:37

2 Answers


From experience, it sounds like a RAM corruption issue. The first thing I would try is a memory diagnostic tool; Dell has them available for download.
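If you want a quick look from within Ubuntu before taking the box down for the bootable diagnostics, something along these lines might help. This is only a sketch; the package names are the ones I'd expect on Ubuntu, and a userspace test is no substitute for a full memtest86+ or Dell diagnostics pass:

    # ECC error counters exposed by the kernel EDAC driver (if it supports this chipset)
    sudo apt-get install edac-utils
    edac-util -v
    # or read the raw counters directly
    grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count

    # userspace memory stress test (only exercises memory it can allocate)
    sudo apt-get install memtester
    sudo memtester 4G 3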

If that finds no errors, I would strip the hardware down to the bare minimum needed and then add components back until you see the issue. Very time-consuming, but sometimes it's the only way when diagnostics show nothing. Obviously it is difficult to do this with the hard drives, but you can do it with CPU and RAM. Don't forget to add things back one at a time, or else you won't know which one is to blame.

My other suggestion is to use a hypervisor and create virtual machines instead of installing on bare metal. This will make restoring functionality in the face of failures much easier. Also, establishing a backup regime before installing applications will help you avoid needing data recovery services again.
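For the backup part, even something very simple goes a long way. A minimal sketch, assuming your data lives under /data and a second disk or NAS is mounted at /backup (both paths are placeholders, adjust to your layout):

    # nightly rsync copy at 02:00, added via `crontab -e`
    0 2 * * * rsync -a --delete /data/ /backup/data/

An off-machine (or at least off-array) copy is the important part; a RAID array on the same controller is not a backup, as your WD Green episode showed.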

Cubano
  • Wow. That sounds devastating... Given that the problem started occurring at least a week after a fresh install, we are talking about months until the actual RAM chip or CPU is identified, and then it could still be the board, couldn't it? Like a defective RAM slot or CPU socket causing exactly the same problems... Probably it is easier to just replace everything. Thanks for your input! – kruemelprinz Mar 13 '18 at 15:27

Bad luck? Please test the HDDs in another, newer computer to see their current status.
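If you do hook them up to plain SATA ports (or a USB dock), smartmontools gives a quick health readout; roughly along these lines, with /dev/sdX standing in for whatever device name the drive gets:

    sudo apt-get install smartmontools
    sudo smartctl -a /dev/sdX        # overall health, reallocated/pending sector counts
    sudo smartctl -t long /dev/sdX   # kick off the drive's own extended self-test

    # while the drives are still behind the H700, smartctl can sometimes reach them
    # through the controller; the device name and disk index here are just examples
    sudo smartctl -a -d megaraid,0 /dev/sda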

Keep in mind a T610 is about nine years old. I honestly think any current desktop would be faster than a T610.

Drive firmware can have an impact, but in that case your array would flag the drives as foreign disks. The fact that you changed them all at once is better: no Dell drives with Dell firmware mixed with vanilla drives, which the controller would not allow.

The Dell firmware on the disks lets the controller perform advanced functions with them, while an array made of vanilla disks with standard firmware will simply act normally.

The fact that your array was detected makes me think the controller can see and use the drives. That is why I said bad luck at first.

yagmoth555