I have a small Ubuntu server running at home, with 2 hard drives. There are two software raids (raid1) on the disks, managed by mdadm, which I believe is irrelevant, but mentioning it anyway.
Both hard drives are Western Digital and had been in use for around 2 years when one of them started making clicking noises and died. I figured that maybe that's natural after 2 years, so I bought a new one and resynced the raid arrays. After about a month, the other drive also died.
I didn't get suspicious: both drives had been bought at the same time, so it's not that surprising to see them fail close to each other. I bought another one.
So far, 2 old drives failed, and 2 brand new in the system. After one month, one of the new drives died. This is when it started getting suspicious. Since the PC was put together from some really old parts (think AthlonXP), I figured that maybe the motherboard's SATA controller is the culprit. Of course you can't switch parts easily in an old PC like this, so I bought a whole system, new MB, new CPU, new RAM. Took the just failed drive back, since it was under warranty, and got it replaced.
So it is up to 2 failed drives from the old ones, and 1 failed drive from the new ones. No problems, for 1 month. After that errors were creeping up again in /var/log/messages, and mdadm was reporting raid array failures. I started tearing my hair out. Everything is new in the system, it's up to the third brand new hard drive, it's simply not possible that all of the new drives that I bought were faulty.
Let's see what is still common... the cables. Okay, long shot, let's replace the SATA cables. Take hard drive back, smile to the guy at the counter and say that I'm really unlucky. He replaces the hard drive. I come home, one month passes and one of hard drives fails, again. I'm not joking.
Two of the brand new hard drives have failed. Maybe it's a bug in the OS. Let's see what the manufacturer's testing tool says. Download testing tool, burn it to a CD, reboot, leave hard drive testing overnight. Test says that the drive is faulty, and I should back up everything, if I still can. I don't know what's happening, but it does not look like a software problem, something is definitely thrashing the hard drives.
I should mention now, that the whole system is in a shoebox. Since there are a load of "build your own ikea case" stuff, I thought there shouldn't be any problems throwing the thing in a box, and stuffing it away somewhere. The box is well ventilated, but I thought that just maybe the drives were overheating. There is no other possible answer to this. So I took the hard drive back, and got it replaced (for the 3rd time), and bought hard drive coolers.
And just now, I have heard the sound of doom. click click whizzzzzzzzz. SSH into the box:
You have new mail!
mail
r 1
DegradedArrayEvent on /dev/md0 ...
dmesg output:
[47128.000051] ata3: lost interrupt (Status 0x50)
[47128.000097] end_request: I/O error, dev sda, sector 58588863
[47128.000134] md: super_written gets error=-5, uptodate=0
[48043.976054] ata3: lost interrupt (Status 0x50)
[48043.976086] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[48043.976132] ata3.00: cmd c8/00:18:bf:40:52/00:00:00:00:00/e1 tag 0 dma 12288 in
[48043.976135] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[48043.976208] ata3.00: status: { DRDY }
[48043.976241] ata3: soft resetting link
[48044.148446] ata3.00: configured for UDMA/133
[48044.148457] ata3.00: device reported invalid CHS sector 0
[48044.148477] ata3: EH complete
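To see at a glance which port and device the errors cluster on, the dmesg lines above can be tallied with a short script. This is only a minimal sketch: the `summarize` helper and its regular expressions are hypothetical and written against the message formats shown in this post, not against any official kernel log specification.

```python
import re
from collections import Counter

# A few of the dmesg lines from above, used as sample input.
SAMPLE = """\
[47128.000051] ata3: lost interrupt (Status 0x50)
[47128.000097] end_request: I/O error, dev sda, sector 58588863
[47128.000134] md: super_written gets error=-5, uptodate=0
[48043.976054] ata3: lost interrupt (Status 0x50)
[48043.976086] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[48044.148446] ata3.00: configured for UDMA/133
"""

# Assumed line layout: "[seconds.micros] message"
LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
ATA_PORT = re.compile(r"^(ata\d+)(?:\.\d+)?:")
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")
TROUBLE_WORDS = ("lost interrupt", "exception", "timeout")


def summarize(text):
    """Count troubling events per ATA port and collect I/O errors."""
    port_events = Counter()
    io_errors = []
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        msg = match.group("msg")
        port = ATA_PORT.match(msg)
        if port and any(word in msg for word in TROUBLE_WORDS):
            port_events[port.group(1)] += 1
        io = IO_ERROR.search(msg)
        if io:
            io_errors.append((io.group(1), int(io.group(2))))
    return port_events, io_errors


port_events, io_errors = summarize(SAMPLE)
print(dict(port_events))
print(io_errors)
```

If every event lands on the same port while the drives themselves keep changing, that points away from any single drive and toward something shared, such as power or the controller.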
Recap:
- No possibility of overheating
- 6 drives have failed, 4 of which were brand new. I'm no longer sure whether the original two were faulty on their own, or suffered the same thing as the new ones.
- There is nothing common in the system, apart from the OS which is Ubuntu Karmic now (started with Jaunty). New MB, new CPU, new RAM, new SATA cables.
- No, the little holes on the hard drive are not covered
I'm crying. Really. I don't have the face to return to the store now, it's not possible for 4 drives to fail in under 4 months.
A few ideas that I have been thinking about: Is it possible that I mess up something when I partition and resync the drives? Can it be so bad that it physically wrecks the drive? (The vendor-supplied tool says that the drive is damaged.) I do the partitioning with fdisk, and use the same block size for the raid1 partitions (I check the exact block sizes with fdisk -lu).
Is it possible that the Linux kernel or mdadm, or something is not compatible with this exact brand of hard drives, and thrashes them?
Is it possible that it may be the shoebox? Try placing it somewhere else? It's under a shelf now, so humidity is not a problem either. Is it possible that a normal PC case will solve my problem (I'm going to shoot myself then)? I will get a picture tomorrow.
Am I just simply cursed?
Any help or speculation is greatly appreciated.
Edit: The power strip is guarded against overvoltage.
Edit2: I moved house during these 4 months, so the chance that "dirty" electricity in both places is the cause is very low.
Edit3: I have checked the voltages in the BIOS (couldn't borrow a multimeter), and they all seem correct. The biggest discrepancy is on the 12V rail, which reads 11.3V. Should I be worried about that?
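As a rough way to judge that reading: the commonly cited ATX tolerance for the positive rails is ±5%, which for 12V works out to roughly 11.4V to 12.6V. The sketch below just does that arithmetic. The ±5% figure is an assumption taken from the usual ATX design guidance, the `rail_status` helper is hypothetical, and BIOS sensor readings can be inaccurate, so a multimeter measurement would be more trustworthy.

```python
ATX_TOLERANCE = 0.05  # assumption: ±5% on the positive rails


def rail_status(nominal, measured, tolerance=ATX_TOLERANCE):
    """Return (low, high, in_range) for a rail reading in volts."""
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    return low, high, low <= measured <= high


low, high, ok = rail_status(12.0, 11.3)
print(f"12V rail: allowed {low:.2f}V to {high:.2f}V, measured 11.30V, in range: {ok}")
```

By that yardstick 11.3V sits slightly below the allowed window, which would be consistent with the power supply being suspect.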
Edit4: I put my desktop PC's PSU into the server. The BIOS reported much more accurate voltage readings, and the raid1 array rebuilt successfully, which took some 3-4 hours, so I feel a little positive now. Will get a new PSU tomorrow to test with that. Also, attaching a picture of the box: (disregard the 3rd drive)
It's the opposite, they hate me. With passion. – K. Norbert – 2010-01-21T08:35:28.137
WishCow, if drive testing occurred with a flaky power supply then it only reflects drive operation with flaky power. Many times hardware that fails with poor power supplied to it will work fine when supplied with proper power.
Frankly, bad power constitutes a HUGE fraction of all hardware problems. My first action when I suspect a bad hardware component is to try a known-good power supply... – Richard T – 2010-01-21T15:45:59.957
A power strip will only protect you from overvoltage; it will not protect you against undervoltage. As indicated by others, a UPS (at least any worth its salt) will 'clean' dirty power because it will run from battery, instead of direct-from-the-outlet power. – Wayne Hartman – 2010-01-21T21:04:17.963
@Richard T: I'm 100% sure that the new drives that I have bought are busted, because the guy in the shop didn't believe me, and tested the HDDs himself, on his own computer, and it reported errors there too.
Also, see edit4. – K. Norbert – 2010-01-24T23:19:17.713
WishCow, I'm sorry for your loss but this is 100% your own doing. Without a common ground the hardware can't work properly - no point in still beating a dead horse. I'm sure it's somewhere in the specifications that each manufacturer expects an installation where the hardware they provide is connected to ground through the "chassis." And you didn't provide it. Honestly, you've got to just get on with it. The lesson is simple: "With a non-conductive chassis, the constructor must provide some path from each component to the power supply chassis." That last sentence is Gospel Truth, I swear. – Richard T – 2010-02-01T03:34:27.123
Could you please give me some directions on how I should ground the components? I'm not sure I understand the concept behind it. – K. Norbert – 2010-02-01T10:47:38.410
Hi WishCow, you ground the components by connecting them all together with any conducting material. Traditionally, people use a "case", but you can use wires. The disk drives have lots of threaded holes for screws - these are perfect. The motherboard may be a bit more tricky because it was intended to be grounded through the mounting studs in a case. They make "stand-off" fasteners that have a screw on one end and have threads in the other. You can use one of these, a screw and a nut to attach to one of the board's mounting holes, keeping your wire attachment off the board itself. -cont- – Richard T – 2010-02-07T17:02:07.150
...Get your stand-off (you only need one) from any computer store or enthusiast - they usually give you too many when you buy a case and anyone who handles such things should have some lying around. It is important to keep the grounding wire away from the board because you don't want to accidentally ground some signal trace that's a part of the surface of the board.
Ground ALL the components by tying them ALL together, especially including the power supply.
Oh, just remembered: unused serial and parallel port connectors would make great mounting points for a ground, too... – Richard T – 2010-02-07T17:06:04.457
Ventilation kinda sucks in that box too. – Daniel R Hicks – 2012-03-20T20:05:18.777
Sucks to be you. Sympathy. – mtone – 2010-01-20T23:36:33.197
why do you hate hard drives so much?! – Jeff Atwood – 2010-01-21T01:14:56.263