Troubleshooting a mysteriously unstable machine

6

I have a machine with a Core i7 CPU, 12 GiB of memory, 4 hard drives and a graphics card/sound card (both add-in PCI-E). This machine is somehow unstable, and I'm wondering how to troubleshoot the remaining issues.

Originally, the machine had an ASUS P6T SE mainboard and an 8800GT, running off a 700 W PSU, with an LG DVD drive and 3 hard drives. Right when I built it, the RAM turned out to be faulty, so it got RMA'd. The sound card is a Creative X-Fi UAA. The first problem was when the 8800GT broke down, but that was easily solved by buying a new card. However, the machine would sometimes BSOD, usually not under system load but when idle, although it BSODed once under load as well. Suspecting the RAM, I ran memcheck overnight and no issues were found. Everything was working fine most of the time.

Some months later (it would BSOD about once a month) the hard drive broke down. Classic head crash; I replaced the hard drive and got the OS/data restored from backup. I then switched the disk configuration to a single system drive, two disks in RAID 0, and one disk for backup.

A few months later, the system started to BSOD more often (three times a day during near idle, i.e. web browsing, RDP). Interestingly, the machine has a WLAN USB stick and it would sometimes BSOD when I started many downloads simultaneously. Once the machine started BSODing this often, I assumed that the mainboard might be faulty, as the disk drives didn't report any problems, the graphics card had only just broken down and been replaced, and an additional memcheck showed no errors. The original BSODs all had some message and not just a STOP error code (for instance, I got 0x00000116 (0xfffffa800a546010, 0xfffff8801020907c, 0x0000000000000000, 0x000000000000000d) or 0x0000003b (0x00000000c0000005, 0xfffff8800138e4c7, 0xfffff8800b96c550, 0x0000000000000000)).

I replaced the mainboard with a different one, and the machine would now suddenly turn off. This led me to the conclusion that the PSU might be faulty, so I tested with a different one. The different PSU had a cable which was too short to reach the DVD drive, so the drive stayed disconnected. With the different PSU (500 W), things were working rock-solid. I put the original 700 W PSU back in, connected it to the DVD drive, and the machine would turn off again. I removed the DVD drive and tested it in a different machine, and indeed, the DVD drive was faulty. With the DVD drive removed, the machine was running stable again.

A few weeks later, during gaming, the machine BSODed with Stop Error 1E without any further information. After rebooting, everything worked fine. On the same day, I wanted to run the backup, and the backup failed with error 0x80070570 (files corrupted). I ran chkdsk, and indeed, on my primary system drive some index ($SSI?) or so was broken; 9 files got deleted, and after that everything was backed up. In order to check the drives, I ran three instances of HD Tune concurrently, and the machine BSODed again with 1E (0x0000001e (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)). Hoping that one of the drives was faulty, I ran HD Tune sequentially overnight, and no error occurred. The machine didn't BSOD, and is running fine again. sfc /scannow also indicated that no system files are broken.

As this machine has had nearly everything replaced (hard drive, graphics card, memory, motherboard, PSU) or removed (DVD drive), do you have any ideas how to troubleshoot what the heck is going on? The weirdest thing is that it works fine now under extreme load for hours straight, but still I had those two failures over the weekend (both under load, interestingly). Each part in isolation seems to work fine, but the combination somehow causes problems. I'm totally lost as to where to troubleshoot, as every time I try to check something, the pesky thing just works fine.

Update: Just got another BSOD (1E), while reading a web site. I got the screen where a memory dump was created, progress bar going up to 100%, but after the reboot, Windows is not aware that the machine crashed. The reliability log does not show a crash. However, looking into the Minidump folder I dug out the minidump from the weekend, and the call stack has a HIDPARSE in it. Can a USB keyboard (or USB mouse) produce a bluescreen?

Update2: I replaced all hard-drive cables and reinstalled Windows. Reinstall worked fine, installing applications for 6 hours straight as well. When turning off, I got a stop error 24. I'm suspecting the primary hard drive to be unreliable (Samsung HD103SJ), as I don't see what else could be causing the problems. HDTune and chkdsk however report that the drive is OK.
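For reference, the stop codes quoted in this question can be decoded with a small lookup. This is only a sketch: the table covers just the four bug checks mentioned here (names as listed in Microsoft's bug check reference), the function names `parse_stop` and `describe_stop` are made up for illustration, and a bug check name only narrows down the subsystem, it does not identify the faulty part.

```python
import re

# Bug check names for the stop codes quoted in this question.
BUG_CHECKS = {
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x24: "NTFS_FILE_SYSTEM",
    0x3B: "SYSTEM_SERVICE_EXCEPTION",
    0x116: "VIDEO_TDR_FAILURE",
}

# For 0x1E and 0x3B the first parameter is the exception code (an NTSTATUS).
NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def parse_stop(text):
    """Split a string like '0x0000003b (0x..., 0x...)' into (code, params)."""
    values = [int(v, 16) for v in re.findall(r"0x[0-9a-fA-F]+", text)]
    if not values:
        raise ValueError("no hexadecimal values found")
    return values[0], values[1:]


def describe_stop(text):
    """Return a one-line description of a stop code string."""
    code, params = parse_stop(text)
    line = f"0x{code:X} {BUG_CHECKS.get(code, 'unknown bug check')}"
    if code in (0x1E, 0x3B) and params and params[0] in NTSTATUS:
        line += f" ({NTSTATUS[params[0]]})"
    return line


if __name__ == "__main__":
    print(describe_stop("0x00000116 (0xfffffa800a546010, 0xfffff8801020907c, "
                        "0x0000000000000000, 0x000000000000000d)"))
    print(describe_stop("0x0000003b (0x00000000c0000005, 0xfffff8800138e4c7, "
                        "0xfffff8800b96c550, 0x0000000000000000)"))
```

Read this way, the crashes span video (0x116), a system service exception with an access violation (0x3B), a generic kernel exception (0x1E) and the NTFS driver (0x24), i.e. several unrelated subsystems.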

Anteru

Posted 2011-04-19T14:23:56.617

Reputation: 244

3 You are experiencing a highly unusual number of failures. Do you have the machine running from a UPS or power line conditioner? It's possible the electricity running into your residence is unstable and power surges/spikes are causing damage to your electronics. – BBlake – 2011-04-19T14:45:39.450

@Anteru from @BBlake's suggestion, it appears that the one constant in your problems is the power coming into the computer. If you haven't already, try a UPS. If that doesn't solve the problem, I would take out everything but the bare essentials: 1 RAM stick, just the video card, 1 hard drive. If it crashes, swap out the pieces with one of the other RAM sticks/hard drives/etc. until you have a stable system. Then add components very slowly (e.g. one a week), and when you start having crashes you know where to look. – Patrick – 2011-04-19T15:01:05.093

No, I haven't, and I wonder how those would be related (other electronics at home work just fine, e.g. the TV and so on). I also have the PC connected through a fuse-protected power strip. I didn't try a UPS though. Any idea how to figure out whether the power line is the source? – Anteru – 2011-04-19T15:03:23.730

@Anteru by getting a UPS :-) There are also power conditioners that don't have the UPS functionality in them, so they are cheaper. I think the most direct way to check the quality of the power coming in would be an oscilloscope, though that's an expensive toy to have unless you are hardcore. – Patrick – 2011-04-19T15:15:33.303

Do you run chkdsk on a regular basis for maintenance? You should. 0x1E can be caused by a bad driver, a virus, or a hard disk error: http://msdn.microsoft.com/en-us/library/ff557408(v=VS.85).aspx – Moab – 2011-04-19T15:42:11.237

Any recommendations on which power conditioner to use? The power supply to the house should be stable; at least nobody here or in the area has ever reported issues with unstable voltage/spikes (oh, and the nearest power plant is actually not far). – Anteru – 2011-04-19T18:11:51.403

Answers

0

Turned out to be bad RAM + HDD. The original RAM was specified at 1.65 V (6 sticks), and even though 4-5 passes of memtest would run fine, the BSODs disappeared once I switched to 1.5 V RAM (3 sticks).

The hard drive was also broken, but replacing the hard drive just reduced the number of different stop codes.

Anteru

Posted 2011-04-19T14:23:56.617

Reputation: 244

2

When this happens I try to exclude the software as well. Could be a hardware/software combination.

What happens if you boot up a Live Linux CD? Knoppix, Ubuntu or whatever? If the system is able to run Linux for an extended period of time without failure, then maybe you have a software problem.

Alternatively you could try to start Windows in Safe Mode (does it still exist in Windows 7? I am a Linux guy myself).

OK, just a few suggestions to narrow down the possible causes. Far too often I've found unstable systems to be caused by software or misconfiguration rather than actual hardware problems.

Good luck!

Anders Hansson

Posted 2011-04-19T14:23:56.617

Reputation: 246

1

This sounds like a heat problem to me. Did you overclock the chip? You may want to use something like http://www.techpowerup.com/realtemp/ to see how hot it is getting; you may just need a better heat sink and cooling system.

N4TKD

Posted 2011-04-19T14:23:56.617

Reputation: 979

No, and in fact, there is an additional case fan and the CPU has the Noctua D-14 on it (http://noctua.at/main.php?show=productview&setlng=en&products_id=34). Temperature doesn't seem to be an issue. I ran an A/V overnight, and no problems. – Anteru – 2011-04-20T05:27:24.357

Sorry that did not lead to a resolution. In line with the other comments, you may want to look at what other devices are on the household circuit the machine is on. A heater or a hair dryer being turned on may be overtaxing the circuit and bringing the power level (amps) too low for the computer. – N4TKD – 2011-04-20T10:43:03.433

1

I have had similar problems with my own computers and others that I have fixed in the past. In more or less all cases where I have had similar behaviour to your system (lots of strange, seemingly unconnected problems), it has been due to one of the following two problems:

Bad power supply

Either the PSU has been outputting a fluctuating voltage, or the actual power supplied from the grid has fluctuated. Nowadays I never buy cheap PSUs, since I know how hard it can be to diagnose these kinds of problems. The wattage rating of the PSU is no guarantee that it is good, since it might still deliver fluctuating power (which is usually what matters). Try running some kind of monitoring program that can display the motherboard voltages on your computer (SpeedFan, for instance) and check whether they are stable and close to the expected values. If possible, try using a UPS so that you don't get any voltage fluctuations from the grid. A bad power supply also has a tendency to damage other components in the computer, which makes it even harder to debug.
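To make "stable and close to the expected values" concrete, here is a minimal sketch that checks voltage readings against the ±5% band the ATX specification allows on the positive rails. The readings are assumed to be typed in by hand from whatever monitoring tool is used; the function names and the sample values are made up for illustration, and software readouts from motherboard sensors are only approximate, so a multimeter is the better reference when in doubt.

```python
# Nominal rail voltages and the +/-5% tolerance that the ATX
# specification allows on the positive rails.
NOMINAL_RAILS = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}
TOLERANCE = 0.05


def out_of_tolerance(readings):
    """Return {rail: relative deviation} for rails outside the tolerance band."""
    bad = {}
    for rail, measured in readings.items():
        nominal = NOMINAL_RAILS[rail]
        deviation = (measured - nominal) / nominal
        if abs(deviation) > TOLERANCE:
            bad[rail] = round(deviation, 4)
    return bad


def worst_swing(samples, rail):
    """Peak-to-peak swing of one rail across a list of sampled readings."""
    values = [sample[rail] for sample in samples]
    return max(values) - min(values)


if __name__ == "__main__":
    # Hypothetical readings noted down at idle and under load.
    samples = [
        {"+3.3V": 3.31, "+5V": 5.02, "+12V": 12.10},
        {"+3.3V": 3.28, "+5V": 4.97, "+12V": 11.30},
    ]
    for sample in samples:
        print(out_of_tolerance(sample))
    print("12 V swing:", worst_swing(samples, "+12V"))
```

A rail that stays inside the band but swings widely between idle and load is still worth noting, which is why the sketch reports the peak-to-peak swing separately.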

Using RAM that is not recommended by manufacturer

Some motherboards are extremely choosy when it comes to RAM. Check with your motherboard manufacturer; they usually give very detailed recommendations on what to use (brand, size, serial-number). I have had this trouble even on a pre-assembled computer, where the people who assembled it apparently did not check this, since the RAM in it was listed as 'Not recommended'. It took me quite some time to figure this out. For some reason, memory tests do not always detect this.

Leo

Posted 2011-04-19T14:23:56.617

Reputation: 487