I used to do a good deal of hardware troubleshooting in large datacenters. I would recommend finding a bootable Linux distribution; any will do. Be sure to pick a 64-bit image if you have a 64-bit CPU. Stresslinux contains a whole suite of tools to stress your servers and force any hardware failures out into the open.
I'm partial to the "stress" tool: it can hammer hard disk(s), memory, and processor(s).
stress
A quick note about memtest86+
This is not part of the stress suite, but obviously you know it exists. Definitely make sure you're using memtest86+ (emphasis on +) - it handles 64 bit systems and large allocations of memory far better than the original memtest86.
Memory Testing
This will spawn n processes spinning on malloc() at 256MB each.
stress -m n &
So you would want to divide the amount of memory you have by 256MB (roughly) to hit all of it. This will flush out any obvious errors: you'll likely see EDACs or MCEs (depending on your processor/motherboard), or a kernel panic/hard crash. Ideally, let it run for a few hours so the hardware heats up under load. You can check syslog for those errors.
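The division above can be scripted. This is only a rough sketch: mem_workers is a hypothetical helper of mine (not part of stress), and the optional reserve argument is an assumption I added so the OS keeps some headroom instead of the OOM killer stepping in right away.

```shell
# Hypothetical helper (not part of stress): how many 256MB workers fit in a
# given amount of memory. Both arguments are in kB; the second is an assumed
# safety margin to leave for the OS (defaults to 0).
mem_workers() {
    total_kb=$1
    reserve_kb=${2:-0}
    per_worker_kb=$((256 * 1024))   # stress -m allocates 256MB per worker by default
    n=$(( (total_kb - reserve_kb) / per_worker_kb ))
    if [ "$n" -lt 1 ]; then
        n=1                         # always run at least one worker
    fi
    echo "$n"
}

# Usage on a live system (MemTotal in /proc/meminfo is reported in kB):
#   total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   stress -m "$(mem_workers "$total_kb" 524288)" &
```

Leaving 512MB free in the usage line is an arbitrary choice; pass 0 if you really want to try to touch everything.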
CPU Testing
This will spawn n processes spinning on sqrt().
stress -c n &
You'll want n to be the number of cores in your system. The same concept here applies as far as letting it run for a while.
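If you don't want to count cores by hand, something like the following should work. cpu_workers is a hypothetical wrapper name, and the 4h timeout in the usage line is just my assumption for "a while"; stress accepts time suffixes such as s, m, and h on --timeout.

```shell
# Hypothetical helper: print the number of online cores, using nproc when
# it is available and falling back to getconf otherwise.
cpu_workers() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN
    fi
}

# Usage: one sqrt() worker per core, stopping on its own after 4 hours
# (the duration is an assumption, adjust to taste):
#   stress -c "$(cpu_workers)" --timeout 4h &
```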
Disk Subsystem Testing
This requires the hard disks to be formatted in some way. If you are using RAID, you'll get better results if you've set that up before mounting the drives and stressing them.
Change directory to the partition/area of the disk you'd like to stress. The more free space the better.
cd /hard/disk/partition
This will spawn n processes spinning on write() at 1GB each.
stress -d n &
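Since each disk worker writes 1GB by default, n is bounded by the free space in the directory you cd'd into. A minimal sketch of picking n from df output follows. Both function names are hypothetical, and using only half of the free space is my own assumption, not a stress requirement.

```shell
# Hypothetical helper: read `df -Pk <dir>` output on stdin and print the
# "Available" column (kB) from the first filesystem line.
free_kb_from_df() {
    awk 'NR == 2 {print $4}'
}

# Hypothetical helper: number of 1GB disk workers for a given amount of
# free space in kB, using at most half of it (assumed safety margin).
disk_workers() {
    free_kb=$1
    per_worker_kb=$((1024 * 1024))  # stress -d writes 1GB per worker by default
    n=$(( free_kb / (2 * per_worker_kb) ))
    if [ "$n" -lt 1 ]; then
        n=1                         # always run at least one worker
    fi
    echo "$n"
}

# Usage from inside the partition you want to stress:
#   stress -d "$(disk_workers "$(df -Pk . | free_kb_from_df)")" &
```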
Monitor disk IO with:
iostat -x 5
Killing Stress
To kill all stress processes (yes, you can run the memory, CPU, and hard disk checks simultaneously, but that makes it a bit tougher to isolate a failing component):
pkill -9 -f stress
Verifying Your Tests
Once you've run all of these, you need to see some kind of result or confirmation.
Memory/CPU
You'll just want to check syslog for things like Machine Check Exception (MCE), Error Detection and Correction (EDAC), Out of Memory (OOM), etc.
zgrep -i -P ".*(error|warn|fail|panic|edac|mce|exception|oom-killer|oops).*" /var/log/kern.log* /var/log/dmesg /var/log/daemon.log*
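If you'd rather pipe live kernel output through the same filter (or your grep lacks -P), the keyword list works as a plain extended regex too. This is a sketch, and count_hw_errors is a hypothetical name of mine:

```shell
# Hypothetical helper: count the lines on stdin that mention any of the
# keywords from the zgrep command, case-insensitively, using a portable
# extended regex instead of -P.
count_hw_errors() {
    grep -c -i -E 'error|warn|fail|panic|edac|mce|exception|oom-killer|oops'
}

# Usage against the live kernel ring buffer:
#   dmesg | count_hw_errors
```

A non-zero count only tells you where to start reading; plenty of harmless boot messages contain "warn" or "fail".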
Disk Subsystem
Disk problems are probably easier to detect than those of any other component. smartctl is available on most *NIX operating systems as part of the smartmontools package. The following command requires root access.
The /dev/sda part may need to change if you have multiple disks or a RAID controller:
sudo smartctl -a /dev/sda
Below is some sample output from the full command; I'll point out what you should really be looking for farther down the page.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   099   099   016    Pre-fail  Always       -       65537
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       95
  3 Spin_Up_Time            0x0007   121   121   024    Pre-fail  Always       -       320 (Average 304)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   144   144   020    Pre-fail  Offline      -       28
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       15407
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       37
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       22 (Min/Max 15/31)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
The main attributes you want to be aware of are the following:
Reallocated_Sector_Ct
This should be less than 1000 in general.
These are sectors that had to be reallocated to another part of the hard disk because the previous sector wasn't healthy.
Temperature_Celsius
This should be less than 55. The output itself will show a min/max in some cases, but sub-55 is where we saw healthy disk operation.
Current_Pending_Sector
Must be 0.
Offline_Uncorrectable
Must be 0.
UDMA_CRC_Error_Count
Should be 0; a few are okay (less than 100).
Also note these can be indicative of a bad SATA/SAS cable.
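The rules of thumb above can be checked mechanically. Here is a rough sketch that reads smartctl -a output on stdin and compares the RAW_VALUE column (the tenth field) against those limits. check_smart is a hypothetical name, the limits are simply the ones listed above, and I'm assuming the raw value starts with a plain number, as it does in the sample output.

```shell
# Hypothetical helper: flag SMART attributes whose raw value reaches the
# rule-of-thumb limits from this answer. Prints one WARN line per problem
# and returns non-zero if anything was flagged.
check_smart() {
    awk '
        BEGIN { bad = 0 }
        # Attribute rows start with a numeric ID and have at least 10 fields.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            name = $2
            raw = $10 + 0          # numeric prefix of RAW_VALUE
            limit = -1
            if (name == "Reallocated_Sector_Ct")       limit = 1000
            else if (name == "Temperature_Celsius")    limit = 55
            else if (name == "Current_Pending_Sector") limit = 1
            else if (name == "Offline_Uncorrectable")  limit = 1
            else if (name == "UDMA_CRC_Error_Count")   limit = 100
            if (limit >= 0 && raw >= limit) {
                print "WARN " name " raw=" raw " (want < " limit ")"
                bad = 1
            }
        }
        END { exit bad }
    '
}

# Usage (requires root, adjust the device as needed):
#   sudo smartctl -a /dev/sda | check_smart
```

No output and a zero exit status means none of the five attributes crossed its limit; it says nothing about the attributes that aren't listed.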
There are many ways to stress and benchmark your servers; this is just the one I'm comfortable with and have had great results with.
I hope this helps!