A practical way to health-check a new server?

Question

I am trying to health-check a new server in our data center, as a first round of QA before any application team starts using it.

"What is the practical way and tools to check that the server is fine to use?"

As for tools, I am looking at memtest86 to test the memory and at the IOzone Filesystem Benchmark. But I expect there are more tests I should run, and better tools than these two.

score 6 · Accepted Answer · answered Oct 31 '13 at 23:42

I used to do a good deal of hardware troubleshooting in large datacenters. I would recommend finding a bootable Linux distribution; any will do. Be sure to pick one with a 64 bit image if you have a 64 bit CPU. Stresslinux contains a whole suite of tools to stress your servers and force any hardware failures out into the open.

I'm partial to the "stress" tool - it can hammer hard disk(s), memory, and processor(s).

stress

A quick note about memtest86+

This is not part of the stress suite, but obviously you know it exists. Definitely make sure you're using memtest86+ (emphasis on +) - it handles 64 bit systems and large allocations of memory far better than the original memtest86.

Memory Testing

This will spawn n processes spinning on malloc() at 256MB each.

stress -m n &

So you would want to divide the amount of memory you have by 256MB (roughly) to hit all of it. This will flush out any obvious errors: you'll likely see EDACs or MCEs (depending on your processor/motherboard), or a kernel panic/hard crash. Ideally you can let it run for a few hours to get the system to heat up under load. You can check syslog for those errors.
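The division above can be sketched in shell. This is only an illustration: the function name vm_workers and the 512 MB of headroom left for the OS are my assumptions, not something stress provides. The 256 MB per worker is the figure given above.

```shell
# Hypothetical helper: how many 256 MB stress workers fit in RAM.
# $1 = total memory in kB (as reported by MemTotal in /proc/meminfo)
# $2 = headroom in MB to leave for the OS (assumed default: 512)
vm_workers() {
    total_kb=$1
    reserve_mb=${2:-512}
    usable_mb=$(( total_kb / 1024 - reserve_mb ))
    workers=$(( usable_mb / 256 ))
    # Always run at least one worker, even on very small machines.
    [ "$workers" -lt 1 ] && workers=1
    echo "$workers"
}

# Usage on a Linux host:
#   total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   stress -m "$(vm_workers "$total_kb")" &
```

Pass 0 as the second argument if you really want to cover all of the memory, but expect the OOM killer to step in sooner.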

CPU Testing

This will spawn n processes spinning on sqrt().

stress -c n &

You'll want n to be the number of cores in your system. The same concept here applies as far as letting it run for a while.
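If you don't know the core count offhand, it can be looked up rather than typed in. A minimal sketch, assuming a Linux host; cpu_workers is a made-up name, and the two-hour timeout is just an example value.

```shell
# Hypothetical helper: number of online CPU cores, used as n for stress -c.
cpu_workers() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        # Fallback for systems without coreutils' nproc.
        getconf _NPROCESSORS_ONLN
    fi
}

# Usage: one worker per core, stopping by itself after two hours.
#   stress -c "$(cpu_workers)" --timeout 2h &
```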

Disk Subsystem Testing

This requires the hard disks to be formatted in some way. If you are using RAID, you'll get better results if you set that up before mounting the drives and stressing them.

Change directory to the partition/area of the disk you'd like to stress. The more free space the better.

cd /hard/disk/partition

This will spawn n processes spinning on write() at 1GB each

stress -d n &
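Since each worker writes 1GB, the free space on the partition puts a ceiling on how many workers make sense. A rough sketch of that arithmetic; hdd_workers is a hypothetical name, and the input is assumed to be the "Available" column of df -Pk.

```shell
# Hypothetical helper: upper bound on disk workers for a partition.
# $1 = available space in kB (the "Available" column of `df -Pk`)
hdd_workers() {
    avail_kb=$1
    # Each worker writes 1 GB (1048576 kB), per the description above.
    echo $(( avail_kb / 1048576 ))
}

# Usage from inside the partition under test:
#   avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')
#   stress -d "$(hdd_workers "$avail_kb")" &
```

A result of 0 means there is less than 1GB free, so pick a roomier partition.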

Monitor disk IO with:

iostat -x 5

Killing Stress

You can run the memory, CPU, and hard disk checks simultaneously, but that makes isolating a failing component a bit tougher. To kill all stress processes:

pkill -9 -f stress

Verifying Your Tests

Once you have run all of these, you need some kind of result or confirmation.

Memory/CPU

You'll just want to check syslog for things like Machine Check Exception (MCE), Error Detection and Correction (EDAC), Out of Memory (OOM), etc.

zgrep -i -P ".*(error|warn|fail|panic|edac|mce|exception|oom-killer|oops).*" /var/log/kern.log* /var/log/dmesg /var/log/daemon.log*
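On a system where those log files don't exist (a live image that only logs to the kernel ring buffer or the journal, for example), the same keyword list can be applied to a stream instead. A minimal sketch; kernel_errors is a made-up name, and the journalctl line assumes a systemd-based distribution.

```shell
# Hypothetical helper: keep only lines that mention a hardware-related keyword.
# Reads log text on stdin; the keyword list is copied from the zgrep command above.
kernel_errors() {
    grep -i -E 'error|warn|fail|panic|edac|mce|exception|oom-killer|oops'
}

# Usage:
#   dmesg | kernel_errors
#   journalctl -k --no-pager | kernel_errors
```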

Disk Subsystem

This is probably the easiest component to check. smartctl is part of the smartmontools package and is usually installed on *NIX operating systems.

The /dev/sda aspect may need to change if you have multiple disks or a RAID controller. The following command requires root access:

sudo smartctl -a /dev/sda

Below is some sample output from that command. I'll specify what you should really be looking for farther down the page.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   099   099   016    Pre-fail  Always       -       65537
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       95
  3 Spin_Up_Time            0x0007   121   121   024    Pre-fail  Always       -       320 (Average 304)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   144   144   020    Pre-fail  Offline      -       28
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       15407
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       37
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       22 (Min/Max 15/31)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
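If you are checking many disks, reading that table by eye gets old. The raw value in the last column can be pulled out with awk; here is a small sketch. smart_raw is a hypothetical helper name, and it assumes the ten-column layout shown above.

```shell
# Hypothetical helper: print the RAW_VALUE of one SMART attribute.
# Reads an attribute table like the one above on stdin; $1 = ATTRIBUTE_NAME.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage (needs root, like the smartctl command above):
#   sudo smartctl -A /dev/sda | smart_raw Reallocated_Sector_Ct
```

For attributes whose raw value carries extra text, such as "22 (Min/Max 15/31)", this returns only the leading number.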

The main attributes you want to be aware of are the following:

Reallocated_Sector_Ct

This should be less than 1000 in general.

These are sectors that had to be reallocated to another part of the hard disk because the previous sector wasn't healthy.

Temperature_Celsius

This should be less than 55 (degrees Celsius). The raw value will include a min/max in some cases, but sub 55 is where we saw healthy disk operation.

Current_Pending_Sector

Must be 0.

Offline_Uncorrectable