What to check during a periodic system health check

Question

I have been tasked with preparing a list of checks to do as a part of a weekly system health check routine which my team is supposed to do. The problem is neither I nor any of my colleagues have ever been a professional system administrator and the best we can come up with is pretty laughable.

The system runs Siemens SIMATIC IT and LIMS, but I’m interested in some generic checks/tests for operating systems and database servers. Someone else will take care of tests specific to the applications being run.

The setup is as follows:

All servers are virtual, running in the vSphere5 environment.

Web server – MS Windows Server 2003 R2
Two servers running SIMATIC IT components, one for Historian and one for the Production Modeler and other components – MS Windows Server 2003 R2
Database server – MS Windows Server 2003 R2 + MS SQL Server 2005
Database + LIMS server – MS Windows Server 2008 R2 + Oracle Database 11g

Most probably we will not get access to the vCenter console, so the idea is to connect to those servers over Remote Desktop, run some constructive checks/tests and prepare a report.

As I wrote already, there is not much I can come up with besides checking free disk space. I can also think of checking file system fragmentation and file system errors with ChkDsk, looking through the Windows Event Viewer for important errors and warnings, checking index fragmentation in the databases, and maybe collecting statistics on response times and execution times of some important queries.

I will greatly appreciate any help. Besides information about what should be checked, hints for what not to do on a system that is under load 24/5 will also be very helpful. For example running a defragmenter even just for analysis on a database server under load might be a very bad idea, but I don’t know it yet.

Thank you.

I agree with voretaq below, but I'd like to append to his list: take a regular look at the event logs in Windows as well. They are there to help you, use them! Monitoring is a much better way to go, and there are both free and commercial products that do a good job (we use Orion NPM and Spiceworks, FWIW) — MikeAWood, Jan 26 '13 at 01:27
@MikeAWood In an ideal world you would centralize the logs and have an automated process on the log-collecting server notify you when it receives something unusual. Periodic checks can be useful too, but I usually don't look at my logs unless I suspect a problem (and of course during routine maintenance/patching windows, so I know what "normal" looks like) — voretaq7, Jan 28 '13 at 18:54
@voretaq7 I am just as bad about checking the logs unless I am looking for something or have a reason to check. One of those bits of advice many of us give but rarely follow. — MikeAWood, Feb 01 '13 at 22:56

score 9 · Accepted Answer · edited Apr 13 '17 at 12:14

You are being asked to do it wrong.

You should not be logging in to production systems and doing periodic manual checks.
This guarantees that you will (a) miss something that happens between the checks and takes your business down, and (b) eventually screw up while doing the checks and take the business down.

Instead, you should be implementing a monitoring system that does continuous periodic checks (every 5-10 minutes) and reports anomalies to you. See the monitoring tag for more information and ideas on what to check.

Disk space, swap utilization, and CPU load (RunQ depth) are typical things to monitor. You may also want to perform (and time/check the output of) standard test queries on database servers (these queries are something you have to create based on your environment).
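As a rough illustration of what one such check does on each cycle, here is a minimal Python sketch: take a reading, time it, and compare it against limits. The function names and the thresholds (10% free disk, 5 seconds per check) are assumptions made up for this example, and a real monitoring product would handle the scheduling and notification for you.

```python
import shutil
import time


def disk_free_fraction(path):
    """Return the fraction of free space on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def timed(check):
    """Run `check()` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = check()
    return result, time.perf_counter() - start


def find_anomalies(readings, limits):
    """Compare each reading against its (low, high) limits; return alert messages."""
    alerts = []
    for name, value in readings.items():
        low, high = limits[name]
        if value < low or value > high:
            alerts.append(f"{name}={value:.2f} outside [{low}, {high}]")
    return alerts


if __name__ == "__main__":
    # Hypothetical thresholds; tune them to your own environment.
    limits = {"disk_free_fraction": (0.10, 1.0), "check_seconds": (0.0, 5.0)}
    fraction, elapsed = timed(lambda: disk_free_fraction("."))
    readings = {"disk_free_fraction": fraction, "check_seconds": elapsed}
    for alert in find_anomalies(readings, limits):
        print("ALERT:", alert)
```

The same `timed` wrapper could be put around a standard test query against a database server, so that both the result and the duration are compared against what you know to be normal.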

score 1 · Answer 2 · answered Feb 19 '13 at 11:01

For Servers running on Windows OS, important checks could be:

CPU Utilisation.
RAM Utilisation.
Hard Disk space Available.
Web Server (IIS) Service running or not.

From networking perspective:

DNS is configured correctly and name resolution works
The server gets the expected IP address from DHCP

This might be useful ...

score 0 · Answer 3 · answered Jan 28 '13 at 21:24

I would add something else to the list, because this is a Web server.

Set up a scheduled task to count the number of "200", "500", "401", and "503" responses in the IIS logs; you can use LOGPARSER to do this. The idea is that the script counts the occurrences of each status, then divides the number of 500 and 503 responses by the number of 200 responses. This gives you a rough measure of the web server's health as a ratio of failures (500 and 503) to successes (200).
- 500 - Error - the web call failed
- 503 - Service Unavailable - the server could not handle the request (for example, the application pool is stopped or overloaded)
- 401 - Unauthorized - the web call didn't authenticate
- 200 - Success - the web call was processed with no errors thrown

Then the script should upload the results (including the raw data) to a central reporting system, so that you can examine them without having to log in locally.
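To make the counting step concrete, here is a minimal sketch in Python rather than LOGPARSER. It assumes the logs are in the W3C extended format, where a `#Fields:` directive names the columns and the status code is in `sc-status`. The function names are hypothetical, and the upload to a reporting system is left out.

```python
from collections import Counter


def count_statuses(lines):
    """Count HTTP status codes in W3C extended log lines."""
    counts = Counter()
    status_index = None
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive lists the column names for the entries that follow.
            fields = line[len("#Fields:"):].split()
            status_index = fields.index("sc-status") if "sc-status" in fields else None
        elif line and not line.startswith("#") and status_index is not None:
            parts = line.split()
            if len(parts) > status_index:
                counts[parts[status_index]] += 1
    return counts


def failure_ratio(counts):
    """Return (500 + 503) / 200, or None when there were no 200 responses."""
    successes = counts["200"]
    failures = counts["500"] + counts["503"]
    return failures / successes if successes else None


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    sample = [
        "#Fields: date time cs-method cs-uri-stem sc-status",
        "2013-01-28 21:00:00 GET /index.htm 200",
        "2013-01-28 21:00:01 GET /report.aspx 500",
        "2013-01-28 21:00:02 GET /index.htm 200",
    ]
    counts = count_statuses(sample)
    print(dict(counts), failure_ratio(counts))
```

Returning `None` when there are no 200 responses avoids a division by zero on a quiet or completely failing server; the caller can decide whether that case deserves an alert of its own.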

If you need more in-depth examination of the logs (say, what app pool is doing badly if applicable) there are many other things you can throw at LOGPARSER to dig this stuff up.
