How to diagnose Windows hangs - which resource is blocked?

Question

We have a Windows 7 Pro machine running Apache/PHP/Postgres, processing Ajax requests under a constant load (several every second). It's also running various other applications, which perform lots of disk writes.

Usually Ajax responses are received in under a second, but occasionally (about once in a 24h period) no responses are sent for up to 15 seconds, and then they are all sent together at the end, i.e. the server appears to be blocked for up to 15 seconds. This causes the Ajax requests to time out on the client side.

Logs from Apache and other applications back this up. Perfmon shows various counters drop to zero/near-zero - HD activity, CPU activity, network activity, etc. httpd#1 seems to be the only process which still has some CPU activity, albeit reduced.

How can I determine the cause of the hang? Can perfmon or another tool tell me what resource is blocking? (Is the 'Windows Performance Toolkit' or 'Process Monitor' any good for this?)

NB Apache has ample threads, postgres ample connections, CPU and RAM are not maxed out, and we've tried power options, drivers, sfc /scannow, chkdsk /r, memtest, etc.

Update 22/03/2013 10:26:

Thanks for all your responses so far. More information:

Hardware:

Chassis: Westek 2U Rack Mount
Motherboard: Intel Q35 1333FSB (5xPCI, 2xPCI-E, SATA II I/F, VGA I/F, 2xRS232, etc)
RAM: 2x 2GB DDR2 PC2-5300 non-ECC CL4 240 pin Memory Module (3GB usable as 32-bit OS)
Processor: Intel Core2 Quad Q9550 2.83GHz 1066FSB 12MB Cache
Storage: 2x Hitachi 320GB SATA 16MB Cache 7200 NCQ in SATA-II RAID Box - Intel Raid 1, NTFS
Power: 2x 400W PSU - dual redundant
Modem: StarTech external v.92 56k USB Fax Modem
PCI card: Telephony card

OS:

Windows 7 Pro SP1 32-bit

Advanced Performance Options:

(System Properties > Advanced > Performance > Settings > Advanced)

Processor scheduling: best performance for programs
Virtual memory: Automatically manage paging file size for all drives
- Total paging size for all drives:
- Minimum allowed: 16 MB
- Recommended: 4591 MB
- Currently allocated: 3061 MB

Update 22/03/2013 11:46:

Screenshot from perfmon:

http://i46.tinypic.com/fndyit.png (I don't have enough reputation to embed it in the post)

The period during which the server is unresponsive is 07:44:15 - 07:44:22 - whilst the CPU drops below 20%. (NB this is from another server with weaker CPU and older unoptimized software - usually CPU is not this high!)

Update 04/04/2013 16:53:

We found the culprit - the HDD. Only took a month!

How we got there:

Process Monitor confirmed that the disk was blocking on all writes during the incidents. We first tried updating the RAID drivers. This improved things (the CPU, etc. no longer dropped completely to zero), but the disk was still blocking. We then tried disabling RAID, which had no effect. We tried reducing disk usage by disabling various logging, and this helped. Finally we swapped the HDD for another (of lower spec), using the image from the first, and the problem completely disappeared.

So what was wrong with our HDD?

The disk we were using was a "Hitachi TravelStar 7k500 (Enhanced Availability variant)". It appears that the duty cycle has been limited to ensure ‘Enhanced Availability’ for this model, which may make it unsuitable for particularly heavy disk usage. According to Resource Monitor, our disk usage is around 400KB/sec.

Why do you have a consumer OS acting as a server? Even though they run the same code (mostly), they are optimized for different things, so the performance of background processes on Windows 7 won't be as good as it would on the same hardware running Windows Server. — mfinni, Mar 21 '13 at 13:37
I understand comments about Windows 7, but still wouldn't expect 15 seconds hangs under moderate load. — Sam, Apr 09 '13 at 14:54

score 0 · Answer 1 · answered Mar 21 '13 at 13:31

This really sounds like a storage issue. What kind of storage are you using for the pagefile?

Otherwise, the best tool I know for diagnosing that kind of issue is Process Monitor (procmon) from Sysinternals (now part of Microsoft). It can run long capture sessions, but you'll need a way to identify the exact time frame in which the issue occurs, particularly if you're capturing the whole system. If it's not a page file issue, procmon will most likely let you find the culprit.
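One way to pin down that time frame is to leave a small probe running that times a forced disk write and logs a timestamp whenever the write takes too long. The sketch below is only an illustration under assumptions: the function names, the 4 KB payload, and the one-second threshold are hypothetical choices, not anything procmon provides, and the probe file should live on the disk you suspect.

```python
import os
import time
from datetime import datetime


def timed_write(path, payload=b"x" * 4096):
    """Write `payload` to `path`, force it to disk, and return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return time.perf_counter() - start


def watch_for_stalls(path, threshold_s=1.0, interval_s=0.5, iterations=None):
    """Repeat timed_write and collect (start_time, seconds) for slow writes.

    Runs forever when `iterations` is None; all defaults are assumptions.
    """
    stalls = []
    count = 0
    while iterations is None or count < iterations:
        started_at = datetime.now()
        elapsed = timed_write(path)
        if elapsed > threshold_s:
            stalls.append((started_at, elapsed))
            print(f"{started_at.isoformat()} write took {elapsed:.2f}s")
        time.sleep(interval_s)
        count += 1
    return stalls
```

The printed timestamps can then be used to narrow a long procmon capture down to the few seconds around each stall.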

Stephane


Have updated my question to include hardware. We're running some HDD diagnostics (HDTune), and may try imaging to another HDD. We've tried the same tests on two other servers, one built to the same spec and one not. The one built to the same spec reproduced the problem and the other did not. – Sam Mar 22 '13 at 10:45

mfinni · Accepted Answer · 2013-03-22T14:34:46.667

Yes, Perfmon can monitor the performance of just about everything. The problem is that you need to know where to look. The defaults are a good starting point, but for real problems, you need to put in some work to figure it out.

Assuming local storage, check the PhysicalDisk\Avg. Disk Queue Length in PerfMon. If it goes higher than your number of spindles, your storage system is a (or the) bottleneck. Describe your hardware for us, too.
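If the counter has been logged to CSV (for example with typeperf or relog), the check above can be automated. The following is a minimal sketch, assuming the usual PDH-CSV layout in which the first column is the timestamp and each counter column header ends with the counter name; the function name is hypothetical and the sample rows in the usage comment are made up for illustration.

```python
import csv
import io


def find_queue_spikes(csv_text, spindles,
                      counter_suffix="Avg. Disk Queue Length"):
    """Return (timestamp, value) pairs where the counter exceeds `spindles`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Pick the first column whose header ends with the counter name.
    col = next((i for i, name in enumerate(header)
                if name.endswith(counter_suffix)), None)
    if col is None:
        raise ValueError(f"no column ending with {counter_suffix!r}")
    spikes = []
    for row in reader:
        if len(row) <= col:
            continue  # truncated row
        try:
            value = float(row[col])
        except ValueError:
            continue  # blank or missing sample
        if value > spindles:
            spikes.append((row[0], value))
    return spikes


# Usage with a hypothetical log file name:
# with open("queue.csv", newline="") as f:
#     for when, value in find_queue_spikes(f.read(), spindles=2):
#         print(when, value)
```

Comparing the timestamps it reports with the times of the hangs shows whether the queue is backing up at the same moments.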

/edit There you go. Your disk queue length climbs way above "2" (the number of slow spindles that you have) pretty often, and it is at that level during the period you name. CPU usage drops at the same time, probably because the processes are waiting on I/O and have nothing else to do.

Potential improvements:

Naively: move the storage to more and/or faster disks. RAID 10, perhaps.
Smarter: benchmark what's hitting the disk system and split those workloads onto different spindles, or onto different servers entirely. Typically, one doesn't want the website or other front-ends sharing too many resources with a SQL database backend; the two types of processes have wildly different performance characteristics.

I've updated my question to include my hardware and a perfmon screenshot including Avg Disk Queue Length. — Sam, Mar 22 '13 at 11:49
I accepted your answer as you suggested changing the disk. It turned out our disk was the culprit - see my updated question. — Sam, Apr 04 '13 at 15:58
