
A single server has started to occasionally leave zombie w3wp.exe processes behind when recycling. A new process is spawned properly and everything seems to work, except that the old processes stick around and take up memory. Task Manager reports only a single thread left in them, far fewer than the 40-70 threads an active worker process usually has.

Using ProcDump I've taken a full memory dump to analyze further in WinDbg. The machine is a Windows Server 2008 R2 x64 8-core machine, as WinDbg confirms:

Windows 7 Version 7600 MP (8 procs) Free x64
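
For reference, a full dump like this can be captured with ProcDump along the following lines; 1234 stands in for the PID of the zombie w3wp.exe, the dump file name is arbitrary, and -ma tells ProcDump to write a dump that includes all process memory:

procdump -ma 1234 w3wp_zombie.dmp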

After loading SOS, a printout of the managed threads reveals the following:

0:000> !threads
ThreadCount: 19
UnstartedThread: 0
BackgroundThread: 19
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
                                              PreEmptive                                                Lock
       ID OSID        ThreadOBJ     State   GC     GC Alloc Context                  Domain           Count APT Exception
XXXX    1  9d0 000000000209b4c0      8220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 Ukn
XXXX    2  c60 00000000020c3130      b220 Enabled  000000013fbe5ed0:000000013fbe7da8 000000000208e770     0 MTA (Finalizer)
XXXX    3  a24 00000000020f0d60   880a220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 MTA (Threadpool Completion Port)
XXXX    4  97c 0000000002105180    80a220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 MTA (Threadpool Completion Port)
XXXX    5  c28 000000000210bfe0      1220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 Ukn
XXXX    6  d40 00000000053f9080   180b220 Enabled  00000001bfe75d20:00000001bfe767c8 000000000208e770     0 MTA (Threadpool Worker)
XXXX    7  c18 00000000053f9b30   180b220 Enabled  00000000fff95880:00000000fff97210 000000000208e770     0 MTA (Threadpool Worker)
XXXX    8  f7c 00000000053fa5e0   180b220 Enabled  000000011fbea268:000000011fbea920 000000000208e770     0 MTA (Threadpool Worker)
XXXX    9  91c 00000000053fb090   180b220 Enabled  00000001dfc39138:00000001dfc39670 000000000208e770     0 MTA (Threadpool Worker)
XXXX    a  fb0 00000000053fbd20   180b220 Enabled  00000000fff922b0:00000000fff93210 000000000208e770     0 MTA (Threadpool Worker)
XXXX    b  fc8 00000000053fc9b0   180b220 Enabled  0000000160053ea0:0000000160054778 000000000208e770     0 MTA (Threadpool Worker)
XXXX    c  538 00000000053fd460   180b220 Enabled  000000017fd8fc98:000000017fd911f8 000000000208e770     0 MTA (Threadpool Worker)
XXXX    d  604 00000000053fdf10   180b220 Enabled  000000019fd7aa78:000000019fd7c648 000000000208e770     0 MTA (Threadpool Worker)
   0    f  2cc 0000000005514c60       220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 Ukn
XXXX   10  9bc 00000000020a90c0       220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 Ukn
XXXX   11  9c0 00000000056b7a00       220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 Ukn
XXXX    e  9d4 00000000056b7fd0       220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 Ukn
XXXX   12  9d8 00000000056b85a0       220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 Ukn
XXXX   13  cb8 00000000056b8b70       220 Enabled  0000000000000000:0000000000000000 000000000208e770     0 Ukn
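
For completeness, SOS can be loaded straight from the runtime that is already mapped in the dump; which module to load it from depends on the framework version, roughly:

$$ .NET 2.0/3.5 (mscorwks-based runtime)
.loadby sos mscorwks
$$ .NET 4 (clr-based runtime)
.loadby sos clr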

More interesting, however, is probably the stack backtrace of the single unmanaged thread remaining:

0:000> ~* kb 2000

.  0  Id: 85c.2cc Suspend: -1 Teb: 000007ff`fffd3000 Unfrozen
RetAddr           : Args to Child                                                           : Call Site
000007fe`fdcc1843 : 00000000`00fd6b60 00000000`00fd6b60 ffffffff`ffffffff 00000000`77bc04a0 : ntdll!ZwClose+0xa
00000000`77ab2c41 : 00000000`77bc1670 00000000`00000000 00000000`77bc04a0 7fffffff`ffffffff : KERNELBASE!CloseHandle+0x13
000007fe`f56537c6 : 00000000`00000000 00000000`00000000 00000000`012da080 000007fe`f5442eac : kernel32!CloseHandleImplementation+0x3d
000007fe`f54443d2 : 00000000`00000007 000007fe`f5443d3c 00000000`00000000 00000000`77bc9997 : httpapi!HttpCloseRequestQueue+0xa
000007fe`f54444c3 : 00000000`00000000 00000000`012e6900 00000000`00000000 00000000`77bd5afa : w3dt!UL_APP_POOL::Cleanup+0x62
000007fe`f549384a : 00000000`012da080 00000000`00c93a28 00000000`012e6900 00000000`00000000 : w3dt!WP_CONTEXT::CleanupOutstandingRequests+0x83
000007fe`f549417a : 00000000`00000000 00000000`0000ffff 00000000`00000000 00000000`77bcf9fd : iiscore!W3_SERVER::StopListen+0x4a
000007fe`f562b5bf : 00000000`012d2f30 00000000`00000000 00000000`00000000 00000000`0000ffff : iiscore!IISCORE_PROTOCOL_MANAGER::StopListenerChannel+0x5a
000007fe`f5626e8f : 00000000`012d2f30 00000000`00000000 00000000`00424380 00000000`00000000 : w3wphost!LISTENER_CHANNEL_STOP_WORKITEM::ExecuteWorkItem+0x7b
00000000`77bcf8eb : 00000000`021782b0 00000000`021782b0 00000000`00000000 00000000`00000001 : w3wphost!W3WP_HOST::ExecuteWorkItem+0xf
00000000`77bc9d9f : 00000000`00000000 00000000`012d2f30 00000000`00424380 00000000`010aa528 : ntdll!RtlpTpWorkCallback+0x16b
00000000`77aaf56d : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!TppWorkerThread+0x5ff
00000000`77be3281 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

From the stack trace it's obvious that the w3wp process is trying to shut down and perform its cleanup tasks, but for some reason the call to ntdll!ZwClose is hanging. It's been stuck like this for several days without change - and without any apparent side effects besides the increased memory usage.
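
For what it's worth, the process handle table can also be inspected from the dump (assuming handle data was captured with it), which might reveal what kind of handle the stuck close call is operating on:

$$ List all handles in the process
!handle 0 0
$$ Dump all available details for every handle (verbose)
!handle 0 f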

The w3wp processes do not hang every time, and I have yet to find a reproducible pattern. In the meantime, any suggestions for further debugging?

Mark S. Rasmussen
  • Were you ever able to determine what this is from? I'm seeing the same behavior and so far I've found nothing that helps. – mpeterson Mar 20 '11 at 21:11
  • Unfortunately I haven't been able to solve this yet. I'm also seeing this on multiple servers (clean installs), so either it's our code, or we're hitting the same bug on multiple servers. I've yet to open a PSS case for it as it's not a critical issue for us, just an annoyance - especially so since I can't explain what's going on. – Mark S. Rasmussen Mar 29 '11 at 07:48

3 Answers


Impressive research.

Check RSCA to see if it still has a handle to that app pool and whether any pages are still running in it. It may turn up a pattern or a lead. You can drill into that at the top level of IIS Manager: open "Worker Processes" and then double-click the app pool if it shows up there.
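
The same information is also exposed through appcmd, so you can check from the command line whether IIS still knows about the process and whether any requests are stuck in it - roughly something like this, where 1234 is a placeholder for the worker process PID:

rem List the worker processes IIS currently knows about (each wp is named by its PID)
%windir%\system32\inetsrv\appcmd list wp
rem List requests currently executing in a given worker process
%windir%\system32\inetsrv\appcmd list requests /wp.name:"1234"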

Scott Forsyth
  • What's RSCA? The IIS manager does not show the hung w3wp processes so it seems to be detached / not serving any appdomains any longer. – Mark S. Rasmussen Nov 02 '09 at 10:00
  • Runtime Status Control API (RSCA) is new in IIS7. Even though the acronym mentions that it's an API, it's what's used in IIS 7 manager too. It sounds like it's not going to help in this case though since you confirmed that it doesn't show the app pool in IIS Manager. – Scott Forsyth Nov 02 '09 at 17:01

Did the problem appear at the same time as a new version of the web site was deployed?

As the worker process shuts down, its objects are cleaned out of memory. If a developer has written code that runs when an object is finalized / disposed, and that code throws an exception, the object will not be removed from memory. If not all objects can be removed from memory, this may block the worker process from shutting down.
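
If that is what is happening, it should show up in the memory dump; a rough sketch of the SOS commands to try:

$$ Objects registered for / still waiting on finalization
!finalizequeue
$$ Managed stack of every thread - shows whether anything is stuck in a finalizer or Dispose call
~*e !clrstack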

Then there is the question of why it does not happen every time. It could be that this code lives in a part of the system that is not used often, so the type of object that causes the problem is not always present.

The way to test this would be to:

  • start the system
  • exercise a small part of it
  • manually recycle the application pool
  • check for a zombie process
  • if none appears, exercise a different part of the system
  • ....

You could also check with the developers if they have any special code for cleaning up objects.

Shiraz Bhaiji
  • Not a new version. It's a brand new server that's set up with the same sites as our other servers; this happened from the beginning. If it were some of our code hung in finalization, shouldn't there be a managed thread? There's only a single thread left and it's unmanaged. – Mark S. Rasmussen Nov 02 '09 at 10:01
  • Then it is something specific to that server installation. Could be related to lack of Windows Update / Configuration / User rights. – Shiraz Bhaiji Nov 02 '09 at 13:58

In IIS 7.0, the WWW Service no longer manages worker processes. Instead, it acts as the listener adapter for the HTTP listener, HTTP.sys. As the listener adapter, the WWW Service is primarily responsible for configuring HTTP.sys, updating it when configuration changes, and notifying WAS when a request enters the request queue.
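
Since the stack in the question is stuck in HttpCloseRequestQueue, it might also be worth asking HTTP.sys directly what it thinks its request queues look like, e.g.:

rem Show HTTP.sys state grouped by request queue
netsh http show servicestate view=requestq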

What specifically are you running on this server? Are the application pools in integrated mode or classic mode?

Nasa
  • I'm running ~500 ASP.NET applications across three identical application pools. Using integrated mode, running under a custom AD user that has all necessary rights through aspnet_regiis -ga. No other services or applications are running. – Mark S. Rasmussen Nov 13 '09 at 11:18
  • Does that same thing happen when you run the applications in classic mode? – Nasa Nov 13 '09 at 13:46
  • Sorry for the slow response. I'm not able to test it in classic mode unfortunately as the app's written to run in integrated mode. – Mark S. Rasmussen Dec 10 '10 at 08:02