
We recently had an issue on our live server that caused our Web App to stop responding. All we were getting were 503 errors until we rebooted the server, after which it was fine. Eventually I traced it back to the httperr.log and found a whole lot of 1_Connections_Refused errors.

Further investigation seemed to indicate that we had reached the nonpaged pool limit. Since then we have been monitoring the nonpaged pool memory using Poolmon.exe and we believe we have identified the tag that is causing the problem.

Tag   Type    Allocs       Frees       Diff       Bytes      Per Alloc
Even  Nonp  51,231,806   50,633,533   684,922   32,878,688      48

If we use poolmon.exe /g it shows the Mapped Driver as [< unknown >Event objects].

This is pretty much no help at all. My team has spent considerable time researching this problem and hasn't been able to find a process to narrow this down to a specific application or service. I get the sense that most people solve the problem by killing processes on the machine until they see the nonpaged memory reset. This is not exactly what you want to do on a production machine.

If I open up Task Manager and view the process list, I see MailService.exe with an NP Pool value of 105K, which is 36K higher than the value of the process listed second. As we have had some problems with our Mail Server in the past (which may or may not be related to this issue), my gut feeling is that this is causing the issue.

However, before we go off restarting services, I'd like to have a little more certainty than just a "gut feeling".

I've also tried using poolmon.exe /c but this always returns the error:

unable to load msvcr70.dll/msvcp70.dll

and it doesn't create localtag.txt. My colleague had to download pooltag.txt from the internet because we can't figure out where it is located. We don't have win debugger or the win DDK installed (that I can see). Maybe the above error is given because we don't have either of these installed - but I don't know.

Finally I tried:

C:\windows\system32\drivers>findstr /m /l Even *.sys

This returned a fairly sizeable list of .sys files and again wasn't at all helpful with the problem at hand.

So my question is this: Is there any other way to narrow down the cause of this memory leak?

UPDATE:

As suggested below, I have been logging the Pool Nonpaged Bytes for the last day or so to see if any process is trending up. For the most part all of the processes appear to be fairly static in their usage. Two of them look to have ticked up slightly. I will continue to monitor this for the next few days.

I also forgot to mention earlier that none of the processes appear to be using an excessive number of handles either.

UPDATE 2:

I have been monitoring this for the last couple of weeks. Both the Nonpaged Bytes Pool for individual processes and the total Nonpaged Bytes Pool have remained relatively stable during that time. During this time Windows was updated and the server rebooted so I am wondering if that has solved the problem. I am definitely not seeing the consistent growth in the Nonpaged Bytes Pool now that I was prior to this.

Developer
  • Why not use perfmon to monitor Pool Nonpaged Bytes for all processes and look for the process with runaway nonpaged pool memory? – joeqwerty Nov 30 '11 at 12:54
  • I have just had a bit of a play with Performance Monitor and set it up to do as you have suggested. However it doesn't really tell me anything that I didn't already know from looking at Task Manager. MailService has the highest usage of Nonpaged Pool but it is only at 106K. So it's not exactly the smoking gun I was looking for. – Developer Dec 01 '11 at 02:42
  • Look for increasing Pool Nonpaged Bytes in the processes over time. It may not be readily apparent by taking a quick view of the usage by process at any one moment in time. You can easily capture the usage over time by setting up a Counter log to save to a CSV file and open that with Excel to analyze escalating usage per process. Any process that exhibits a 10% or more increase in Pool Nonpaged Bytes from system startup is leaking memory and is likely the process causing the problem – joeqwerty Dec 01 '11 at 02:54
  • A handy tool to help capture and analyze the relevant counter data is the PAL tool, found here: http://pal.codeplex.com/releases/view/51623#ReviewsAnchor. This is a newer version than I've used but there's an x86 version and it looks like it can be used on W2K3. – joeqwerty Dec 01 '11 at 03:05
  • I have set up a log file to record the NP Pool Bytes. Poolmon is now saying my nonpaged memory usage is 68MB. It has grown by about 2-3MB in the couple of hours that I have been trying to figure this out. But there is no corresponding growth (that I can see) in the NP values for the processes. In fact the NP Pool values against the individual processes are nowhere near this number. Even if I added up all the listed np pool values the total would be lucky to be 1MB not 68MB. But maybe I am missing something here. – Developer Dec 01 '11 at 05:43
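The "look for a 10% or more increase" check suggested in the comments can be sketched in Python instead of Excel. This is a minimal sketch, not a definitive tool: it assumes the counter log was saved as CSV with a timestamp in the first column and one counter per remaining column, and the column names and values in the sample are invented for illustration. The function name `growth_by_counter` is hypothetical.

```python
import csv
import io

def growth_by_counter(csv_text, threshold=0.10):
    """Return {counter_name: fractional_growth} for every counter whose
    last sample exceeds its first sample by more than `threshold`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    first, last = {}, {}
    for row in reader:
        # Column 0 is assumed to be the timestamp, so skip it.
        for name, cell in zip(header[1:], row[1:]):
            try:
                value = float(cell)
            except ValueError:
                continue  # blank or non-numeric sample
            first.setdefault(name, value)
            last[name] = value
    growth = {}
    for name, start in first.items():
        if start > 0:
            change = (last[name] - start) / start
            if change > threshold:
                growth[name] = change
    return growth

# Invented sample data: two counters, two samples each.
sample = (
    "Time,MailService,OtherService\n"
    "2011-12-01 03:00,100000,50000\n"
    "2011-12-01 05:00,120000,51000\n"
)
for name, change in growth_by_counter(sample).items():
    print(f"{name}: {change:.0%} growth")
```

Comparing only the first and last samples is a deliberate simplification; a process that spikes and then returns to its baseline will not be flagged.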

1 Answer


I have been monitoring this for about 6-7 weeks now and can finally give a definitive answer to the problem.

Firstly the Nonpaged Bytes for individual processes didn't really tell me anything useful as they all appeared to be fairly static in their usage. There were spikes but the usage always returned to the base line afterward.

The total Nonpaged Bytes was static for a while also, but then started gradually increasing and then spiking. After a spike about half the memory was freed, and then it remained static again (at the higher level) for a while until the pattern repeated. Looking at the graph I noticed that these spikes seemed to be fairly regularly spaced, and as it turns out they were happening 2 weeks apart and always on a Sunday.
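I spotted the pattern by eye on the graph, but the same check can be sketched in Python. This is only an illustration under stated assumptions: the samples are invented daily totals (not my real log data), the 5 MB jump threshold is arbitrary, and the function names `find_spikes` and `describe` are hypothetical.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def find_spikes(samples, min_jump):
    """samples: list of (date, total_nonpaged_bytes), sorted by date.
    Return the dates whose value rose by more than min_jump
    compared with the previous sample."""
    spikes = []
    for (_, prev), (day, value) in zip(samples, samples[1:]):
        if value - prev > min_jump:
            spikes.append(day)
    return spikes

def describe(spikes):
    """Return (weekday names, gaps in days between consecutive spikes)."""
    weekdays = [DAY_NAMES[d.weekday()] for d in spikes]
    gaps = [(b - a).days for a, b in zip(spikes, spikes[1:])]
    return weekdays, gaps

# Invented totals in bytes, shaped like the pattern described above:
# a spike, a partial release, then a plateau at a higher level.
samples = [
    (date(2011, 12, 3), 60_000_000),
    (date(2011, 12, 4), 75_000_000),
    (date(2011, 12, 5), 68_000_000),
    (date(2011, 12, 17), 69_000_000),
    (date(2011, 12, 18), 85_000_000),
    (date(2011, 12, 19), 77_000_000),
]
weekdays, gaps = describe(find_spikes(samples, min_jump=5_000_000))
print(weekdays, gaps)
```

Once the weekday and spacing are known, the spike dates can be lined up against Event Viewer or the scheduled task list.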

So the next question was: what is running every two weeks on Sundays? I had a look in Event Viewer, and every time a spike occurred McAfee was running. I also think that by logging onto the server frequently to monitor the issue we inadvertently made the problem worse, because McAfee has a real-time scanner and I believe this was causing the smaller increases we were seeing.

I think that the scans being scheduled tasks also explains why we saw the NP Memory increases attached to the Event objects tag in PoolMon instead of the McAfee-specific tag. This was the main thing that really led us down the garden path.

Now that we finally know what is causing the leaks we can do something about it. It's incredible that it took this long to track it down though.

UPDATE: Just as a final note: McAfee was updated on the weekend and this completely resolved our Non-Paged Memory problem.

UPDATE 2: Since I just got an up vote for this, I'll add a further update. Initially the update to McAfee did appear to fix our problem, i.e. we no longer see the massive spikes in NP Memory at regular intervals. I have also noticed that since the update McAfee no longer seems to write logs to the Event Viewer by default, which hides when it is actively scanning.

But we are still seeing gradual increases in NP memory usage. It's gotten to the point where we now need to reboot our server every 2 weeks or so. It's so bad that we recently acquired a new server in the hope that updated hardware and software would make this problem go away, BUT our completely new server with only Windows Server 2008, SQL Server 2008 R2, and McAfee installed was STILL showing an NP Memory leak. It was only after I completely removed McAfee that the leak stopped, and it has remained static even after we set up the server with all our software in preparation to switch over to it.

I have since read, and I don't know if this is true, that the problem isn't with McAfee, but with some Windows routine that McAfee uses that causes NP Memory to leak. Apparently, network activity is the cause of the leak i.e. more network activity => bigger leaks. This does seem to be consistent with our experience, in that the leak has gotten worse as our server has gotten busier.

Developer