
I've inherited a website that uses a lot of session state. We've recently experienced continuous high CPU (~95-100%) for prolonged periods of time.

When debugging with DebugDiag, it shows ~3 GB on the Large Object Heap, which I believe is collected as part of Gen 2 by the GC and could be a cause of the high CPU.

I have practically zero experience debugging such scenarios, but does the above sound like a plausible reason for the high CPU?

Thanks.

Steve
  • Did you check whether you got application pool restarts due to high memory? That would be more likely - pool restarts, compiler has to do a lot of work again. 33gb is a lot for large object heap.... find out what is there and fix that. – TomTom Dec 08 '14 at 16:37
  • @TomTom It's 3gb for the LOH, not 33gb. Is that still a high amount? I'll look at the app pool restarting. Thanks – Steve Dec 08 '14 at 16:39
  • It is. IIS recommendation is using 32 bit for application pools and 3gb is their memory limit. You can switch to 64 bit and/ or use a web garden setup (1 application pool instance per socket). – TomTom Dec 08 '14 at 16:41
  • @TomTom Thanks, I thought that may be the case. It gives me a starting point at least. – Steve Dec 08 '14 at 16:43
  • You could think about using the .NET State Service rather than using inproc as well. This would actually be required for the web garden support TomTom mentioned. – Nathan C Dec 08 '14 at 17:24
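
For anyone following the State Service suggestion above, here is a hedged web.config sketch for moving session state out of process; the connection string and timeout are example values only, and the ASP.NET State Service (aspnet_state) must be running on the target machine:

    <!-- Sketch only: switch from InProc to the ASP.NET State Service, which
         also allows the web-garden setup mentioned above. Values are examples. -->
    <system.web>
      <sessionState mode="StateServer"
                    stateConnectionString="tcpip=localhost:42424"
                    timeout="20" />
    </system.web>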

1 Answer


You can verify whether GC is the issue by using Performance Monitor and the '.NET CLR Memory\% Time in GC' performance counter. If you only have one .NET process on the server, you can just use the _Global_ instance. Otherwise you'll have to find the instance with a matching process ID and watch that one (though be aware that the instance name for your application can change on the fly as other apps start up or shut down).
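
For anyone who prefers to watch this from code rather than the PerfMon UI, here is a minimal C# sketch (assumptions: .NET Framework, and that the instance name has already been looked up; "w3wp" below is a placeholder and may be "w3wp#1" etc. on a box with several worker processes):

    // Sketch only: poll "% Time in GC" for one worker-process instance.
    using System;
    using System.Diagnostics;
    using System.Threading;

    class GcTimeMonitor
    {
        static void Main()
        {
            const string instance = "w3wp"; // placeholder instance name

            using (var timeInGc = new PerformanceCounter(
                ".NET CLR Memory", "% Time in GC", instance, readOnly: true))
            {
                // The first sample of a calculated counter isn't meaningful,
                // so read in a loop and watch the trend alongside CPU.
                for (int i = 0; i < 60; i++)
                {
                    Console.WriteLine("{0:T}  % Time in GC = {1:F1}",
                        DateTime.Now, timeInGc.NextValue());
                    Thread.Sleep(1000);
                }
            }
        }
    }

If the percentage jumps at the same moments the CPU does, that points the same way as the PerfMon graph.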

If spikes in this counter correspond to the CPU spikes, garbage collection is your issue. You will need to look for leaks, allocate fewer objects, keep objects small enough to stay off the LOH (under ~85,000 bytes), keep them around for less time, reuse them, and/or eliminate finalizers (destructors). Each of these reduces the time spent locked up in GC. Ironically, too much caching can make your site inconsistently unresponsive: cached items eventually end up in Gen 2, and request processing pauses while the GC sweeps through every item in Gen 2. As memory pressure increases, the frequency of these pauses increases until eventually your requests are completely starved out.
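
To make the LOH point concrete, here is a small sketch (assuming .NET Framework and the roughly 85,000-byte LOH threshold; the sizes and the shared-buffer name are illustrative) showing why large arrays are expensive and why reusing one buffer beats allocating a new one per request:

    // Sketch only: arrays of roughly 85,000 bytes or more go straight to the
    // large object heap and are only collected with Gen 2.
    using System;

    class LohExample
    {
        // One buffer reused across calls instead of a fresh large array each time.
        private static readonly byte[] SharedBuffer = new byte[1024 * 1024];

        static void Main()
        {
            var small = new byte[80000];   // below the threshold: starts in Gen 0
            var large = new byte[100000];  // above it: allocated on the LOH

            Console.WriteLine(GC.GetGeneration(small));        // typically 0
            Console.WriteLine(GC.GetGeneration(large));        // 2 on .NET Framework
            Console.WriteLine(GC.GetGeneration(SharedBuffer)); // also 2; reuse it, don't reallocate
        }
    }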

James
  • Thanks for the answer. This definitely correlates with what I have since seen with the performance counters. There's definitely a link between the % time in GC and the CPU spikes. There are also far too many generation 2 collections happening. I believe the ideal scenario is a factor of 10 between each generation, but we are experiencing far worse than this. I suspect it's poor use of the session state that's the culprit, along with a few large objects on the LOH. For now, we've thrown some new hardware at it, which has alleviated the problem and given us time to fix the actual issue. – Steve Dec 13 '14 at 08:36
  • Thanks! For anyone else who may stumble along here: I bet your new hardware has a faster CPU, but little or no more memory. More memory actually makes the problem worse most of the time, because you end up with more in the cache, so it takes **longer** to GC. We bought servers with 128GB of RAM thinking that more cache could only help. How wrong we were: we ended up with 10-20 second freezes in the application when cache usage went above 64GB. – James Dec 13 '14 at 16:33
  • We have upgraded both CPU and RAM, more than doubling the RAM. We only did this on Thursday evening, though, so we haven't had much time to monitor it. I'll report back on how it goes over the next week or so. – Steve Dec 13 '14 at 16:46
  • Yes, I'd like to hear how it went. Even though it sounds completely counterintuitive, I suspect you'd be better off taking out half the new RAM. More memory will give you better performance at the start, because the GC decides there's plenty of RAM so there is no need to run. Once it hits the threshold where it decides it needs to start cleaning up, things start to slow dramatically. You really need to examine what is using the RAM and reduce the number of objects that hang around for a long time. We ended up putting a time limit of between 5 and 15 minutes on everything we put into the cache (a sketch of that approach follows after these comments). – James Dec 13 '14 at 16:57
  • Thanks James. I'll monitor it over the next few days. I suspect it's the amount of session state used in the website, which was developed around 5 years ago by an outside agency and then built upon by successive developers. At this point, there's an awful lot of stuff stored in the session. I've reduced the timeout as much as I dare (15 minutes). It definitely needs re-architecting. – Steve Dec 13 '14 at 19:25
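
Following the comment above about capping cache lifetimes, here is a hedged sketch of one way to do it with the ASP.NET cache; the helper name, key, value, and the 15-minute window are examples only:

    // Sketch only: insert cache items with a short absolute expiration so
    // long-lived entries don't pile up in Gen 2.
    using System;
    using System.Web;
    using System.Web.Caching;

    public static class CacheHelper
    {
        public static void AddWithExpiry(string key, object value)
        {
            HttpRuntime.Cache.Insert(
                key,
                value,
                null,                            // no cache dependency
                DateTime.UtcNow.AddMinutes(15),  // absolute expiration
                Cache.NoSlidingExpiration);      // no sliding window
        }
    }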