X-Post from StackOverflow:

https://stackoverflow.com/questions/9465123/intermittent-high-cpu-100-on-production-webserver

We have a web cluster with 3 web servers, each with 24 cores & 24 GB of memory.

Our application is fully patched ASP.NET 4.0 with MVC 3, on IIS 7.5, in its own application pool.

Very intermittently (maybe once every 2-3 days), one of the web servers will stop serving requests, and all 24 cores will show 100% CPU (memory & disk look normal).

The few times when IIS Manager isn't completely frozen, the currently executing requests don't seem to offer any useful information - just a pretty random spread across a large number of site areas/requests.

Once a server has died, we are able to take it out of the load balancer - and after maybe 5 minutes of no longer serving requests, the CPU activity will drop back to normal - making us think it isn't an infinite loop.

A memory dump of the worker process (around 4GB in size!) doesn't seem to show any of our code/namespaces anywhere in any of the managed stack traces - just the vanilla .NET begin-request plumbing. (It's possible I'm using WinDbg wrong and not loading our symbols correctly - but the stack traces don't show any missing/unnamed method calls, so I'm quite confused.)
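For what it's worth, a typical SOS sequence for this kind of dump looks like the following (assuming .NET 4, so SOS loads from clr.dll; the symbol folder path is just a placeholder for wherever your own PDBs live):

```
.symfix; .reload                 $$ use the Microsoft public symbol server
.sympath+ c:\symbols\ourapp      $$ hypothetical path - add your own PDB folder so your frames resolve
.loadby sos clr                  $$ load SOS for .NET 4 (on 2.0/3.5: .loadby sos mscorwks)
!runaway                         $$ user-mode CPU time per thread - find the spinners
!threads                         $$ managed thread list, pending exceptions
~*e !clrstack                    $$ managed stack for every thread
!syncblk                         $$ owners of contended monitors
```

If the stacks still show only pipeline frames after symbols load cleanly, the threads may genuinely be parked in the request pipeline (e.g. blocked waiting on a lock) rather than running your code - `!syncblk` is the quickest way to check that.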

Our servers are normally processing 1000 req/sec quite happily, so this is all very strange.

One weird thing we noticed in Perfmon was that the Contention Rate / sec goes up to around 800. We don't have any fancy multi-threaded code in our app, and the only locks we have are in our caching code (which hasn't changed in ages).
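For context on how a contention rate that high can show up even without "fancy" multi-threaded code: a single coarse lock in a cache is enough once hundreds of request threads arrive at it together. A made-up sketch (these names are hypothetical, not the actual caching code):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache with one global lock. After an app-pool recycle or a
// traffic burst, every request thread serializes here; each blocked
// Monitor.Enter bumps the ".NET CLR LocksAndThreads \ Contention Rate / sec"
// counter that Perfmon is reporting.
public static class SiteCache
{
    private static readonly object Sync = new object();
    private static readonly Dictionary<string, object> Items =
        new Dictionary<string, object>();

    public static object GetOrAdd(string key, Func<object> factory)
    {
        lock (Sync)                  // every reader takes the same lock
        {
            object value;
            if (!Items.TryGetValue(key, out value))
            {
                value = factory();   // expensive work held under the lock
                Items[key] = value;  // makes the convoy worse
            }
            return value;
        }
    }
}
```

On .NET 4, a `ConcurrentDictionary<string, object>` with `GetOrAdd` avoids the global lock entirely; and in a dump, `!syncblk` will show whether a lock like `Sync` is the one being fought over.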

Any advice/tips on how to further diagnose this issue would be most appreciated.

Cheers.

Dave
  • Just a possible reason: ASP.NET recycles app pools periodically. When there are a lot of requests, IIS may start queuing them during the recycle. When the app pool comes back, there are a lot of requests waiting plus new ones, so IIS starts processing a lot of them at once => it eats CPU/memory/whatever it needs. Not enough resources => slow => more requests queued => IIS recycles app pools more often => snowball. –  Feb 27 '12 at 12:43
  • Maybe a StackOverflowException, an infinite loop, or runaway recursion? –  Feb 27 '12 at 12:45
  • Come on guys, do not post TWICE. – TomTom Feb 27 '12 at 13:33
  • Yeah - We are worried it's some kind of O(n^n) weirdness or some other crap algorithm that's somehow made it into production - however, like I said, in our memory dumps, you'd expect to see our code/namespaces, so we could figure out what the offending method may be. But there's just vanilla .Net calls - nothing proprietary. Weird. – Dave Mar 02 '12 at 11:54

3 Answers

Dave, a few thoughts to start you off:

I am assuming it's w3wp.exe that is eating your resources. If not, it might be worth running some PAL reports to get better insight into the overall health of the server: http://pal.codeplex.com/ Sometimes I'll run PAL even if it is an IIS problem... PAL can spot all sorts of problems that you would never think of.

Check Performance Monitor (both before and during your spike) and try to figure out whether your ASP.NET Applications Requests/Sec is higher during the "slow response" periods... I find that to be the fastest way to tell whether you are handling more requests than normal.
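If you want a quick baseline without setting up a full Perfmon data collector, typeperf can capture just the interesting counters to CSV (counter names assume an English locale, and the output path is just an example):

```
typeperf "\ASP.NET Applications(__Total__)\Requests/Sec" ^
         "\.NET CLR LocksAndThreads(_Global_)\Contention Rate / sec" ^
         "\Process(w3wp)\% Processor Time" ^
         -si 5 -sc 720 -f CSV -o c:\perflogs\baseline.csv
```

That samples every 5 seconds for an hour; diffing a "healthy" capture against one taken during a spike usually makes the anomalous counter obvious.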

Try to figure out if there is one page (or a few) that is taking longer to load. Make sure IIS logging is enabled, then look for an increase in the time-taken field. Check out Log Analyzer (http://www.iis.net/community/default.aspx?tabid=34&g=6&i=1864).

Oh, and don't forget the Stack Exchange MiniProfiler (http://code.google.com/p/mvc-mini-profiler/) once you figure out which URL is causing the problem.

Also, don't overlook any .NET error catching you have in place :-)

Let us know what you see. -Chris

Chris Anton
  • Hi Chris - thanks for your help. Annoyingly, we don't generate log files, as we literally don't have the disk space to hold them (50 GB+/day). Requests/sec is pretty standard compared with the other servers. Also, we actually use MiniProfiler, and it has shown no issues so far. The weird thing about this problem is its intermittent nature :( – Dave Feb 28 '12 at 10:20
  • Dave, can you confirm that w3wp.exe is consuming the processor? Perhaps you could enable IIS logging during the spike? The other option is to set up Failed Request Tracing for requests taking longer than x. That would show any problem at the IIS level. As you suspect, though, the problem likely lies at the .NET level. – Chris Anton Feb 28 '12 at 14:26
  • Yeah - it's w3wp.exe - thanks again for any help. – Dave Mar 02 '12 at 11:44

Use DebugDiag 1.2 to perform the analysis of the dump:

https://www.microsoft.com/download/en/details.aspx?id=26798

It's useful to be aware that any process capable of using more than one thread can push utilization to 100% on all processors of a server. This includes native code and even core OS components.

When you say "latest patched", to me that means via Windows Update, which does not deliver a lot of the more serious bug fixes for Windows 2008 R2.

In particular, if the application is accessing any files on remote shares, it would be a good idea to have the file system hotfixes applied:

List of currently available hotfixes for the File Services technologies in Windows Server 2008 and in Windows Server 2008 R2
http://support.microsoft.com/kb/2473205

Greg Askew

Check whether it's being targeted by a HashDoS attack, and set up request limits.
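For reference, the MS11-100 patch caps the number of form/query-string keys via an appSettings switch, and IIS request filtering can bound request sizes. A minimal web.config sketch (the values shown are the usual defaults/examples - tune to taste):

```xml
<configuration>
  <appSettings>
    <!-- Added by MS11-100; caps keys per form/query collection (patch default: 1000) -->
    <add key="aspnet:MaxHttpCollectionKeys" value="1000" />
  </appSettings>
  <system.webServer>
    <security>
      <requestFiltering>
        <!-- example limits: 10 MB bodies, 2 KB query strings -->
        <requestLimits maxAllowedContentLength="10485760" maxQueryString="2048" />
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
```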

jamespo
  • We actually looked into that and have applied the recent MS patch that supposedly mitigates this issue. We also ran it through a website that attempts to attack your site, and we 'passed' (I can't bloody remember the site, though). – Dave Feb 27 '12 at 13:04
  • You could test with https://github.com/FireFart/HashCollision-DOS-POC – jamespo Feb 28 '12 at 15:19