X-Post from StackOverflow:
https://stackoverflow.com/questions/9465123/intermittent-high-cpu-100-on-production-webserver
We have a web cluster with 3 web-servers, each with 24 cores & 24GB mem.
Our application runs on the latest patched ASP.NET 4.0 with MVC3, on IIS 7.5, in its own application pool.
Very intermittently (maybe once every 2 to 3 days), one of the web servers will stop serving requests, and all 24 cores will show 100% CPU (memory and disk look normal).
The few times when IIS manager isn't completely frozen, the active running requests don't seem to offer any useful information, with a pretty random spread across a large number of site areas/requests.
Once a server has died, we are able to take it out of load, and after maybe 5 minutes of no longer serving requests the CPU activity drops back to normal, which makes us think it isn't an infinite loop.
A memory dump of the worker process (around 4GB in size!) doesn't seem to show any of our code/namespaces anywhere in any of the managed stack traces, only .NET begin-request frames. It's possible I'm using WinDbg wrong and not loading our symbols correctly, but the stack traces don't show any missing/unnamed method calls, so I'm quite confused.
Our servers are normally processing 1000 req/sec quite happily, so this is all very strange.
One weird thing we noticed in Perfmon was that Contention Rate / sec climbs to around 800 during an incident. We don't have any fancy multi-threaded code in our app, and the only locks we have are in our caching code (which hasn't changed in ages).
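To make the lock question concrete, here is a minimal sketch of the general pattern I mean by "locks in our caching code". It is not our actual code (that is C#); it is written in Python for brevity, and every name in it (`LockedCache`, `get_or_add`, `hammer`) is hypothetical. It assumes a single lock guarding a dictionary, and it counts how often a thread had to wait for that lock, which is roughly the kind of event the Contention Rate counter reports.

```python
import threading


class LockedCache:
    """Hypothetical cache: one lock guards one dict (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self.contended = 0  # acquisitions that had to wait for the lock

    def get_or_add(self, key, factory):
        # Try a non-blocking acquire first so a wait can be detected.
        waited = not self._lock.acquire(blocking=False)
        if waited:
            self._lock.acquire()  # block until the current holder releases
        try:
            if waited:
                self.contended += 1  # updated while holding the lock
            if key not in self._data:
                # The factory runs while the lock is held, so a slow
                # factory makes every other caller queue up behind it.
                self._data[key] = factory(key)
            return self._data[key]
        finally:
            self._lock.release()


def hammer(cache, n_threads, key, factory):
    """Call get_or_add from n_threads threads and collect the results."""
    results = []

    def worker():
        results.append(cache.get_or_add(key, factory))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a pattern like this, a single slow cache fill under 1000 req/sec would show up as a burst of contended acquisitions, which is why the cache locks are my first suspect even though that code hasn't changed.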
Any advice/tips on how to further diagnose this issue would be most appreciated.
Cheers.