8

OK, our new build is causing 100% CPU spikes on each server at random intervals. The spikes last long enough to make the site totally unresponsive. This happens at peak times, as people in different countries log on to the site.

We've looked at perfmon, memory profilers, CLR Profiler, SQL profilers and the Red Gate ANTS profiler, and tried load testing in UAT, but we cannot reproduce the problem. This could mean that only thousands of users hitting the live site cause it to happen.

One pattern we did notice was that the new code - the broken build - actually uses noticeably fewer threads.

We are also using Spring for IoC - does this have a bad reputation?

To make things worse, we cannot deploy to live due to the business impact - so we cannot narrow the problem down to a subset of the new features we've added.

We truly are destroyed - has anyone got any battle scars that may save us a few lives?

  • What do the temperature sensors report? I wonder if your power supply can't keep up. (No idea how to check this.) – sarnold Nov 23 '11 at 12:10
  • 2
    When you say it brings the server down, can you add more detail? Is it a BSOD? Do you mean it restarts, or maybe an app domain restart? –  Nov 23 '11 at 12:13
  • There is no way at all a "100% cpu *spike*" could "bring down" the server. It would have to be pegged at 100% for quite a long while, coupled with trouble with heat dissipation. – Andrew Barber Nov 23 '11 at 12:16
  • 1
    What is it doing?? Which process is using the CPU at the peak? This is the most important question. – Aliostad Nov 23 '11 at 12:17
  • Updated my question - is this better? Thanks for the -1 :) –  Nov 23 '11 at 12:28
  • Have you even tried correlating requested pages right before the 'lockup' with the time they start? – Andrew Barber Nov 23 '11 at 12:31
  • How often are the worker processes being recycled? –  Nov 23 '11 at 14:41
  • We have the logs from IIS, but it is hard to correlate which particular requests cause the problem –  Nov 23 '11 at 18:33
  • Not sure about the worker processes being recycled? What impact could that have? –  Nov 23 '11 at 18:34
  • Ok - our worker threads are being recycled once per day –  Nov 23 '11 at 18:36
  • One usual suspect would be database locking. What ORM are you using? Also, what are the major architectural differences between the old and new code? – Greg Askew Nov 24 '11 at 17:08

5 Answers

3

I suggest taking memory dumps and analyzing them in WinDbg with SOS. I fixed some problems in our production environment that I probably wouldn't have been able to diagnose without WinDbg.

Tess Ferrandez has a great blog where you can learn how to analyze memory dumps.

  • that blog is an excellent resource and we have been using it. Our problem is we can't recreate the problem again and get the dumps. –  Nov 24 '11 at 12:34
  • 1
    To recreate the problem, you may hammer your test system with jmeter (http://jmeter.apache.org/) and ab (http://httpd.apache.org/docs/2.0/programs/ab.html). With these, multicores, a fast LAN and some colleagues, you should be able to stress the server enough. – Roman Nov 24 '11 at 16:05
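For a quick smoke test before setting up JMeter or ab, a few lines of script can already push concurrent requests at a UAT box. The following is only a rough stand-in for those tools, not a replacement: the target URL, request count and concurrency are placeholders, and `run_load`, `timed_call` and `fetch` are names made up for this sketch.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=30):
    """Fetch the URL once and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        return resp.status


def timed_call(fetch_fn, url):
    """Run one request; return (status or None on error, elapsed seconds)."""
    start = time.perf_counter()
    try:
        status = fetch_fn(url)
    except Exception:
        # Timeouts, connection errors and HTTP 4xx/5xx all count as failures.
        status = None
    return status, time.perf_counter() - start


def run_load(url, total=200, concurrency=20, fetch_fn=fetch):
    """Send `total` requests using `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(fetch_fn, url), range(total)))
    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": total,
        "failures": sum(1 for status, _ in results if status != 200),
        "median_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }


if __name__ == "__main__":
    # Placeholder URL: point this at a test system, never at the live site.
    print(run_load("http://uat.example.invalid/", total=50, concurrency=10))
```

A single machine running this will not come close to thousands of real users, so treat a clean run as "no obvious problem at low load" rather than as proof the build is fine.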
1

This is typically caused by the GC cleaning up large, long-lived objects (Stack Overflow had this problem, see link). Are you storing lots of object collections in cache or session?

Assault by GC

I also recommend you build and configure a new server in production to test. If you have random craziness, don't know why and can't reproduce it, I'd point the finger at hardware or configuration, not code.

rick schott
  • We can't put any new code live because it adds new features. When the code was live, the GC usage was the same - including for generation 2. Thanks though - do you have any more suggestions? –  Nov 23 '11 at 12:49
  • It's not impossible, but the hardware and configuration are nearly the same as the last deploy which we have reverted back to and is working successfully. –  Nov 23 '11 at 18:36
1

Is this a virtual server with shared resources or a physical server? If it is the former, perhaps you could look at dedicating resources to this server. Good luck...

0

Try using a cache server as a frontend like Apache Traffic Server (ATS).

While this will not resolve the problem, it may help to identify it: you move the potentially harmful load off the backend (and can see whether the frontend has the same problems), and with less pressure on the backend it will be easier to see what's wrong.

Gil
0

Trying to guess the fault without the data is pointless. Yes, someone on Stack Overflow or in your engineering team might get lucky, but that's just bad engineering: you can't plan how long it will take to try every guess, or know whether any of them would even find the problem.

  1. You have to repro the problem. JMeter is a good start because of its breadth, but we can't recommend the right tool without knowing your architecture.
  2. Logging, especially in your application layer, is a must. You can enable IIS traces for slow performance, but the muppets at Microsoft made it so you can't capture the entire pipeline flow when it's slow. If it is so difficult to repro, you'd really like some logs to help you narrow down where the problem is (like "oh, it's whenever we call this stored proc").

The 100% CPU is a little suspicious in the sense that it's unlikely to be I/O (provided the DB is on another box, a slow database should not cause 100% CPU on the web servers). You need to look closer to home.
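As a starting point for narrowing things down from logs you already have, here is a minimal sketch that lists which URLs were hit in the minute before a known spike, ranked by total time taken. It assumes IIS W3C extended logs with the `date`, `time`, `cs-uri-stem` and `time-taken` fields enabled, and that you have a spike timestamp (for example from perfmon) in the same time zone as the log; W3C logs are written in UTC. The function names and the sample log lines are invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Invented sample data in W3C extended format, for illustration only.
SAMPLE_LOG = """#Fields: date time cs-method cs-uri-stem sc-status time-taken
2011-11-23 12:09:10 GET /home.aspx 200 120
2011-11-23 12:09:40 GET /search.aspx 200 9500
2011-11-23 12:09:55 GET /search.aspx 200 11000
2011-11-23 12:20:00 GET /home.aspx 200 90
""".splitlines()


def parse_w3c_log(lines):
    """Yield one dict per request, using the most recent #Fields header."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):  # skip truncated lines
                yield dict(zip(fields, values))


def requests_before(entries, spike_time, window_seconds=60):
    """Return (url, hits, total time-taken in ms) for the window, slowest first."""
    start = spike_time - timedelta(seconds=window_seconds)
    hits, total_ms = Counter(), Counter()
    for entry in entries:
        stamp = datetime.strptime(
            entry["date"] + " " + entry["time"], "%Y-%m-%d %H:%M:%S"
        )
        if start <= stamp <= spike_time:
            url = entry["cs-uri-stem"]
            hits[url] += 1
            total_ms[url] += int(entry.get("time-taken", 0))
    rows = [(url, hits[url], total_ms[url]) for url in hits]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    spike = datetime(2011, 11, 23, 12, 10, 0)  # hypothetical spike time
    for url, count, ms in requests_before(parse_w3c_log(SAMPLE_LOG), spike):
        print(url, count, ms)
```

Running this over several spikes and comparing the top entries is a cheap way to see whether the same pages keep showing up. It won't tell you why a page is slow, only where to add the application-level logging first.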

M Afifi