Thanks, everyone. Any ideas or insight would be appreciated, because this is driving me crazy.

Problem: Only about 3 or 4 users can use the server simultaneously before the app grinds to a halt.

We currently see massive CPU spikes under normal usage. For reasons we don't understand, this is easier to reproduce with real users than with automated scripts, though it's possible the scripts simply don't simulate real usage well.

Our architecture is as follows:

  • App server (Tornado) - single-threaded, with an asynchronous I/O loop. We use Tornado to handle the persistent connections associated with long-polling, and send all basic web requests to Django via WSGI.
  • The Django ORM is used to interact with the database, although most SQL is hand-coded
  • MySQL database
  • Nginx serves static media and proxies other requests to Tornado
  • Everything is currently set up to run on one "Small" EC2 instance. Splitting the servers across separate machines doesn't have a noticeable impact on performance

See the EC2 instance types page (http://aws.amazon.com/ec2/instance-types/) for more details on the server configuration. Note: all in all, this isn't the ideal or most scalable setup, but it should be able to handle more than 3 users!

Running top & viewing logs reveals the following:

  • CPU spikes are attributed mostly to Tornado, about 25% extra CPU usage per active user
  • Low "steal-time", so our CPU power isn't being heavily throttled by EC2 (anymore)
  • DB queries all take between 0 and 200 ms when the CPU isn't spiking, but often 3 seconds or more during spikes
  • Memory usage is low and never spikes

Some things that have been tried to no avail:

  • Configure MySQL buffer sizes, indexes, etc. I'm 99% sure this isn't a garden-variety SQL/DB optimization issue
  • Improve query times and reduce number of queries in all sorts of ways
  • Put servers on separate EC2 instances
  • Proxy between multiple app servers (this would obviously be a lot more scalable, but it doesn't fix the 3-users-per-instance issue)
  • Upgrade the EC2 instance. Upgrading from "Micro" did help (due to CPU throttling issues) but only slightly increased our capacity
  • Deploy on non-EC2 server (Slicehost) - same problems
  • Load test each server individually with simple test cases. All of them were able to handle thousands of simultaneous connections
Phil W

1 Answer

"Tornado, about 25% extra CPU usage per active user" -- with a single-threaded app, if each user is chewing 25% of a core, you're only going to get to 4 users max before the app is saturating the only core it's capable of using. Work out why Tornado is such a ridiculous CPU hog (is your code bad, or is Tornado bad?), and your solution will fall out the bottom.

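As a rough illustration of that arithmetic, and of one way to start hunting for the hot spot, here is a minimal Python sketch. The names `max_concurrent_users`, `busy_handler` and `profile_call` are hypothetical and not from the setup described in the question; `busy_handler` only stands in for whatever request-handling code is actually burning CPU. It uses nothing beyond the standard library's `cProfile` and `pstats`.

```python
import cProfile
import io
import math
import pstats


def max_concurrent_users(cpu_share_per_user, cores=1):
    """Users that fit before the available cores are saturated.

    cpu_share_per_user is a fraction of one core (0.25 means 25%).
    A single-threaded process can only ever use one core.
    """
    if cpu_share_per_user <= 0:
        raise ValueError("cpu_share_per_user must be positive")
    return math.floor(cores / cpu_share_per_user)


def busy_handler(n=200_000):
    """Hypothetical stand-in for a CPU-heavy request handler."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=5):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


if __name__ == "__main__":
    # 25% of a core per user on a single-threaded server.
    print("users before saturation:", max_concurrent_users(0.25))
    _, report = profile_call(busy_handler)
    print(report)
```

Wrapping a real handler (or the WSGI call into Django) in something like `profile_call` while a few real users are active should show whether the time goes to your own code or to the framework. This is only a starting point and assumes the hot code can be invoked in isolation.
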
womble