
We have an API implemented with ServiceStack and hosted in IIS. During load testing of the API we discovered that response times are good, but that they deteriorate rapidly once we hit about 3,500 concurrent users per server. We have two servers behind a load balancer, and when hitting them with 7,000 total users (3,500 concurrent users per server) the average response times sit below 500 ms for all endpoints. However, as soon as we increase the total number of concurrent users we see a significant increase in response times. Increasing the load to 5,000 concurrent users per server gives us an average response time of around 7 seconds per endpoint.

Memory and CPU usage on the servers is quite low, both while the response times are good and after they deteriorate. At peak, with 10,000 concurrent users, the CPU averages just below 50% and RAM sits around 3-4 GB out of 16 GB. This leaves us thinking that we are hitting some kind of limit somewhere. The screenshot below shows some key counters in perfmon during a load test with a total of 10,000 concurrent users; the highlighted counter is requests/second. Towards the right of the screenshot the requests-per-second graph becomes very erratic, and this is the main indicator of slow response times: as soon as we see this pattern we notice slow response times in the load test.

perfmon screenshot with requests per second highlighted

How do we go about troubleshooting this performance issue? We are trying to identify if this is a coding issue or a configuration issue. Are there any settings in web.config or IIS that could explain this behaviour? The application pool is running .NET v4.0 and the IIS version is 7.5. The only change we have made from the default settings is to update the application pool Queue Length value from 1,000 to 5,000. We have also added the following config settings to the Aspnet.config file:

<system.web>
    <applicationPool 
        maxConcurrentRequestsPerCPU="5000"
        maxConcurrentThreadsPerCPU="0" 
        requestQueueLimit="5000" />
</system.web>
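
For reference, the thread-pool limits actually in effect can be checked at application start with something like the following (a hypothetical Global.asax snippet, not part of our current setup); the CLR only injects new worker threads gradually once the configured minimum is exceeded, which could be relevant to a pattern of rising latency at low CPU:

// Hypothetical diagnostic snippet (not part of our current setup): logs the
// CLR thread-pool limits at startup so thread availability can be compared
// against the load-test numbers.
using System;
using System.Diagnostics;
using System.Threading;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Start(object sender, EventArgs e)
    {
        int minWorker, minIo, maxWorker, maxIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);

        Trace.TraceInformation(
            "ThreadPool min worker/IO: {0}/{1}, max worker/IO: {2}/{3}",
            minWorker, minIo, maxWorker, maxIo);

        // Raising the minimums (illustrative values only) avoids waiting for
        // gradual thread injection during a sudden burst of requests.
        // ThreadPool.SetMinThreads(200, 200);
    }
}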

More details:

The purpose of the API is to combine data from various external sources and return it as JSON. It currently uses an in-memory cache implementation to cache individual external calls at the data layer. The first request to a resource fetches all of the required data, and any subsequent requests for the same resource get the results from the cache. We have a 'cache runner', implemented as a background process, that updates the information in the cache at set intervals. We have added locking around the code that fetches data from the external resources. We have also implemented the services that fetch the data from the external sources in an asynchronous fashion, so that an endpoint should only be as slow as the slowest external call (unless we have data in the cache, of course). This is done using the System.Threading.Tasks.Task class. Could we be hitting a limitation in terms of the number of threads available to the process?
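
A minimal sketch of the fan-out pattern described above might look like the following (illustrative only, not the actual code; the source names and CombinedResult type are assumed). Note that each Task.Factory.StartNew call holds a thread-pool worker thread for the duration of the blocking external call, and Task.WaitAll blocks the request thread as well, so under heavy load this pattern can exhaust the available pool threads even while CPU stays low.

// Illustrative sketch only - not the actual code. The source names and the
// CombinedResult type are assumed. Each StartNew call occupies a thread-pool
// worker thread for the duration of the blocking external call, and WaitAll
// blocks the request thread until the slowest call returns.
using System.Threading.Tasks;

public class CombinedResult
{
    public string SourceA { get; set; }
    public string SourceB { get; set; }
}

public class AggregationService
{
    public CombinedResult GetCombined(string resourceId)
    {
        var taskA = Task.Factory.StartNew(() => FetchSourceA(resourceId));
        var taskB = Task.Factory.StartNew(() => FetchSourceB(resourceId));

        Task.WaitAll(taskA, taskB); // as slow as the slowest external call

        return new CombinedResult { SourceA = taskA.Result, SourceB = taskB.Result };
    }

    private string FetchSourceA(string resourceId)
    {
        // Placeholder for the blocking call to the first external API.
        return "A:" + resourceId;
    }

    private string FetchSourceB(string resourceId)
    {
        // Placeholder for the blocking call to the second external API.
        return "B:" + resourceId;
    }
}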

  • How many cores does your CPU have? Perhaps you're maxing out one core. When the magic number is 50%, 25% or 12.5%, that suggests that you've maxed out a core and for some reason aren't able to use the other cores that are sitting idle. Check for a maxed out core. – David Schwartz Nov 13 '13 at 03:24
  • @DavidSchwartz The servers have 4 cores each. Unfortunately I don't have data for each CPU while the load test was running. – Christian Hagelid Nov 13 '13 at 04:55
  • If you're using connection strings, IIRC, the default max pool size is 100. You may want to look at specifying something higher in your web.config. – tacotuesday Nov 13 '13 at 07:14
  • @nojak There are no DB connections. The external systems are all external APIs. – Christian Hagelid Nov 13 '13 at 11:33
  • Have you got one thread per request? So for 5000 requests have you got 5000 threads? If you do then that is likely your problem. You should instead create a thread pool and use the thread pool to process the requests, queuing up the requests as they come in to the thread pool. When a thread has finished with a request it can process a request off the queue. This sort of discussion is best for Stack Overflow. Too many threads means too many context switches. – hookenz Nov 20 '13 at 23:42
  • Just a sanity check here, have you tried turning off all of your background processes and seeing what the behavior is just for the JSON returning static data from the cache? In other words, making your JSON requests serve static data and removing the "external async calls" that refresh your cache completely. Also, depending on the amount of JSON data being served on every request, have you thought about your network throughput and whether requests are starting to back up because the servers just can't push the data out fast enough? – Robert Nov 20 '13 at 23:52
  • +1 to David's suggestion above. You should really redo the test and look carefully at each core's utilisation. I'd suggest you do this ASAP to eliminate it if nothing else. Secondly, I'm a bit suspicious of your cache. Lock contention can show exactly this kind of behaviour - at some critical point locks cause delays, which in turn cause locks to be held for longer than normal, causing a tipping point where things go downhill rapidly. Can you share your caching and locking code? – steve cook Nov 25 '13 at 15:27
  • The graph of requests-per-second is interesting, but I'm having a hard time reading the fine print. Can you explain what each of the components is, particularly the line in the middle? I think it is the current connections, but I'm not sure. Can you re-post a larger screenshot? –  Nov 21 '13 at 15:47
  • What is the disk setup for the servers (assuming that since they're load balanced the disk setup is the same)? Can you post all the specs for the drives/servers in your initial post? Have you thrown a perfmon on the physical drive(s) that IIS and the IIS log files exist on? It's quite possible you may be experiencing problems with the disk, in that 3,500 requests = 3,500+ IIS log entries. If they're on the same disk/partition you could have a big problem there. – Techie Joe Nov 28 '13 at 19:53
  • Another point to check is CPU cache efficiency. It is probable that the task does not fit the cache size after 3,500 concurrent connections. In a hyperthreaded environment, simple CPU load monitoring may not produce accurate results. Intel PCM can help, IINM. – Veniamin Nov 30 '13 at 12:35
  • @ChristianHagelid did you find any answer to this query? We are also using a centralised cache and ServiceStack. – codebased Jun 26 '18 at 23:54

1 Answer


Following on from @DavidSchwartz and @Matt, this looks like a thread and lock management issue.

I suggest:

  1. Freeze the external calls and the cache generated from them, and run the load test with static external information, just to rule out any issue that is not related to the server/environment side.

  2. Use thread pools if you are not already using them.

  3. Regarding the external calls, you said: "We have also implemented the services to fetch the data from the external sources in an asynchronous fashion so that the endpoint should only be as slow as the slowest external call (unless we have data in the cache of course)."

Questions are:

  • Have you checked whether any cache data is locked during the external call, or only while the result of the external call is being written into the cache? (too obvious, but it must be said)
  • Do you lock the whole cache or small parts of it? (too obvious, but it must be said)
  • Even though they are asynchronous, how often do the external calls run? Even if they do not run very often, they could be blocked by an excessive number of cache accesses from the user requests while the cache is locked. This scenario usually shows a fixed percentage of CPU usage, because many threads are waiting at fixed intervals and the "locking" itself also has to be managed.
  • Have you checked whether the external calls' response times also increase when the slowdown kicks in?
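
To illustrate the locking-granularity point, a per-key lock lets a slow external fetch block only the requests that need the same resource, while every other request keeps reading the cache untouched. This is only a minimal sketch with assumed names, not the poster's caching code:

// Minimal per-key locking sketch (assumed names, not the poster's code):
// only requests for the same cache key wait on the fetch; all other
// requests keep reading the cache without contention.
using System;
using System.Collections.Concurrent;

public class PerKeyCache
{
    private readonly ConcurrentDictionary<string, object> _values =
        new ConcurrentDictionary<string, object>();
    private readonly ConcurrentDictionary<string, object> _locks =
        new ConcurrentDictionary<string, object>();

    public object GetOrFetch(string key, Func<object> fetch)
    {
        object value;
        if (_values.TryGetValue(key, out value))
            return value;                       // cache hit, no lock taken

        var gate = _locks.GetOrAdd(key, _ => new object());
        lock (gate)                             // serialises fetches per key only
        {
            if (!_values.TryGetValue(key, out value))
            {
                value = fetch();                // the slow external call
                _values[key] = value;
            }
        }
        return value;
    }
}

If the current implementation takes a single lock around the whole cache, or around every external call, the tipping-point behaviour described in the comments becomes much more likely.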

If the problem still persists, I'd suggest avoiding the Task class and making the external calls through the same thread pool that manages the user requests, in order to avoid the previous scenario.
