Gunicorn does not respond to more than 6 requests at a time

Question

To give you some context:

I have two server environments running the same app. The first, which I intend to abandon, is a Standard Google App Engine environment that has many limitations. The second one is a Google Kubernetes cluster running my Python app with Gunicorn.

Concurrency

On the first server, I can send multiple requests to the app and it answers many of them simultaneously. I ran two batches of simultaneous requests against the app in both environments. On Google App Engine, the first and second batches were answered simultaneously, and the first didn't block the second.

On Kubernetes, the server only responds to 6 requests simultaneously, and the first batch blocks the second. I've read some posts on how to achieve Gunicorn concurrency with gevent or multiple threads, and all of them say I need CPU cores, but no matter how much CPU I give it, the limitation remains. I've tried Google nodes from 1 vCPU to 8 vCPUs and it doesn't change much.

Can you give me any ideas on what I might be missing? Maybe a limitation of the Google cluster nodes?
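For anyone who wants to repeat the two-batch test outside a browser, here is a minimal sketch using only the Python standard library. It is not the script used for the waterfalls in this question; the function names, the 0.5 second gap, and the example URL are assumptions for illustration. It fires all requests at once and groups the response times into "waves": a single wave suggests the requests were served together, while several waves suggest some requests waited for others to finish.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_get(url, timeout=60):
    """Fetch url once and return how long the full response took, in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start


def run_batches(url, batch_size=6, batches=2):
    """Send batch_size * batches requests at the same time, one thread each."""
    total = batch_size * batches
    with ThreadPoolExecutor(max_workers=total) as pool:
        return list(pool.map(timed_get, [url] * total))


def finish_waves(durations, gap=0.5):
    """Group sorted durations into clusters separated by more than gap seconds."""
    waves = []
    for d in sorted(durations):
        if waves and d - waves[-1][-1] <= gap:
            waves[-1].append(d)
        else:
            waves.append([d])
    return waves


# Example call (hypothetical URL, replace with the address of either environment):
# print([len(w) for w in finish_waves(run_batches("http://localhost:8000/"))])
```

The result is easiest to read when the tested endpoint takes noticeably longer than the chosen gap to respond.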

Kubernetes response waterfall

As you can see, the second batch only started receiving responses once the first one started to finish.

App Engine response waterfall

Gunicorn configuration

I've tried the standard worker class with the recommended setting of 2 * cores + 1 workers, and with 12 threads.

I've also tried gevent with --worker-connections 2000.

None of them made a difference. The response times were very similar.
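For reference, the settings tried above can also be written as a Gunicorn configuration file, which is plain Python. This is only a sketch: the file name and the `recommended_workers` helper are assumptions, while `bind`, `timeout`, `workers`, `threads`, `worker_class` and `worker_connections` are regular Gunicorn settings.

```python
# gunicorn.conf.py (a sketch; values mirror the settings mentioned in the question)
import multiprocessing


def recommended_workers(cores):
    """Commonly recommended starting point for Gunicorn: (2 x cores) + 1 workers."""
    return 2 * cores + 1


bind = "0.0.0.0:8000"
timeout = 7200
workers = recommended_workers(multiprocessing.cpu_count())
# With the default sync worker class, a threads value above 1 makes
# Gunicorn switch to the threaded (gthread) worker.
threads = 12

# Alternative tried in the question: gevent instead of threads.
# worker_class = "gevent"
# worker_connections = 2000
```

Assuming the file sits next to the app, it could be loaded with `gunicorn -c gunicorn.conf.py main:app`.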

The container section of my Kubernetes file:

    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: default-pool
      containers:
      - name: python-gunicorn
        image: gcr.io/project-name/webapp:1.0
        command:
          - /env/bin/gunicorn
          - --bind
          - 0.0.0.0:8000
          - main:app
          - --chdir
          - /deploy/app
          #- --error-logfile
          #- "-"
          - --timeout
          - "7200"
          - -w
          - "3"
          - --threads
          - "8"
          #- -k
          #- gevent
          #- --worker-connections
          #- "2000"

Might want to share your gunicorn configuration- worker class and worker count, etc. — Jonah Benton, Mar 23 '18 at 18:32
Is this issue still valid? If yes, what GKE version are you using? Could you check the logs for anything suspicious? — PjoterS, Feb 15 '21 at 13:55
@PjoterS. Turns out that with Kubernetes, multitasking is at the pod level. Instead of having one big pod with many threads, you can have many smaller pods running. You could experiment on that switch. — Mauricio, Feb 16 '21 at 12:52
@Mauricio Could you elaborate on your comment and post it as an answer? It might help other community members. — PjoterS, Feb 16 '21 at 12:58

PjoterS · Answer 1 · 2021-02-26T13:09:15.677

Posting this as a Community Wiki answer for better visibility.

Unfortunately, I don't have all the information needed to reproduce this scenario exactly (application design, how the tests were executed, environment, etc.). However, based on the OP's comment:

Turns out that with Kubernetes, multitasking is at the pod level. Instead of having one big pod with many threads, you can have many smaller pods running. You could experiment on that switch.

It looks like the OP used HPA based on CPU together with Cluster Autoscaling in the GKE cluster, a solution similar to the one described in the article "App Engine Flex || Kubernetes Engine — ?".
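To make the "many smaller pods" idea a bit more concrete, here is a rough arithmetic sketch. It assumes threaded Gunicorn workers, where the number of requests handled at once is bounded by workers * threads per pod, and it uses the replica formula from the Kubernetes HPA documentation while ignoring its tolerance and stabilization details. The function names, the six-pod layout, and the CPU figures are made up for illustration.

```python
import math


def concurrent_slots(pods, workers, threads):
    """Upper bound on requests handled at once with threaded Gunicorn workers."""
    return pods * workers * threads


def hpa_desired_replicas(current_replicas, current_cpu, target_cpu):
    """ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * current_cpu / target_cpu)


# One big pod with the settings from the question (-w 3 --threads 8).
print(concurrent_slots(pods=1, workers=3, threads=8))  # 24
# Hypothetical alternative: six small pods, one worker with 4 threads each.
print(concurrent_slots(pods=6, workers=1, threads=4))  # 24
# Hypothetical load: 2 pods averaging 90% CPU against a 60% target.
print(hpa_desired_replicas(2, 90, 60))  # 3
```

Both layouts give the same upper bound on paper; the difference is that with small pods the HPA can add or remove replicas as the load changes.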

One important thing worth mentioning is that a lot depends on the scaling type.