
As the question title suggests, I'm having a hard time figuring out what can be improved in my application (or tuned in the OS, Ubuntu) to achieve acceptable performance. But first, I'll explain the architecture:

The front-end server is an 8-core machine with 8 GB of RAM running Ubuntu 12.04. The application is written entirely in JavaScript and runs on Node.js v0.8.22 (some modules seem to complain on newer versions of Node). I use nginx 1.4 to proxy HTTP traffic from ports 80 and 443 to 8 Node workers that are managed and started using the Node cluster API. I use the latest version of socket.io, 0.9.14, to handle the WebSocket connections, on which I've enabled only websocket and xhr-polling as available transports. On this machine I also run an instance of Redis (2.2).
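For context, the worker setup is roughly shaped like this (a simplified sketch rather than my actual code; the port and the request handler are placeholders):

```javascript
// Minimal sketch of the setup described above: the cluster master forks one
// worker per core, and each worker runs an HTTP + socket.io (0.9) server behind nginx.
var cluster = require('cluster');
var http = require('http');
var os = require('os');

if (cluster.isMaster) {
  // 8 cores on this box
  for (var i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  var server = http.createServer(function (req, res) {
    res.end('ok'); // real request handling goes here
  });

  var io = require('socket.io').listen(server);
  // socket.io 0.9 configuration: allow only the two transports mentioned above
  io.set('transports', ['websocket', 'xhr-polling']);

  server.listen(8000); // placeholder port; nginx proxies 80/443 to the workers
}
```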

I store persistent data (like users and scores) on a second server running MongoDB (3.6), with 4 GB of RAM and 2 cores.

The app has been in production for a few months (it ran on a single box until a few weeks ago) and is used by around 18k users per day. It has always worked very well, apart from one main issue: performance degradation. With use, the amount of CPU used by each process grows until it saturates the worker (which then won't serve requests anymore). I have temporarily worked around this by checking the CPU used by each worker every minute and restarting it if it reaches 98%. So the problem here is mainly CPU, not RAM.

RAM hasn't been an issue since I updated to socket.io 0.9.14 (the earlier version was leaking memory), so I doubt this is a memory leak, especially because now it is the CPU that grows fairly quickly (I have to restart each worker around 10-12 times a day!). To be honest, the RAM in use grows as well, but very slowly: about 1 GB every 2-3 days of use. The strange thing is that it is not released even when I completely restart the whole application; it is only released if I reboot the server! This I cannot really understand...
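In case it helps, the restart workaround is essentially the following (a simplified sketch, not my exact code; sampling CPU through `ps` is just one way to do it, and the names are made up):

```javascript
// In the cluster master: poll each worker's CPU usage once a minute via ps
// and recycle any worker that crosses the threshold.
var cluster = require('cluster');
var exec = require('child_process').exec;

function checkWorkers() {
  Object.keys(cluster.workers).forEach(function (id) {
    var worker = cluster.workers[id];
    exec('ps -p ' + worker.process.pid + ' -o %cpu=', function (err, stdout) {
      if (err) return;
      var cpu = parseFloat(stdout);
      if (cpu >= 98) {
        console.log('worker ' + worker.process.pid + ' at ' + cpu + '% CPU, restarting');
        worker.destroy(); // node 0.8 API; later versions call this worker.kill()
        cluster.fork();   // replace it with a fresh worker
      }
    });
  });
}

setInterval(checkWorkers, 60 * 1000);
```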

I've now discovered nodefly, which is amazing, so I can finally see what's happening on my production server, and I've been collecting data for a couple of days. If anyone wants to see the charts I can give you access, but basically I can see that I have between 80 and 200 concurrent connections! I was expecting Node.js to handle thousands of requests, not hundreds. Also, the average response time for HTTP traffic floats between 500 and 1500 milliseconds, which I think is really a lot. At this very moment, with 1300 users online, this is the output of `ss -s`:

Total: 5013 (kernel 5533)
TCP:   8047 (estab 4788, closed 3097, orphaned 139, synrecv 0, timewait 3097/0), ports 0

Transport Total     IP        IPv6
*         5533      -         -
RAW       0         0         0
UDP       0         0         0
TCP       4950      4948      2
INET      4950      4948      2
FRAG      0         0         0

which shows that I've got a lot of closed connections in TIME_WAIT. I've increased the maximum open files to 999999; here is the output of `ulimit -a`:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63724
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 999999
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 63724
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

So I thought the problem could be HTTP traffic that for some reason saturates the available ports/sockets(?), but one thing does not make sense to me: why, when I restart the workers and all the clients reconnect within a few seconds, does the CPU load of each worker drop to 1% and stay capable of serving requests properly until it saturates again after about an hour (at peak time)?

I'm mainly a JavaScript programmer, not a sysadmin, so I don't know how much load I should expect my servers to handle, but surely it's not performing as it should. The application is otherwise stable, and this last problem is preventing me from shipping the mobile versions of the app, which are ready, as they will obviously bring more load and eventually crash the whole thing!

Hopefully there is something obvious that I'm doing wrong and someone will help to spot it... Feel free to ask me for more info, and I'm sorry for the length of the question, but I believe it was necessary... Thanks in advance!

Franjanko
  • Is there any way to get something like a thread dump from node.js? There are probably some threads in an infinite loop. Also, what is actually using CPU? What do you see in `top` when CPU usage is close to 100%? – rvs May 02 '13 at 22:37
  • The CPU is used entirely by Node.js; when I run top I see the node processes taking all the CPU. Not sure how I can output a thread dump from Node, to be honest... – Franjanko May 02 '13 at 22:45
  • Another thing to point out is that the majority of the CPU time seems to go to system time, not user time – Franjanko May 02 '13 at 23:01
  • Does anyone at least know how many concurrent connections I should be able to handle with the servers I've got in place? At the moment I support 200 concurrent connections max. This will help me estimate how far I am from an optimal configuration... thanks. – Franjanko May 06 '13 at 13:20

3 Answers


After a few days of intense trial and error, I'm glad to be able to say that I've understood where the bottleneck was, and I'll post it here so that other people can benefit from my findings.

The problem lies in the pub/sub connections that I was using with socket.io, and in particular in the RedisStore used by socket.io to handle inter-process communication of socket instances.

After realizing that I could easily implement my own version of pub/sub using Redis, I decided to give it a try and removed the RedisStore from socket.io, leaving it with the default memory store (I don't need to broadcast to all connected clients, only between 2 different users who may be connected to different processes).

Initially I declared only 2 global Redis connections per process to handle the pub/sub for every connected client, and the application was using fewer resources, but I was still affected by constant CPU usage growth, so not much had changed. But then I decided to try creating 2 new connections to Redis for each client, to handle their pub/sub only for their own session, and to close those connections once the user disconnected. After one day of usage in production, the CPUs were still at 0-5%... bingo! No process restarts, no bugs, and the performance I was expecting to have. Now I can say that Node.js rocks, and I'm happy to have chosen it for building this app.
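To give an idea of what I mean, here is a simplified sketch of the per-client approach (not my actual code; it assumes the node_redis client, and the channel and event names are invented for the example):

```javascript
// Each connected socket gets its own subscriber and publisher connection to Redis,
// scoped to that user's channel; both are closed when the user disconnects.
var redis = require('redis');

io.sockets.on('connection', function (socket) {
  var sub = redis.createClient(); // dedicated subscriber for this client
  var pub = redis.createClient(); // separate connection for publishing

  socket.on('login', function (userId) {
    sub.subscribe('user:' + userId); // listen only to this user's channel
  });

  // relay messages published by other processes to this client's browser
  sub.on('message', function (channel, message) {
    socket.emit('private message', JSON.parse(message));
  });

  // deliver a message to another user, whichever process they are connected to
  socket.on('private message', function (toUserId, payload) {
    pub.publish('user:' + toUserId, JSON.stringify(payload));
  });

  // close both Redis connections once the user disconnects
  socket.on('disconnect', function () {
    sub.quit();
    pub.quit();
  });
});
```

Two connections per client are needed because a Redis connection in subscriber mode cannot issue regular commands such as PUBLISH.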

Fortunately, Redis has been designed to handle many concurrent connections (unlike Mongo), and by default the limit is set at 10k, which leaves room for around 5k concurrent users on a single Redis instance (2 connections each). That is enough for me for the moment, but I've read that it can be pushed up to 64k concurrent connections, so this architecture should be solid enough, I believe.

At this point I was thinking of implementing some sort of connection pool for Redis to optimize it a little further, but I'm not sure that wouldn't cause the pub/sub events to build up on the connections again, unless each of them is destroyed and recreated every time to keep it clean.

Anyway, thanks for your answers; I'll be curious to know what you think and whether you have any other suggestions.

Cheers.

Franjanko
    I am having what appears to be the same problem in my production app, also new to the server admin role. I follow what you did in concept, but I have some questions about how to do it - perhaps you could provide a link to some resource in your accepted answer? Or simply provide more information? In particular about "But then I decided to try to create 2 new connections to redis for each client to handle their pub/sub only on their sessions, then close the connections once the user disconnected." – toblerpwn Jul 30 '13 at 03:25

Do you have some source code you can share? Could it be database connections that are never closed, or processes waiting for HTTP connections that never close?

Can you post some logs?

Do a `ps -ef` and make sure nothing is still running. I have seen web processes leave zombies that won't die until you do a `kill -9`. Sometimes shutdown doesn't work, or doesn't work fully, and those threads or processes will hold RAM and sometimes CPU.

It could be an infinite loop somewhere in the code, or a crashed process holding on to a DB connection.

What NPM modules are you using? Are they all the latest versions?

Are you catching exceptions? See: http://geoff.greer.fm/2012/06/10/nodejs-dealing-with-errors/ See: https://stackoverflow.com/questions/10122245/capture-node-js-crash-reason
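If not, a minimal last-resort handler along these lines (just a sketch) will at least tell you why a worker dies; log the error and exit so whatever supervises the process can start a clean one:

```javascript
// Log uncaught exceptions and exit rather than continuing in an unknown state;
// the supervisor (forever, the cluster master, ...) restarts the process.
process.on('uncaughtException', function (err) {
  console.error('Uncaught exception:', err.stack || err);
  process.exit(1);
});
```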

General Tips:

http://clock.co.uk/tech-blogs/preventing-http-raise-hangup-error-on-destroyed-socket-write-from-crashing-your-nodejs-server

http://blog.nodejitsu.com/keep-a-nodejs-server-up-with-forever

http://hectorcorrea.com/blog/running-a-node-js-web-site-in-production-a-beginners-guide

https://stackoverflow.com/questions/1911015/how-to-debug-node-js-applications

https://github.com/dannycoates/node-inspector

http://elegantcode.com/2011/01/14/taking-baby-steps-with-node-js-debugging-with-node-inspector/

Tim Spann

Not an answer per se, as your question is more of a story than a single-answer question.

Just to say that I successfully built a Node.js server with socket.io handling over 1 million persistent connections, with an average message payload of 700 bytes.

The 1 Gbps network interface card was saturating at the beginning, and I was seeing a LOT of I/O wait from publishing events to all clients.

Removing nginx from the proxy role also gave back precious memory, because reaching one million persistent connections with only ONE server is a tough job of tweaking configs and the application, and tuning OS parameters. Keep in mind that it's only doable with a lot of RAM (around 1M websocket connections eats about 16 GB of RAM with Node.js; I think sock.js would be ideal for low memory consumption, but for now socket.io consumes that much).

This link was my starting point for reaching that volume of connections with Node. Although it describes an Erlang app, all the OS tuning is pretty much application agnostic and should be of use to anyone aiming at a lot of persistent connections (websockets or long-polling).

HTH,

Marcel