
We are facing a problem with intermittent 502 errors, and tracking down the cause has been a very frustrating exercise. We can reproduce the problem by sending several simultaneous requests in quick succession. The catch is that "several" means only 10 to 20 requests within 5 seconds (not a typo), so this type of load should clearly be handled easily.

We really like the Nginx + Tornado approach but are considering moving to a more traditional (e.g. threaded) approach because this problem has been so difficult to solve. I was wondering if you a) know how to fix this issue and b) know how we can track down the culprit(s).

The log files simply indicate that a connection was refused. We have the same problem as this post: https://stackoverflow.com/questions/2962439/how-do-i-debug-a-http-502-error

But no answer is provided there on how to solve the problem, so I'm hoping you can help, since this may be a common issue with this type of setup.

Thanks in advance,

Paul

PlaidFan
  • Have you tried putting the same load on a single Tornado instance? I'm guessing you're using one frontend NGINX server mapping to multiple Tornado instances as recommended in the manual, but make sure all long database queries are issued as async requests so they don't lock up the web threads. – Smudge Aug 01 '11 at 08:12
  • This does not sound like normal behavior, so I'm guessing there is a config problem somewhere. Are the connections actually refused, or is NGINX timing out on connect (which would suggest the applications are blocking the web threads in Tornado)? We run a server with 8 Tornado threads behind NGINX that easily handles over 100 requests a second, so even on a low-spec server you should manage at least 10 rps. – Smudge Aug 01 '11 at 08:15
  • Thanks. We are looking into the database side right now. We use the SQLAlchemy pool for our connections to MySQL. We were thinking that since each request is handled in its own sub-process, the database connections would also operate within a sub-process, which would prevent blocking. But that may not be the case. – PlaidFan Aug 01 '11 at 14:39
  • You might have more luck on the TornadoWeb mailing list http://groups.google.com/group/python-tornado – Smudge Aug 01 '11 at 14:44
  • I think we found the root cause of this issue. It appears that if the NGINX conf file has 4 upstream servers listed but only one Tornado server is running at the time, it will raise 502 errors. We couldn't identify a setting for the number of failed server connection attempts, but it appears that we were able to eliminate the problem by matching the server count in the NGINX conf file to the number of running Tornado servers (a sketch of this kind of upstream block is shown below the comments). – PlaidFan Aug 01 '11 at 15:56
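
To make the situation from the last comment concrete, here is a minimal sketch of the kind of upstream block being described. The upstream name and port numbers are assumptions for illustration; they are not taken from the asker's actual configuration.

# Hypothetical upstream block: nginx expects a Tornado process listening on each of these ports.
upstream tornado_backends {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
    server 127.0.0.1:8004;
}

If only one of the four Tornado processes is actually running, requests routed to the dead ports are refused and can surface to the client as 502 errors, which matches what the comment above describes.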

1 Answer


By default, nginx is not configured to retry a request on another upstream if one of them sends back a 502 error. You basically need to add this:

proxy_next_upstream error timeout http_502;

To your configuration. This will prevent the 502 errors from being sent straight back to the client and will instead cause nginx to look for another upstream. According to this post, it will attempt all of the upstreams before failing back to the client:

http://forum.nginx.org/read.php?2,152071,152212

Here are more details on the configuration directive:

http://wiki.nginx.org/HttpProxyModule#proxy_next_upstream
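
For illustration, here is a minimal sketch of where the directive would sit, assuming an upstream block like the one sketched under the comments above. The listen port, upstream name, and proxy settings are placeholders, not taken from the asker's setup.

server {
    listen 80;

    location / {
        proxy_pass http://tornado_backends;
        # On a connection error, a timeout, or a 502 from the chosen backend,
        # pass the request to the next server in the upstream group.
        proxy_next_upstream error timeout http_502;
    }
}

With this in place, a backend that is down or answering with a 502 does not immediately surface to the client as long as another upstream can serve the request.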

polynomial