
double uploads

Ever since we went from a simple Apache instance to a load-balanced environment, POST requests sometimes get repeated. We are running nginx as a reverse proxy: static content is served by nginx itself, and dynamic content is served by two Apache backends.

I have checked that this is not an interface or user error. A small example: a simple image upload results in the image being uploaded twice. The POST isn't sent twice by double-clicking or some other user mistake. I have not found any evidence that the browser sends the request twice, so my suspicion is on the server side. (Note that this is suspicion only.) Most of these requests are internal, i.e. from employees, so I can verify how they come about.

The only thing 'wrong' I can find is that nginx logs a 499 error in these cases. I'm not sure, however, whether this is the cause or just a side effect of the problem. (I'm aware that 499 isn't a standard HTTP status; it's an nginx-specific status meaning "client has closed the connection".)

requests

The repeated POST requests are almost all requests that can take a while to complete. The one I'm showing here as an example is a simple image upload, but the script does some work in the background (the image must be converted into different formats/sizes, distributed to both servers, etc.).

logs

An example is the upload of an image. nginx logs one 499 and one 200 request, but Apache receives (and handles!) two requests.

Apache

[17:17:37 +0200] "POST ***URL** HTTP/1.0" 200 9045
[17:17:47 +0200] "POST ***URL** HTTP/1.0" 200 20687

nginx

[17:17:47 +0200] "POST ***URL** HTTP/1.1" 499 0 
[17:17:52 +0200] "POST ***URL** HTTP/1.1" 200 5641

Suspicions

Bigger/slower uploads seem to suffer from this more, so I suspect a timeout. I have tried to read up on the 499 error; the consensus seems to be that it means "client closed connection". That could be happening behind the scenes, but I'm not sure why it would lead to a second request being issued, and there is certainly nothing like "user closed the browser" going on.

Breaking up slower POST requests currently seems to help (if there are multiple things to do, make the user choose one and POST a second time for the other), but this may just be lowering the chance that it occurs. I'm not sure.

This is obviously a temporary workaround. If it is a timeout, I need to find out where and increase the corresponding numbers, but I'm not sure why a timeout would cause this behaviour: I'd expect a "well, that went wrong" message, not a repeat.
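
For reference, these are the timeouts in this chain that I would look at first; the directives are standard nginx settings, but the values below are purely illustrative and not our current configuration:

proxy_connect_timeout  60;    # time allowed to establish the connection to the backend
proxy_send_timeout     60;    # time allowed between two successive writes of the request to the backend
proxy_read_timeout     60;    # time allowed between two successive reads of the response from the backend
client_body_timeout    60;    # time allowed between two successive reads of the request body from the client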

Questions

I'm looking to find out what process/situation can cause a POST to be repeated. Of course, any "not sure why, but it will be fixed by increasing this timeout" is great as well.

nginx configurations

nginx.conf

user  nginx;
worker_processes  2;
worker_rlimit_nofile 10240;

error_log  /var/log/nginx/error.log error;
pid        /var/run/nginx.pid;


events {
    multi_accept on;
    worker_connections  4096;
    use epoll;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    tcp_nodelay     off;    
    client_max_body_size    30m;
    keepalive_timeout  65;
    include /etc/nginx/conf.d/*.conf;
}

conf.d

I have removed some IP-specific lines from the geo parts, as well as the SSL variations, to keep it simple. If needed I can add them back, but it boils down to an extra geo part for the SSL backends, plus corresponding upstreams and conf files; a rough sketch follows the block below.

geo $backend {
    default apache-backend;
}

upstream apache-backend {
    ip_hash;
    server SERVER1 max_fails=3 fail_timeout=30s weight=2;
    server SERVER2 max_fails=3 fail_timeout=30s weight=3;
}
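
To give a rough idea of what was removed, the SSL variant boils down to something like the following; the variable and upstream names here are made up for illustration and are not the literal lines from our config:

geo $backend_ssl {
    default apache-backend-ssl;
    # IP-specific overrides removed here, as in the plain-HTTP geo block above
}

upstream apache-backend-ssl {
    ip_hash;
    server SERVER1 max_fails=3 fail_timeout=30s weight=2;
    server SERVER2 max_fails=3 fail_timeout=30s weight=3;
}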

conf.d/somestring.conf

limit_conn_zone $binary_remote_addr zone=somestring:10m;
server {
    listen ip1:80;
    listen ip2:80;
    server_name name.tld www.name.tld;

    root PATH;

    access_log PATH/name.log main;

    location / {
            proxy_pass              http://$backend;
    }

    # ... some more locations ...

    gzip on;
    gzip_http_version 1.0;
    gzip_comp_level 2;
    gzip_proxied any;
    gzip_min_length 1100;
    gzip_buffers 16 8k;
    gzip_types text/plain text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript;
}

conf.d/proxy.conf

proxy_set_header        Accept-Encoding "";
proxy_set_header        X-Real-IP $remote_addr;
proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header        Host $http_host;

proxy_buffering         on;
proxy_read_timeout      90;
proxy_buffer_size       32k;
proxy_buffers           8 32k;
proxy_busy_buffers_size    32k;
proxy_temp_file_write_size 32k;
Nanne
  • And what are you using for a load balancer? – Michael Hampton Jul 15 '13 at 16:40
  • The nginx reverse proxy is our load balancer; I was trying to make that clearer by replacing 'load balancer' with 'reverse proxy' in that second sentence :) – Nanne Jul 15 '13 at 17:13
  • OK, so what is your nginx configuration? – Michael Hampton Jul 15 '13 at 17:20
  • I didn't include it, as I'm not sure what could be the cause of the repeated request. If you suspect a certain setting in `nginx` can do this, I'd love to hear your thoughts. I'll post the nginx settings in a bit! (I didn't want to make this a "here's the dump of all random settings I could find" question, and as I hope I explained, I'm trying to figure out what part of the chain can be responsible) – Nanne Jul 15 '13 at 18:28
  • (if it takes a while, I'm removing all things like IP's and other url-related stuff) – Nanne Jul 15 '13 at 18:33
  • Does `nginx -T` mention `proxy_next_upstream`? Timestamps don't seem to quite line up? What if you have the browsers hit apache directly? Does tcpdump show it as repeated internal proxy requests only? Newer `nginx` help? – rogerdpack Nov 21 '19 at 20:37
  • As this is a couple of years ago, I can't replicate that exact situation anymore, sorry. The timestamps were for the same request; I'm not sure anymore why they didn't line up: it's either just a bad clock on one of the servers, or the difference between request and result (i.e. the time it took to process). Newer nginx might have helped; that is, our current set-up doesn't have this issue anymore, but loads of things have changed in 6 years :) – Nanne Nov 22 '19 at 08:58

2 Answers

5

Short answer: try this for your location block:

location / {
  proxy_read_timeout 120;
  proxy_next_upstream error;
  proxy_pass http://$backend;
}

Longer explanation:

I think I have just encountered exactly the problem you described:

  • I use nginx reverse proxy as a load balancer
  • for long-running requests, the backend receives the same request multiple times
  • the nginx access logs of the upstream nodes show 499 status for these requests, and the same request appears in different upstream nodes

It turns out that this is actually default behaviour for nginx as a reverse proxy, so upgrading to a newer version will not resolve the issue, even though upgrading was suggested as a possible solution here; that addresses a different problem.

It happens because nginx as a load balancer chooses an upstream node in round-robin fashion. When the chosen node fails, the request is sent to the next node. The important thing to note here is that node failure is by default classed as error or timeout. Since you did not set a proxy_read_timeout, the default is 60 seconds. So after 60 seconds of waiting, nginx picks the next node and sends the same request.
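
Written out explicitly, the defaults that produce this behaviour look roughly like this (you would not normally add these lines; they only show what nginx falls back to when nothing is set):

location / {
    proxy_pass           http://$backend;
    proxy_read_timeout   60s;              # default: wait up to 60 seconds between reads from the backend
    proxy_next_upstream  error timeout;    # default: on an error OR a timeout, retry the request on the next upstream node
}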

So one solution is to increase this timeout so that your long-running operation can complete, e.g. by setting proxy_read_timeout 120; (or whatever limit suits your needs).

Another option is to stop the reverse proxy from trying the next node on a timeout, by setting proxy_next_upstream error;. Or you could set both options, as suggested above.

pymkin
  • Hmm, interesting! There doesn't seem to be anything wrong with your answer, but in my specific case I'm troubled by the fact that disabling SPDY did fix it; we do have a good number of long-running processes that do not show this behaviour (anymore). Maybe disabling SPDY removed some sort of secondary behaviour that coincidentally fixed this for us, and this method might even be better? We haven't had this problem for a long time now. I'll see what our current settings are, and whether we can incorporate this tip somewhere, thanks! – Nanne Jun 13 '16 at 14:32
  • I too saw 499s and then 502s once; it turned out to be "mere coincidence", though, see here: http://stackoverflow.com/a/58924751/32453 (as a note, NGINX isn't supposed to send POSTs on to the next server by default, FWIW...) – rogerdpack Nov 21 '19 at 20:38
1

From this forum topic we learned that the culprit might be SPDY. For that user, disabling it appears to have been the solution, and we have not had double POSTs since disabling it either.

The exact cause, other than "SPDY did it", is unknown at this point. The side effect of the proposed solution (disabling SPDY) is obviously "no more SPDY", but we can live with that.
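
For completeness: disabling SPDY typically just means dropping the spdy parameter from the HTTPS listen directives, roughly like this (the line below is illustrative; our SSL server blocks are not shown in the config above):

# before
listen ip1:443 ssl spdy;
# after
listen ip1:443 ssl;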

Until the bug pops up again, I'll call this a 'fix' (or at least a solution to the problem).

Edit: As of 25-02-2014 we have not seen this problem come up anymore, so this does seem like a lasting solution.

Nanne