nginx + fastCGI + Django - getting data corruption in the responses sent to the client

Question

I am running Django behind nginx using FastCGI. I have discovered that in some of the responses sent to the client, random data corruption is occurring in the middle of the responses (might be a couple hundred bytes or so in the middle).

At this point I have narrowed it down to a bug in either nginx's FastCGI handler or Django's FastCGI handler (i.e. probably a bug in flup), since this problem never occurs when I run the Django server in standalone (i.e. runserver) mode. It only happens in FastCGI mode.

Other interesting trends:

It tends to happen on larger responses. When a client logs in for the first time, they are sent a bunch of 1MB chunks to sync them up to the server DB. After that first sync, the responses are much smaller (usually a few KB at a time). The corruption always seems to happen on those 1MB chunks sent at the start.
It happens more often when the client is connected to the server via LAN (i.e. low-latency, high-bandwidth connection). This makes me think there is some kind of race condition in nginx or flup that is exacerbated by an increased data rate.

Right now, I've had to work around this by putting an extra SHA1 digest in the response header, and having the client reject responses where the header doesn't match the body checksum, but this is kind of a horrible solution.
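The digest workaround described above could look roughly like the following sketch. The header name `X-Body-SHA1` and both helper functions are hypothetical and only illustrate the idea; the actual application code is not shown in the question.

```python
import hashlib

# Hypothetical header name, chosen for illustration only.
DIGEST_HEADER = "X-Body-SHA1"


def add_digest(headers, body):
    """Server side: store the SHA-1 hex digest of the body in the headers."""
    headers[DIGEST_HEADER] = hashlib.sha1(body).hexdigest()
    return headers


def body_is_intact(headers, body):
    """Client side: recompute the digest and compare it with the header."""
    expected = headers.get(DIGEST_HEADER)
    return expected is not None and hashlib.sha1(body).hexdigest() == expected
```

A client would call `body_is_intact(headers, body)` on every response and retry the request when it returns `False`. This detects corruption but does nothing to explain it.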

Has anyone else experienced anything like this, or have any pointers as to how to identify whether it is flup or nginx that is at fault here so I can file a bug with the appropriate team?

Thanks in advance for any help.

Note: I also posted a similar bug in lighttpd + FastCGI + Django a while back here: https://stackoverflow.com/questions/3714489/lighttpd-fastcgi-django-truncated-response-sent-to-client-due-to-unexpected ... even though this isn't the same thing (truncation vs corruption), it's starting to look like the common culprit is flup / Django rather than the web server ..

Edit: I should also note what my environment is:

OSX 10.6.6 on a Mac Mini
Python 2.6.1 (System)
Django 1.3 (from official tarball)
flup 1.0.2 (from Python egg on flup site)
nginx +ssl 1.0.0 (from Macports)

EDIT: In response to Jerzyk's comment, the code path that assembles the response looks like (edited for succinctness):

# This returns an objc NSData object, which is an array.array 
# when pushed through the PyObjC bridge
ret = handler( request ) 

response = HttpResponse( ret )
response[ "Content-Length" ] = len( ret )
return response

I don't think the Content-Length can be wrong based on that, and AFAIK there is no way to mark a Django HttpResponse object as explicitly binary as opposed to text. Also, since the issue happens only intermittently, I don't think that explains it; otherwise you would presumably see it on every request.
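One way to rule out any ambiguity in that code path is to convert the payload to a plain byte string before measuring it, and to declare a binary content type. The sketch below is an assumption-laden illustration, not a confirmed fix: it is written for Python 3 (on Python 2.6 `array.array` offers `tostring()` rather than `tobytes()`), it avoids importing Django so that it stays self-contained, and `to_body_bytes` and `build_headers` are hypothetical helpers.

```python
from array import array


def to_body_bytes(payload):
    """Return the payload as an immutable bytes object."""
    if isinstance(payload, bytes):
        return payload
    if isinstance(payload, array):
        # len(array) counts items, not bytes, so serialize before measuring.
        return payload.tobytes()
    # bytearray, memoryview and other buffer-protocol objects.
    return bytes(payload)


def build_headers(body, content_type="application/octet-stream"):
    """Headers for a binary body; Content-Length is the byte count."""
    return {"Content-Type": content_type, "Content-Length": str(len(body))}
```

In a Django view the resulting bytes and content type would be passed to `HttpResponse`. For an array of single-byte items the item count and byte count coincide, so this only removes a source of doubt.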

EDIT @ionelmc: You have to set the Content-Length in Django; nginx does not set it for you. The example below shows the response once I disabled setting Content-Length explicitly:

$ curl -i http://localhost/io/ping
HTTP/1.1 200 OK
Server: nginx/1.0.0
Date: Thu, 23 Jun 2011 13:37:14 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive

AKSJDHAKLSJDHKLJAHSD

If the initial chunks do not change often or are not user specific maybe writing to disk and serving directly via nginx is a better way? — sunn0, Oct 22 '10 at 21:32
Unfortunately, the chunks are both user-specific and frequently changing, so no caching of that kind would be appropriate for this application. I'm also keen to find out what is actually causing this data corruption rather than just working around it (which I am already doing with the extra SHA1 digest in the header). — glenc, Oct 23 '10 at 04:02
I can think of two possible reasons: wrong encoding - HttpResponse as text vs. binary or wrong headers (especially content-length) — Jerzyk, Apr 02 '11 at 17:50
@Jerzyk - thanks for the response ... please see my additional edits in my original post above for my reply. — glenc, Apr 04 '11 at 16:14
@glenc what is the content-type for this response? If it is binary, can you try setting it (e.g. mimetype='application/x-ms-excel' or similar)? — Jerzyk, Apr 04 '11 at 19:32
Ahhh ... this might be something. The response body is actually a blowfish-encrypted binary blob, but since I wasn't explicitly setting any content-type, it was defaulting to text/html. I'll try setting this and see if it fixes the issue. Unfortunately it only manifests rarely so it'll probably take a little while to confirm that it's really gone. Thanks for the suggestion! — glenc, Apr 07 '11 at 15:18
You shouldn't set the content-length - the gateway will handle that. I suspect this is causing you problems. — ionelmc, Jun 23 '11 at 13:12
You have to do this if you want it passed back to the client, as per edit in the original post — glenc, Jun 23 '11 at 13:43
You don't need to set the content-length if your Transfer-Encoding is chunked. RFC 2616 (section 4.4) explicitly forbids this: "The Content-Length header field MUST NOT be sent if these two lengths are different (i.e., if a Transfer-Encoding header field is present)." — ionelmc, Jun 23 '11 at 14:01
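To make the chunked-encoding point in the comments concrete: with `Transfer-Encoding: chunked`, each chunk carries its own length in hexadecimal and a zero-length chunk ends the body, so the receiver does not need a Content-Length header to find the end of the message. The following is a minimal Python 3 sketch of that framing. It ignores trailers and is not how nginx implements it; both function names are made up for illustration.

```python
def encode_chunked(body, chunk_size=4096):
    """Frame a byte string using HTTP/1.1 chunked transfer coding."""
    out = bytearray()
    for start in range(0, len(body), chunk_size):
        chunk = body[start:start + chunk_size]
        # Each chunk: hex size, CRLF, data, CRLF.
        out += format(len(chunk), "x").encode("ascii") + b"\r\n"
        out += chunk + b"\r\n"
    # A zero-size chunk followed by an empty line terminates the body.
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(data):
    """Reassemble the body from chunked framing (trailers are ignored)."""
    body = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        # Drop any chunk extension after ';' before parsing the hex size.
        size = int(data[pos:eol].split(b";")[0].decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            return bytes(body)
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
```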

score 1 · Answer 1 · answered Oct 31 '11 at 14:52

Do you have any kind of nginx caching ( bypass / no_cache ) directive active for the fastcgi responses?

The nginx 1.0.3 changelog lists a fix for a response corruption bug:

Bugfix: a cached response may be broken if "proxy/fastcgi/scgi/uwsgi_cache_bypass" and "proxy/fastcgi/scgi/uwsgi_no_cache" directive values were different; the bug had appeared in 0.8.46.

Source: http://nginx.org/en/CHANGES ( 1.0.3. section )

score 0 · Answer 2 · edited Sep 30 '15 at 16:19

One way to troubleshoot this case a bit more would be to:

have nginx and django running on different hardware (so that you can easily capture the traffic)
capture the traffic on both hops, client -> nginx and nginx -> django (e.g. with wireshark)

Once you detect an error on the client side (based on the sha1), go to the network capture, look at the recorded (TCP) stream, and try to determine whether the corruption is introduced by nginx or comes (directly) from django.
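Once the body has been extracted from each capture as raw bytes, a small helper can show exactly where a corrupted copy deviates from a known-good one. This is a hypothetical sketch in Python 3; `corrupted_ranges` is an illustrative name, and extracting the bodies from the capture is assumed to have been done separately.

```python
def corrupted_ranges(expected, received):
    """Return half-open (start, end) byte ranges where the bodies differ."""
    ranges = []
    start = None
    common = min(len(expected), len(received))
    for i in range(common):
        if expected[i] != received[i]:
            if start is None:
                start = i  # a run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, common))
    # A length mismatch (truncation or extra data) is reported as a final range.
    if len(expected) != len(received):
        ranges.append((common, max(len(expected), len(received))))
    return ranges
```

Running it against the nginx -> django stream and then the client -> nginx stream shows which hop first contains the damaged range. Note that bytes inside a damaged region that happen to match the original will split it into several shorter ranges.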

score 0 · Answer 3 · answered Aug 12 '11 at 11:32

Perhaps the occasional corruption only happens if the output contains at least one non-ASCII (multi-byte UTF-8) character.

Content-Length and string length are not the same thing: Content-Length counts bytes, and a single non-ASCII character takes 2 to 4 bytes in UTF-8.
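The gap between character count and byte count is easy to demonstrate. The snippet below is a small Python 3 illustration with made-up sample strings; it does not reproduce the asker's data, which is a binary blob rather than text.

```python
def utf8_length(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


sample = "na\u00efve caf\u00e9"       # two accented letters, 2 bytes each
char_count = len(sample)              # counts code points
byte_count = utf8_length(sample)      # what Content-Length must report

euro_bytes = utf8_length("\u20ac")        # euro sign
emoji_bytes = utf8_length("\U0001F600")   # character outside the BMP
```

Here `char_count` is 10 while `byte_count` is 12, and the euro sign and the emoji take 3 and 4 bytes respectively. A Content-Length computed from the character count would therefore be too short for such a body.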

Andy Lee Robinson


Hmmmm .. while this is true it doesn't seem likely to be the cause because the corruption was happening in the middle of the data chunks and was not simply a case of missing data at the end. – glenc Oct 12 '11 at 22:51

score 0 · Answer 4 · answered Sep 27 '19 at 09:58

I had a very similar issue which was plaguing me for as long as I've had this setup. Like you, I use FastCGI, Nginx and macOS, and found random corruption in the middle of large requests (it was about 2% of requests of a 1.5 MB document).

I was able to solve my problem by switching the FastCGI connection between PHP-FPM (in my case) and Nginx from TCP to a Unix socket. I don't know which piece of the puzzle is responsible for the corruption, but avoiding the internal TCP connection did fix it.
