I have a system hitting a number of backend services through four loadbalanced varnish caches. Approximately 0.1% of requests are being truncated because varnish is returning an incorrect content-length header. This error happens across all the caches and various backends.
Varnish is configured to not cache anything without valid cache-control headers and these specific requests are not cached. The client does not send a Accept-Encoding header so Varnish requests a gzipped response from the back end and sends an uncompressed version to the client.
From varnishlog all the backend requests look the same:
BerespStatus 200
BerespReason OK
BerespHeader Content-Encoding: gzip
BerespHeader Content-Length: 1001
BerespHeader Content-Type: application/json
BerespHeader Date: Fri, 08 Feb 2019 14:59:06 GMT
BerespHeader Vary: Accept-Encoding
TTL RFC 0 10 0 1549637947 1549637947 1549637946 0 0 cacheable
VCL_call BACKEND_RESPONSE
TTL VCL 120 10 0 1549637947 cacheable
TTL VCL 120 10 0 1549637947 uncacheable
VCL_return deliver
Filters testgunzip
Storage malloc Transient
Fetch_Body 3 length stream
Gzip u F - 1001 3792 80 7906 7944
BackendReuse 54 default
Length 1001
I believe that the line Gzip u F - 1001 3792 80 7906 7944
is stating that the 1001 byte zipped response is uncompressed to be 3792 bytes.
There are three different types of response. About 50% set the content-length to 3792:
VCL_call RECV
ReqHeader Surrogate-Capability: key=ESI/1.0
VCL_return hash
VCL_call HASH
VCL_return lookup
HitMiss 1052359 119.885211
VCL_call MISS
VCL_return fetch
Link bereq 1052362 fetch
Timestamp Fetch: 1549633459.720553 0.004765 0.004765
RespProtocol HTTP/1.1
RespStatus 200
RespReason OK
RespHeader Content-Encoding: gzip
RespHeader Content-Length: 1000
RespHeader Content-Type: application/json
RespHeader Date: Fri, 08 Feb 2019 13:44:19 GMT
RespHeader Vary: Accept-Encoding
RespHeader Age: 0
RespHeader Via: 1.1 varnish (Varnish/6.1)
VCL_call DELIVER
RespHeader X-Cache: MISS
RespHeader X-Cache-Hits: 0
VCL_return deliver
Timestamp Process: 1549633459.720567 0.004779 0.000014
RespUnset Content-Encoding: gzip
RespHeader Accept-Ranges: bytes
RespUnset Content-Length: 1000
RespHeader Content-Length: 3792
Gzip U D - 1001 3792 80 7906 7944
About 50% without a content-length header but with transfer-encoding chunked instead:
Same as above except
...
RespUnset Content-Length: 1000
RespHeader Transfer-Encoding: chunked
Gzip U D - 1001 3792 80 7906 7944
And the 0.1% that breaks the client because of an incorrect content-length:
...
RespUnset Content-Length: 1001
RespHeader Content-Length: 208
Gzip U D - 1001 3792 80 7906 7944
Does anybody have any thoughts as to the cause of this? Any suggestions for next step in debugging?
My VCL doesn't do anything with content-length headers. I'm running varnish 6.1 (I'm considering dropping down to the stable version 6.0.2 but haven't tried that yet)
EDIT:
Downgraded to 6.0.2 and am still seeing the same issue.
EDIT 2:
I have disabled varnish from upgrading a request to gzip with the command switch -p "http_gzip_support=off"
and that has fixed the issue at the expense of non-zipped requests still being non-zipped when they hit the backend, but that's good enough for now.