I just spent a solid 8 hours tracking this down and I still don't quite understand what's going on, so I'm hoping someone can explain this to me or confirm it is a bug so I can open a ticket with the Apache devs to let them know.
I have an Ubuntu 20.04.2 LTS / Apache 2.4.41 / mod_wsgi 4.6.8 / Django 3.1 webserver that recently started having intermittent chunked-encoding issues. I was receiving "ERR_INCOMPLETE_CHUNKED_ENCODING" in the Chrome dev console. It only affected one particularly large static .js file (~220 KB), so I assumed the size was why only that file was affected, and tried all the known fixes I could find.
The most promising was disabling gzip encoding for that file, which at least changed the error, but then I just got "ERR_CONTENT_LENGTH_MISMATCH" instead. At that point I realized that the amount of the .js file that actually got downloaded varied between request attempts. Tracking the packets with Wireshark, it looked like the server sometimes just stopped sending early, so my client closed the connection and reported the content length mismatch. Every once in a blue moon the whole .js file would download properly, but most times it didn't, and the page wouldn't load properly without it.
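For anyone who wants to check for the same symptom without a browser, here is a minimal sketch of the kind of test that shows it: request the file several times and compare the declared Content-Length with the bytes that actually arrive. The host name and path are placeholders I made up, not my real ones, and this is only an illustration of the idea, not the exact procedure I followed.

```python
import http.client


def fetch_length(host, path, port=80, timeout=30):
    """Fetch `path` once; return (declared Content-Length or None, bytes received)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # Ask for an uncompressed body so Content-Length matches what arrives.
        conn.request("GET", path, headers={"Accept-Encoding": "identity"})
        resp = conn.getresponse()
        header = resp.getheader("Content-Length")
        declared = int(header) if header is not None else None
        try:
            received = len(resp.read())
        except http.client.IncompleteRead as exc:
            # The server stopped sending early; exc.partial is what did arrive.
            received = len(exc.partial)
        return declared, received
    finally:
        conn.close()


def is_truncated(declared, received):
    """True when a Content-Length was declared and fewer bytes arrived."""
    return declared is not None and received < declared


if __name__ == "__main__":
    # Hypothetical host and path; replace with the server and file under test.
    for attempt in range(1, 11):
        declared, received = fetch_length("internal.example", "/static/app.js")
        status = "TRUNCATED" if is_truncated(declared, received) else "ok"
        print(f"attempt {attempt}: declared={declared} received={received} {status}")
```

If the received count changes from one attempt to the next while the declared length stays the same, that matches what I saw.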
To skip to the point: I finally discovered that the culprit seemed to be that I had disabled the default server Timeout. I had set:
Timeout -1
...in apache2.conf to disable the timeout and address an earlier issue (yes, this isn't necessarily the best solution from a security standpoint, but this is an internal-only system and it was the most efficient fix at the time). Setting it back to any regular value (e.g. 300, which is what Ubuntu's stock apache2.conf ships with; Apache's own compiled-in default is 60) fixed the chunked-encoding issue.
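In case it helps someone check their own setup, here is a small sketch that scans Apache config text for Timeout directives and flags values of zero or below. Treating those values as suspicious is my own assumption based on what happened here, not something I found in the Apache docs, and the function names are just illustrative.

```python
import re

# Matches an uncommented "Timeout <integer>" line; directive names are
# case-insensitive. Lines such as "KeepAliveTimeout 5" or "# Timeout 300"
# do not match because the directive must start the line.
TIMEOUT_RE = re.compile(r"^\s*Timeout\s+(-?\d+)\s*$", re.IGNORECASE)


def find_timeouts(config_text):
    """Return a list of (line_number, value) for each Timeout directive."""
    found = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        match = TIMEOUT_RE.match(line)
        if match:
            found.append((lineno, int(match.group(1))))
    return found


def suspicious(timeouts):
    """Keep only entries whose value is zero or negative (my assumption)."""
    return [(lineno, value) for lineno, value in timeouts if value <= 0]


if __name__ == "__main__":
    # The path below is where Ubuntu keeps the main config file.
    with open("/etc/apache2/apache2.conf", encoding="utf-8") as handle:
        for lineno, value in suspicious(find_timeouts(handle.read())):
            print(f"line {lineno}: Timeout {value}")
```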
I have no idea why this fixed my issue, so I figured I'd see if anyone here knows why this could have caused the behavior I saw - otherwise I'll open a bug ticket with Apache to see if I can't save someone else the trouble I went through.
The only thing I can think of is that there is a separate timeout setting for chunked transfer encoding, and the -1 value used to disable the global timeout somehow affects that lower-level timeout improperly. Anyway, any direction is appreciated.
Thanks.