
I've been running into an issue where a Rails app server (nginx/puma) and a PostgreSQL database server communicate reliably when both sit on the same VLAN in our DMZ. However, when the database is isolated on another VLAN while the app server remains in the DMZ, users hitting the app server eventually run into 504 (Gateway Timeout) errors from nginx. These timeouts do not seem related to actual end-user load (under-allotted connections, exhausted connection pools, etc.), as the issue can occur on a weekend, when almost certainly no users are in the system. From the first 504 Gateway Timeout onward, all subsequent requests error out with further 504 pages. I would suspect a suboptimal connection configuration on my part, except that when both servers are on the same DMZ and not connecting through a firewall, the whole thing works. In the "bad" configuration, connections work, but only for a variable period of time, usually an hour or so.

Puma configuration is as follows:

#!/usr/bin/env puma

directory "/var/www/my_app/current"
preload_app!
environment "production"
daemonize true
pidfile  "/var/www/my_app/shared/tmp/pids/my_app.pid"
state_path "/var/www/my_app/shared/puma/my_app.state"
stdout_redirect '/var/www/my_app/shared/log/production.log', '/var/www/my_app/shared/log/production_err.log', false
threads 0, 16
bind "unix:///var/www/my_app/shared/tmp/sockets/my_app.sock"
workers 8

on_worker_boot do
  require "active_record"
  begin
    ActiveRecord::Base.connection.disconnect!
  rescue ActiveRecord::ConnectionNotEstablished
    # no connection to close yet
  end
  ActiveRecord::Base.establish_connection(YAML.load_file("/var/www/my_app/current/config/database.yml")["production"])
end

before_fork do
  begin
    ActiveRecord::Base.connection.disconnect!
  rescue ActiveRecord::ConnectionNotEstablished
  end
end

Nginx configuration is as follows:

upstream my_app {
server unix:///var/www/my_app/current/tmp/sockets/my_app.sock;
}

server {
        listen 80 default;
        listen [::]:80 default;
        return 301 https://$host$request_uri;
}


server {
        listen 443 ssl default;
        listen [::]:443 ssl default;
        server_name my_server.domain.com;
        add_header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload";

        root /var/www/my_app/current/public;

        ssl_certificate /etc/ssl/certs/my_app_crt;
        ssl_certificate_key /etc/ssl/private/my_app_key;

        ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA';

        ssl_prefer_server_ciphers on;
        #See https://weakdh.org/
        ssl_dhparam /etc/ssl/private/dhparams.pem;

        client_max_body_size 500M;

        location / {

                if (-f $document_root/maintenance.html) {
                        return 503;
                }

                proxy_pass http://my_app; # match the name of upstream directive which is defined above
                proxy_set_header Host $host;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto https;
        }

        location ~* ^/assets/ {
                # Per RFC2616 - 1 year maximum expiry
                expires 1y;
                add_header Cache-Control public;

                # Some browsers still send conditional-GET requests if there's a
                # Last-Modified header or an ETag header even if they haven't
                # reached the expiry date sent in the Expires header.
                add_header Last-Modified "";
                add_header ETag "";
                break;
        }

        error_page 503 @maintenance;

        location @maintenance {
                rewrite ^(.*)$ /maintenance.html break;
        }

}

I'm thinking the firewall may be the problem, but we see nothing about blocked connections in our Palo Alto firewall's logs. We've tried allowing only the postgresql application, and then broadening to all TCP traffic on port 5432, and the issue persists. The postgres configuration is pretty bog-standard, with a max_connections that exceeds the maximum number of connections the app server can make.
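For reference, the worst-case connection demand can be read off the Puma config above (8 workers, each with up to 16 threads, and with ActiveRecord's usual one-connection-per-thread pool sizing); a quick sanity check:

```ruby
# Worst-case DB connection demand, derived from the Puma config above:
# 8 workers x up to 16 threads each, one connection per thread at most.
workers     = 8
max_threads = 16
max_db_connections = workers * max_threads
puts max_db_connections  # => 128; postgres max_connections must exceed this
```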

jrkinnard

1 Answer


Just a wild guess, but maybe the firewall "forgot" about the TCP session? Many firewalls have a timeout for "unused" TCP sessions.

When your Rails application starts and connects to the database, everything works fine. After a longer period of silence between the Rails application and the database server, the firewall hits its TCP session timeout and considers the session closed, while both ends (Rails and the database server) still believe it is open. When Rails then tries to query the database, it is blocked by the firewall because the packets don't match any known TCP session.

If you make your Rails app run "SELECT 1" or something like that on a regular schedule, the connection shouldn't be dropped anymore.
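A minimal sketch of such a keepalive (the helper name and interval are illustrative, not part of the app; in this setup it would be started from Puma's on_worker_boot):

```ruby
# Hypothetical keepalive thread: run a trivial query at a fixed interval
# so the firewall always sees traffic on the connection. The interval
# must stay well under the firewall's TCP session timeout (3600 s here).
def start_keepalive(interval_seconds, &ping)
  Thread.new do
    loop do
      sleep interval_seconds
      begin
        ping.call
      rescue => e
        # a failed ping shouldn't kill the thread; log and try again
        warn "keepalive failed: #{e.message}"
      end
    end
  end
end

# In on_worker_boot, something like:
# start_keepalive(1800) do
#   ActiveRecord::Base.connection_pool.with_connection { |c| c.execute("SELECT 1") }
# end
```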

You can also try reconfiguring PostgreSQL's TCP keepalive behaviour. In postgresql.conf you can set:

tcp_keepalives_idle = 60
tcp_keepalives_interval = 1
tcp_keepalives_count = 5

This tells the TCP stack to send the first keepalive probe after 60 seconds of idle time, retry every second, and mark the connection as dead when 5 such probes go unanswered. The keepalive packets themselves should be sufficient to make the firewall keep the session open.

The default value for tcp_keepalives_idle on Linux is 7200 seconds, which is too high when your firewall discards TCP sessions after 3600 seconds. You may want to tune the kernel via sysctl on all your hosts to make all programs work better with that specific firewall:

net.ipv4.tcp_keepalive_time = 3500

This sets the default keepalive idle time to 3500 seconds, somewhat smaller than your firewall's TCP timeout.
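To make the kernel setting survive a reboot, one option is a sysctl drop-in file (a sketch; the file name here is an assumption, any name under /etc/sysctl.d/ works):

```
# /etc/sysctl.d/99-fw-keepalive.conf (hypothetical file name)
# Send TCP keepalives before the firewall's 3600 s session timeout.
net.ipv4.tcp_keepalive_time = 3500
```

Apply it immediately with `sysctl --system` (or wait for the next reboot).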

Andreas Rogge
  • Thank you for your response. Your response is plausible since the symptoms mimic what happens with my environment's connections. When the 504 Gateway Timeout errors are reached, nginx reports a timeout from upstream (from Puma) and postgres reports no data received from client (Puma) as if the connection has been "cloven in two." This could happen if the firewall was culling unused connections. Our palo alto settings show many different TCP session timeout settings, but none specify Unused connections. I'm unsure which is important: https://www.dropbox.com/s/l03kgl63u0kcltp/536AC626.PNG?dl=0 – jrkinnard May 11 '17 at 14:38
  • The setting that should apply in this case is "TCP Timeout" which is set to 3600 seconds or 1 hour. This means that if there is one hour of "silence" between the application server and the database server the open tcp connection would be dropped by the firewall. – Andreas Rogge May 12 '17 at 14:20
  • Thank you! Paring back net.ipv4.tcp_keepalive_time to 3500 on both the application and database servers has done the trick! – jrkinnard May 23 '17 at 13:58