
I am not a complete newcomer to server administration, but I have obviously missed something important here.

Problem: Connectivity to the server is lost for a particular subnet (as if the server's firewall had locked it out) whenever a website is accessed from one specific device: an Apple iPad (Version 8.4.1 (12H321), Model: MD515HC/A) running Safari.

Connectivity comes back after the iPad has been inactive for a short while.

If there is an active SSH connection to the server before the lockout, that connection stays up, but new connections to the server cannot be made (as if all ports were closed).

The iptables INPUT/OUTPUT policies are set to ACCEPT. Amazon EC2 is set up to allow all traffic from my IP address.

# iptables -L -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain f2b-sshd (0 references)
target     prot opt source               destination

Regular log files show absolutely no relevant information.

# apparmor_status
apparmor module is loaded.
1 profiles are loaded.
1 profiles are in enforce mode.
   docker-default
0 profiles are in complain mode.
0 processes have profiles defined.
0 processes are in enforce mode.
0 processes are in complain mode.
0 processes are unconfined but have a profile defined.


# cat /etc/selinux/config
SELINUX=permissive
SELINUXTYPE=targeted
SETLOCALDEFS=0

The web server runs nginx 1.12.1 with PHP 5.6 and 7.0. I updated nginx from 1.10 to 1.12.1, but the problem remains.

I doubt the problem lies in nginx itself; more likely it is in how nginx uses system resources.

The instance type is currently Amazon EC2 t2.micro, but the same problem occurs on c4.8xlarge.

Nothing out of the ordinary shows up in an strace of nginx when a page is accessed from the iPad.

Wireshark output on the client side, right after the connection hangs:

13713   1413.319083 192.168.8.100   52.57.147.216   TCP 66  54046 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1
...
13750   1422.319314 192.168.8.100   52.57.147.216   TCP 66  [TCP Retransmission] 54046 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1

tcpdump on the server while connections are working:

17:10:53.188562 IP (tos 0x0, ttl 111, id 11792, offset 0, flags [DF], proto TCP (6), length 40)
    XXX.XXX.XXX.XXX.55020 > 172.31.12.47.80: Flags [F.], cksum 0x3882 (correct), seq 2232, ack 1211, win 255, length 0
17:10:53.188741 IP (tos 0x0, ttl 111, id 11793, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.55031 > 172.31.12.47.80: Flags [S], cksum 0x111a (correct), seq 2503140615, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
17:10:53.249513 IP (tos 0x0, ttl 111, id 11794, offset 0, flags [DF], proto TCP (6), length 40)
    XXX.XXX.XXX.XXX.55031 > 172.31.12.47.80: Flags [.], cksum 0x984a (correct), seq 2503140616, ack 1871922116, win 260, length 0
17:10:53.252631 IP (tos 0x0, ttl 111, id 11795, offset 0, flags [none], proto TCP (6), length 784)
    XXX.XXX.XXX.XXX.55031 > 172.31.12.47.80: Flags [P.], cksum 0x88f8 (correct), seq 0:744, ack 1, win 260, length 744: HTTP, length: 744
        GET /wtf/2.htm HTTP/1.1
        Host: www....lv
        Connection: keep-alive
        Pragma: no-cache
        Cache-Control: no-cache
        Upgrade-Insecure-Requests: 1
        User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
        Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
        Accept-Encoding: gzip, deflate
        Accept-Language: en-US,en;q=0.8
        Cookie: PHPSESSID=0hospgqmearo59saf20cfv7tt3; _hjIncludedInSample=1; _ga=GA1.2.2014778432.1499196989; _gid=GA1.2.833813339.1507378818
        x-tele2-subid: XXX.XXX.XXX.XXX

17:10:53.359526 IP (tos 0x0, ttl 111, id 11796, offset 0, flags [DF], proto TCP (6), length 40)
    XXX.XXX.XXX.XXX.55031 > 172.31.12.47.80: Flags [.], cksum 0x93d0 (correct), seq 744, ack 404, win 259, length 0

tcpdump on the server right after connections are dropped:

17:11:19.181562 IP (tos 0x0, ttl 47, id 38570, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51273 > 172.31.12.47.80: Flags [.], cksum 0xf058 (correct), seq 1157, ack 199, win 4129, options [nop,nop,TS val 323793256 ecr 4027542545], length 0
17:11:19.251976 IP (tos 0x0, ttl 47, id 8939, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51274 > 172.31.12.47.80: Flags [.], cksum 0x711b (correct), seq 1158, ack 198, win 4129, options [nop,nop,TS val 323793326 ecr 4027542547], length 0
17:11:20.212575 IP (tos 0x0, ttl 111, id 11804, offset 0, flags [DF], proto TCP (6), length 40)
    XXX.XXX.XXX.XXX.55058 > 172.31.12.47.80: Flags [F.], cksum 0x3b77 (correct), seq 744, ack 405, win 259, length 0
17:11:20.212839 IP (tos 0x0, ttl 111, id 11805, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.55069 > 172.31.12.47.80: Flags [S], cksum 0xc9cb (correct), seq 4012888626, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
17:11:20.459739 IP (tos 0x0, ttl 111, id 11806, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.55070 > 172.31.12.47.80: Flags [S], cksum 0xd787 (correct), seq 1916158319, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
17:11:21.219597 IP (tos 0x0, ttl 47, id 25897, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51272 > 172.31.12.47.80: Flags [.], cksum 0x4702 (correct), seq 2220, ack 2185, win 4096, options [nop,nop,TS val 323795291 ecr 4027543025], length 0
17:11:21.221524 IP (tos 0x0, ttl 47, id 12413, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51273 > 172.31.12.47.80: Flags [.], cksum 0xe66f (correct), seq 1157, ack 200, win 4129, options [nop,nop,TS val 323795291 ecr 4027543046], length 0
17:11:21.221548 IP (tos 0x0, ttl 47, id 40941, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51274 > 172.31.12.47.80: Flags [.], cksum 0x6779 (correct), seq 1158, ack 199, win 4129, options [nop,nop,TS val 323795291 ecr 4027543047], length 0
17:11:22.010619 IP (tos 0x0, ttl 47, id 20698, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51272 > 172.31.12.47.80: Flags [F.], cksum 0x43ff (correct), seq 2220, ack 2185, win 4096, options [nop,nop,TS val 323796061 ecr 4027543025], length 0
17:11:22.010687 IP (tos 0x0, ttl 47, id 21278, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51273 > 172.31.12.47.80: Flags [F.], cksum 0xe36c (correct), seq 1157, ack 200, win 4129, options [nop,nop,TS val 323796061 ecr 4027543046], length 0
17:11:22.010780 IP (tos 0x0, ttl 47, id 37726, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51274 > 172.31.12.47.80: Flags [F.], cksum 0x6477 (correct), seq 1158, ack 199, win 4129, options [nop,nop,TS val 323796060 ecr 4027543047], length 0
17:11:22.391572 IP (tos 0x0, ttl 47, id 30595, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.51273 > 172.31.12.47.80: Flags [F.], cksum 0xe208 (correct), seq 1125, ack 200, win 4129, options [nop,nop,TS val 323796449 ecr 4027543046], length 0
17:11:22.462590 IP (tos 0x0, ttl 47, id 9929, offset 0, flags [DF], proto TCP (6), length 40)
    XXX.XXX.XXX.XXX.51273 > 172.31.12.47.80: Flags [R], cksum 0xf229 (correct), seq 3704030890, win 0, length 0
17:11:23.201564 IP (tos 0x0, ttl 111, id 11807, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.55069 > 172.31.12.47.80: Flags [S], cksum 0xc9cb (correct), seq 4012888626, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
17:11:23.459562 IP (tos 0x0, ttl 111, id 11808, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.55070 > 172.31.12.47.80: Flags [S], cksum 0xd787 (correct), seq 1916158319, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
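
The repeated SYNs in the capture above (same source port and sequence number, about three seconds apart) are retransmissions: the client never received a SYN/ACK. A minimal sketch for picking them out of a longer capture follows. It assumes the two-line `tcpdump -v` layout shown above; the function name `retransmitted_syns` is illustrative and not part of any tool.

```python
import re
from collections import defaultdict

# First line of each packet: "17:11:20.212839 IP (tos 0x0, ...)"
TS_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d+) IP ")
# Second line: "    SRC > DST: Flags [S], cksum ..., seq N, ..."
# "[S]" matches a bare SYN only; a SYN/ACK is printed as "[S.]".
SYN_LINE = re.compile(r"^\s+(\S+) > (\S+): Flags \[S\], .*?\bseq (\d+),")

# Two packets copied from the capture above.
SAMPLE = """\
17:11:20.212839 IP (tos 0x0, ttl 111, id 11805, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.55069 > 172.31.12.47.80: Flags [S], cksum 0xc9cb (correct), seq 4012888626, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
17:11:23.201564 IP (tos 0x0, ttl 111, id 11807, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.XXX.XXX.XXX.55069 > 172.31.12.47.80: Flags [S], cksum 0xc9cb (correct), seq 4012888626, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
"""


def retransmitted_syns(dump_text):
    """Return {(src, dst, seq): [timestamps]} for SYNs seen more than once."""
    seen = defaultdict(list)
    last_ts = None
    for line in dump_text.splitlines():
        m = TS_LINE.match(line)
        if m:
            last_ts = m.group(1)  # remember the timestamp for the next line
            continue
        m = SYN_LINE.match(line)
        if m:
            seen[m.groups()].append(last_ts)
    # A SYN with the same endpoints and sequence number is a retransmission.
    return {key: times for key, times in seen.items() if len(times) > 1}
```

Calling `retransmitted_syns(SAMPLE)` groups the two packets under a single key, which makes unanswered connection attempts easy to spot.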

Sometimes it takes two or three page refreshes on the iPad to break connectivity. New connections are dropped for about 2 to 5 minutes, and then everything returns to normal until the iPad is used again.

Any hint on how to track this down is highly appreciated. To be honest, I am out of ideas.

Update #1

# sysctl -p
net.ipv4.ip_forward = 1
fs.file-max = 65536
net.ipv4.conf.all.rp_filter = 1
net.ipv4.tcp_synack_retries = 2
net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
net.core.rmem_default = 31457280
net.core.rmem_max = 12582912
net.core.wmem_default = 31457280
net.core.wmem_max = 12582912
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 65536
net.core.optmem_max = 25165824
net.ipv4.tcp_mem = 65536 131072 262144
net.ipv4.udp_mem = 65536 131072 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.udp_rmem_min = 16384
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.udp_wmem_min = 16384
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

Update #2: nginx.conf, as requested

user  www-data;
#worker_processes  8;
worker_processes 1;
error_log  /var/log/nginx/error.log warn; 
pid        /var/run/nginx.pid;
events {
    worker_connections  1024;
    multi_accept on;
    use epoll; 
}
worker_rlimit_nofile 65536;

http {
        include       /etc/nginx/mime.types;
        default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    log_format scripts '$document_root$fastcgi_script_name > $request';
    access_log /var/log/nginx/access.log main;
    server_tokens off;
    sendfile        off;
    tcp_nopush     on;
    tcp_nodelay     on;
    client_max_body_size 400M;
    client_body_buffer_size 1m;
    client_header_timeout 15;

    keepalive_timeout  2 2;
#    open_file_cache          max=10000 inactive=5m;
#    open_file_cache_valid    2m;
#    open_file_cache_min_uses 5;
#    open_file_cache_errors   off;
    send_timeout 15;


    fastcgi_max_temp_file_size 0;

    gzip on;
    gzip_disable "msie6";

    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_buffers 16 8k;
    gzip_http_version 1.1;
    gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascripti application/javascript;

    server {
        listen       80  default_server;
        server_name  _;
        return       444;
    }
    include /etc/nginx/sites-enabled/*; 

}

Virtual host config (which one does not matter: all hosts are affected. When any vhost is accessed from the iPad, connections to the server freeze)

server {
    listen 80;
    listen 443 ssl http2;
    server_name ds.somehost.lv;
    root "/www/ds.somehost.lv/html/public";

    index index.html index.htm index.php;

    charset utf-8;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location = /favicon.ico { access_log off; log_not_found off; }
    location = /robots.txt  { access_log off; log_not_found off; }

    error_log  /var/log/nginx/ds.somehost.app-error.log error;

    sendfile off;

    client_max_body_size 1000m;

    location ~ \.php$ {
        fastcgi_split_path_info ^(.+\.php)(/.+)$;
        fastcgi_pass unix:/var/run/php/php7.0-fpm.sock;
        fastcgi_index index.php;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;


        fastcgi_intercept_errors off;
        fastcgi_buffer_size 16k;
        fastcgi_buffers 4 16k;
        fastcgi_connect_timeout 300;
        fastcgi_send_timeout 300;
        fastcgi_read_timeout 300;
    }

    location ~ /\.ht {
        deny all;
    }

}

access.log: requests from the Windows machine:

XXX.XXX.XXX.XXX - - [08/Oct/2017:23:53:49 +0300] "GET /?asdfasd=asdfasd HTTP/1.1" 200 5430 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" "-"
XXX.XXX.XXX.XXX - - [08/Oct/2017:23:53:52 +0300] "GET /?asdfasd=asdfasd HTTP/1.1" 200 5430 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" "-"
XXX.XXX.XXX.XXX - - [08/Oct/2017:23:53:54 +0300] "GET /?asdfasd=asdfasd HTTP/1.1" 200 5430 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" "-"

Requests from the iPad:

XXX.XXX.XXX.XXX - - [08/Oct/2017:23:53:57 +0300] "GET /login HTTP/1.1" 200 967 "-" "Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4" "-"
XXX.XXX.XXX.XXX - - [08/Oct/2017:23:53:57 +0300] "GET /css/bootstrap.min.css HTTP/1.1" 304 0 "http://ds.somehost.lv/login" "Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4" "-"
XXX.XXX.XXX.XXX - - [08/Oct/2017:23:53:57 +0300] "GET /css/gentelella.min.css HTTP/1.1" 304 0 "http://ds.somehost.lv/login" "Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4" "-"
XXX.XXX.XXX.XXX - - [08/Oct/2017:23:53:57 +0300] "GET /css/font-awesome.min.css HTTP/1.1" 304 0 "http://ds.somehost.lv/login" "Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4" "-"

At this point I try to open the same page from the Windows machine (the request times out).

Refreshing the page from the iPad: the request is served instantly.

XXX.XXX.XXX.XXX - - [08/Oct/2017:23:54:08 +0300] "GET /login HTTP/1.1" 200 967 "-" "Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4" "-"
XXX.XXX.XXX.XXX - - [08/Oct/2017:23:54:09 +0300] "GET /css/bootstrap.min.css HTTP/1.1" 304 0 "http://ds.somehost.lv/login" "Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4" "-"
XXX.XXX.XXX.XXX - - [08/Oct/2017:23:54:09 +0300] "GET /css/font-awesome.min.css HTTP/1.1" 304 0 "http://ds.somehost.lv/login" "Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4" "-"
XXX.XXX.XXX.XXX - - [08/Oct/2017:23:54:09 +0300] "GET /css/gentelella.min.css HTTP/1.1" 304 0 "http://ds.somehost.lv/login" "Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4" "-"

No errors are logged anywhere (syslog, kernel.log, nginx error log).

Update #3: It turns out that new connections are blocked for exactly 60 seconds.
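
A window like this can be measured from an affected client by probing the port at a fixed interval and recording when connects start and stop failing. The following is only a sketch: `can_connect` and `longest_outage` are hypothetical helper names, and the probing loop itself is left to the reader.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts and refused connections
        return False


def longest_outage(samples):
    """samples: iterable of (timestamp_seconds, ok) pairs in time order.

    Returns the longest span, in seconds, from the first failed probe
    of an outage to the next successful probe.
    """
    longest = 0.0
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts  # outage starts
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)  # outage ends
            down_since = None
    return longest
```

Collecting `(time.monotonic(), can_connect(host, 80))` once per second while reloading the page on the iPad, then passing the list to `longest_outage`, gives the length of the blocked window to within the probe interval.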

Didzis
  • Just to confirm I understand this - when you use your iPad on your home (or work?) WiFi connection all connections to your server are dropped? – Tim Oct 08 '17 at 19:02
  • Correct - all new connections from the same network to the target server are dropped. Target server is still accessible from other networks. – Didzis Oct 08 '17 at 19:28
  • Please add your Nginx / server block configuration and, if appropriate, logs if they help demonstrate the problem. – Tim Oct 08 '17 at 20:14
  • Intuition suggests that this problem is broken firewall or router behavior on the source network. Your packet captures on the server do not seem to be capturing the response packets in the success scenario, so the fact that they are absent from the failure scenario seems inconclusive, and suggests that you should retry the captures. – Michael - sqlbot Oct 08 '17 at 20:16
  • @Michael-sqlbot I thought so too, but the source network has been changed a few times. Both - iPad and laptop connected to mobile hotspot - behavior is the same. iPad and laptop connected to 4g router of service provider Nr1 - the same, another router from yet another service provider - results are the same. After this I have ruled out that there might be a routing/antivirus/whatever problem on the source network. I could be wrong of course. – Didzis Oct 08 '17 at 21:04

1 Answer


It turns out the problem came down to low-level networking: a SYN packet is sent and the server never answers it.

The question "Why would a server not send a SYN/ACK packet in response to a SYN packet" pointed me in the right direction.

Turning off tcp_timestamps in sysctl worked around the problem described above. The real cause, however, was the tcp_tw_recycle setting, which had been enabled for some reason.
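
To see whether a given machine has this setting turned on, the value can be read from `/proc/sys`. This is a minimal sketch: `read_sysctl` and `tw_recycle_state` are illustrative names, and the "absent" case is there because the option was removed in Linux 4.12, so newer kernels do not expose the file at all.

```python
from pathlib import Path


def read_sysctl(name, proc_root="/proc/sys"):
    """Return a sysctl value as a string, or None if the kernel lacks it."""
    # net.ipv4.tcp_tw_recycle -> /proc/sys/net/ipv4/tcp_tw_recycle
    path = Path(proc_root) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None


def tw_recycle_state(proc_root="/proc/sys"):
    """Classify tcp_tw_recycle as 'enabled', 'disabled' or 'absent'."""
    value = read_sysctl("net.ipv4.tcp_tw_recycle", proc_root)
    if value is None:
        return "absent"  # kernel without the option (4.12 or later)
    return "disabled" if value == "0" else "enabled"
```

If this reports "enabled", setting `net.ipv4.tcp_tw_recycle = 0` in the sysctl configuration (or deleting the line, since 0 is the default) and reloading it with `sysctl -p` turns the behaviour off.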

tcp_tw_recycle (Boolean; default: disabled; since Linux 2.4)
          Enable fast recycling of TIME_WAIT sockets.  Enabling this
          option is not recommended for devices communicating with the
          general Internet or using NAT (Network Address Translation).
          Since some NAT gateways pass through IP timestamp values, one
          IP can appear to have non-increasing timestamps.  See RFC 1323
          (PAWS), RFC 6191.
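
The effect of that check can be illustrated with a toy model. This is not kernel code, only a sketch under stated assumptions: the server remembers the last TCP timestamp seen per source IP for 60 seconds (`TCP_PAWS_MSL` in the Linux sources) and drops any new SYN from that IP whose timestamp is lower. It also assumes that a SYN with no timestamp option is dropped while the remembered entry is fresh, which would match the capture above where the Windows SYNs carry no TS option. The class name, the address 203.0.113.7, and the laptop timestamp are hypothetical, and 32-bit timestamp wraparound is ignored.

```python
PAWS_MSL = 60     # seconds a per-host timestamp stays valid (TCP_PAWS_MSL)
PAWS_WINDOW = 1   # tolerated backwards step in timestamp units (TCP_PAWS_WINDOW)


class RecycleCheck:
    """Toy model of the per-source-IP timestamp check behind tcp_tw_recycle."""

    def __init__(self):
        self._last = {}  # source IP -> (TCP timestamp value, time recorded)

    def remember(self, ip, tsval, now):
        """Record the timestamp last seen from this IP."""
        self._last[ip] = (tsval, now)

    def accept_syn(self, ip, tsval, now):
        """Return True if a new SYN from ip would be answered."""
        entry = self._last.get(ip)
        if entry is None:
            return True                # nothing remembered for this IP
        last_ts, seen_at = entry
        if now - seen_at >= PAWS_MSL:
            return True                # remembered value has expired
        if tsval is None:
            return False               # assumption: no TS option means drop
        return last_ts - tsval <= PAWS_WINDOW


# Two devices behind one NAT address, each with its own timestamp clock.
check = RecycleCheck()
check.remember("203.0.113.7", 323796061, now=0)            # iPad, value from the capture
laptop_blocked = not check.accept_syn("203.0.113.7", 1000, now=10)
laptop_ok_later = check.accept_syn("203.0.113.7", 1000, now=60)
```

In this model the laptop's SYN is dropped at 10 seconds and accepted again at 60 seconds, while the iPad, whose timestamps keep increasing, is never affected.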

And here is a great write-up to make it stick: https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux

Am I feeling a bit stupid now? Yes

Hands off kernel settings? Definitely!

Didzis