
I created a cluster with Corosync/Pacemaker. In the cluster I've configured two resources, a virtual IP and Squid. There are two nodes in the cluster, both running Debian 8.

`crm status` shows both nodes as online, and everything works fine.

For testing purposes I stopped node 1. CRM shows that the resources have migrated to node 2, but when I use the virtual IP in a client's browser, I get no response. It usually takes about 10 minutes before the client can browse through the virtual IP on node 2.

I think (hope) it's a small misconfiguration, but at the moment I can't pin down where the failure is. Here's my config:

Nodes

Node 1                 Node 2
eth0 10.0.0.234        eth0 10.0.0.235
eth1 x.x.x.134         eth1 x.x.x.135

Virtual IP: 10.0.0.233
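
Not part of the original post, but a quick way to check which node currently holds the virtual IP and whether it answers at all (assuming the interface layout above):

# On each node: check whether the virtual IP is bound to eth0
ip -4 addr show dev eth0 | grep 10.0.0.233

# From a client: confirm the address responds
ping -c 3 10.0.0.233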

Corosync

totem {
    version: 2
    cluster_name: SQUID
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: aes256
    crypto_hash: sha1

    interface {
            ringnumber: 0
            bindnetaddr: x.x.x.0
            mcastaddr: 239.255.1.1
            mcastport: 5405
            ttl: 1
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
            subsys: QUORUM
            debug: off
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
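
A side note that is not from the original post: with a two-node votequorum configuration like this, it is worth confirming that both nodes see each other and that two_node mode is active before testing failover:

# Ring status of the local node
corosync-cfgtool -s

# Membership and quorum flags (should list both nodes and the 2Node flag)
corosync-quorumtool -s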

Pacemaker

primitive SQUID-IP IPaddr2 \
    params ip=10.0.0.233 cidr_netmask=24 nic=eth0 \
    op monitor interval=30s \
    meta target-role=Started
primitive SQUID-Service Squid \
    params squid_exe="/usr/sbin/squid3" squid_conf="/etc/squid3/squid.conf" squid_pidfile="/run/squid3.pid" squid_port=3128 squid_stop_timeout=10 debug_mode=v debug_log="/var/log/cluster.log" \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=120s \
    op monitor interval=10s timeout=30s \
    meta target-role=Started
colocation lb-loc inf: SQUID-IP SQUID-Service
order lb-ord inf: SQUID-IP SQUID-Service
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=1.1.15-e174ec8 \
    cluster-infrastructure=corosync \
    cluster-name=Squid \
    stonith-enabled=no \
    no-quorum-policy=ignore
rsc_defaults rsc-options: \
    resource-stickiness=200
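
For the failover test itself, a gentler alternative to stopping a node outright is to put it into standby; a minimal sketch with the crm shell, run on node 1:

# Watch where the resources are currently running
crm_mon -1

# Move the resources off node 1 without shutting it down
crm node standby

# Bring node 1 back once the test is done
crm node online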

Squid

#Networks
acl net_client src 192.168.1.0/24
acl net_cus src 10.0.200.0/24

#ACLs
acl SSL_ports port 443
acl Safe_ports port 80          # http
acl Safe_ports port 21          # ftp
acl Safe_ports port 443         # https
acl Safe_ports port 70          # gopher
acl Safe_ports port 210         # wais
acl Safe_ports port 1025-65535  # unregistered ports
acl Safe_ports port 280         # http-mgmt
acl Safe_ports port 488         # gss-http
acl Safe_ports port 591         # filemaker
acl Safe_ports port 777         # multiling http
acl CONNECT method CONNECT

#Rules
http_access deny !Safe_ports
http_access allow net_client
http_access allow net_cus
#http_access deny CONNECT !SSL_ports
http_access allow localhost manager
http_access deny manager
http_access allow localhost
http_access deny all

#Proxy Port
http_port 3128

#Cache Size
cache_mem 512 MB

#Cache Directory
cache_dir ufs /var/spool/squid3 100 16 256

#PID File
pid_filename /var/run/squid3.pid

#Cache Log
cache_log /var/log/squid3/cache.log

#Leave coredumps in the first cache dir
coredump_dir /var/spool/squid3

# Add any of your own refresh_pattern entries above these.
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern -i (/cgi-bin/|\?) 0     0%      0
refresh_pattern .               0       20%     4320

#Notification Address
cache_mgr my@address.com
  • Please add your `squid.conf` – Lenniey May 09 '17 at 14:17
  • I'd suggest setting the `shutdown_lifetime` directive in `squid.conf` to a low value (e.g. 10s; I even use 1s). Also: check your ARP tables, maybe your client doesn't try to connect to your "new" 2nd cluster node, but still has the ARP entry of the first one. – Lenniey May 09 '17 at 14:27
  • Thanks for your answer @Lenniey. I added `shutdown_lifetime 10 seconds` to my `squid.conf`. I also cleared the ARP cache on my client (Win10) after the Squid resource moved to node 2. Unfortunately the problem persists. I noticed that I also wasn't able to ping the 2nd node until, seemingly by magic, all services were working again (it took ~15 minutes this time). – Cpt.Captain May 09 '17 at 15:22
  • Do you have any router / switch between your client and the proxy? I had a similar problem where I had to change the ARP-refresh-interval on my router to get it to work. – Lenniey May 09 '17 at 15:27
  • Yep, there is a firewall and exactly this was the problem. Thanks a lot :) – Cpt.Captain May 10 '17 at 07:08
  • No problem, glad you sorted it out. – Lenniey May 10 '17 at 10:18
  • If you want to speed up the cluster failover, I think you need to configure STONITH. – c4f4t0r May 10 '17 at 12:40
  • @c4f4t0r This has nothing to do with stonith? – Lenniey May 10 '17 at 15:30
  • You may have solved the problem by doing something to the ARP cache, but anyway: Pacemaker needs STONITH to speed up failover when a node is in an unknown state, read the ClusterLabs docs. – c4f4t0r May 10 '17 at 18:17
  • @c4f4t0r The original question asked about a slow manual failover. Of course you should use some STONITH device in a production cluster, but that's out of the scope of this question, I presume. – Lenniey May 11 '17 at 06:44
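
Picking up the suggestions from the comments above (the squid.conf directive and the client-side ARP flush), a minimal sketch; the values are examples, not taken from the original configuration:

# /etc/squid3/squid.conf – don't wait the default 30 seconds for client
# connections to drain when the resource is stopped during failover
shutdown_lifetime 10 seconds

# On the Windows 10 client (elevated prompt): drop the stale ARP entry
# for the virtual IP so the next request re-resolves it
arp -d 10.0.0.233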

1 Answer


The problem was the ARP cache / refresh interval on an intermediate firewall. After reconfiguring it, failover works as intended.

– Lenniey
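
As a follow-up sketch, not part of the accepted answer: if the intermediate firewall is Linux-based, the stale neighbour entry can be inspected and dropped by hand, and the IPaddr2 agent can be told to announce a takeover more aggressively. The interface and parameter values below are assumptions:

# On a Linux-based firewall: look for the old MAC address behind the virtual IP
ip neigh show | grep 10.0.0.233

# Drop the entry so the next packet triggers a fresh ARP lookup
ip neigh flush to 10.0.0.233

# On the cluster: raise the number of gratuitous ARPs IPaddr2 sends on start
# (arp_count / arp_interval are IPaddr2 parameters; the values are examples)
crm configure edit SQUID-IP
#   params ip=10.0.0.233 cidr_netmask=24 nic=eth0 arp_count=10 arp_interval=200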