I'm trying to configure an active-passive stunnel setup, with the aid of Keepalived, for a public IP address at our company datacenter. I would like to know if a router or switch reconfiguration is recommended given the following scenario.
I currently have on each of two CentOS 6 boxes a confirmed-working installation of stunnel, which proxies the connection to a test page on another (internal) server. I based the configuration on this tutorial. (I have redacted the external virtual IP, which exists on the eth1 device, in order to protect the innocent.)
# Box 1 (primary)
vrrp_script chk_stunnel {           # Requires keepalived-1.1.13
    script "killall -0 stunnel"     # cheaper than pidof
    interval 2                      # check every 2 seconds
    weight 2                        # add 2 points of prio if OK
}
vrrp_instance stunnel_cluster {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    virtual_ipaddress {
        <VIRTUAL IP>/32 dev eth1
    }
    track_script {
        chk_stunnel
    }
}
# Box 2 (secondary)
vrrp_script chk_stunnel {           # Requires keepalived-1.1.13
    script "killall -0 stunnel"     # cheaper than pidof
    interval 2                      # check every 2 seconds
    weight 2                        # add 2 points of prio if OK
}
vrrp_instance stunnel_cluster {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        <VIRTUAL IP>/32 dev eth1
    }
    track_script {
        chk_stunnel
    }
}
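For reference, this is how I understand the priority arithmetic in the two configurations above. It is only a sketch: it assumes keepalived adds a positive weight to the base priority only while the check script exits 0, and effective_priority is a hypothetical helper of mine, not part of keepalived.

```shell
#!/bin/sh
# Sketch of the VRRP priority arithmetic for the two configs above.
# Assumption: with a positive weight, keepalived adds the weight to the base
# priority only while the tracked script exits 0.

# effective_priority BASE WEIGHT SCRIPT_OK(1|0)   (hypothetical helper)
effective_priority() {
    if [ "$3" -eq 1 ]; then
        echo $(( $1 + $2 ))
    else
        echo "$1"
    fi
}

# Both stunnel instances healthy: box 1 should hold the VIP.
echo "healthy:   box1=$(effective_priority 101 2 1) box2=$(effective_priority 100 2 1)"
# stunnel killed on box 1 only: box 2 should win the election.
echo "box1 down: box1=$(effective_priority 101 2 0) box2=$(effective_priority 100 2 1)"
```

With both checks passing this gives 103 against 102, so box 1 stays master. With stunnel down on box 1 it gives 101 against 102, so box 2 should take over, which is the behaviour I am expecting from the cluster.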
When I take down the primary instance of stunnel, syslogs indicate that the secondary box starts to send gratuitous ARP packets (as anticipated), but the test Web page does not fail over and instead becomes unavailable. After some time (several hours at least), the second box finally takes over.
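A sketch of how the failover could be checked by hand on either box follows. It assumes iproute2, tcpdump and the iputils version of arping are installed, that the commands run as root, and that VIP is set to the redacted address. The function names are hypothetical helpers, not existing tools.

```shell
#!/bin/sh
# Diagnostic sketch. Assumptions: iproute2, tcpdump and iputils arping are
# installed, the VIP lives on eth1, and the commands run as root.
# VIP is a placeholder for the redacted address; set it before calling.

# Does this box currently hold the virtual IP? (hypothetical helper)
vip_is_local() {
    ip -4 addr show dev eth1 | grep -qF "inet ${VIP}/"
}

# Watch ARP traffic on eth1 (with link-level headers) during a failover.
watch_arp() {
    tcpdump -n -e -i eth1 arp
}

# Manually send three unsolicited (gratuitous) ARPs for the VIP.
send_garp() {
    arping -U -c 3 -I eth1 "$VIP"
}
```

Running vip_is_local on the secondary after stopping stunnel on the primary would confirm that the address really moved, and watch_arp would show whether the gratuitous ARPs actually leave eth1. If a manual send_garp does not restore the test page either, that would point further toward the upstream devices rather than Keepalived.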
To me this sounds like ARP caching on at least one of our Juniper devices (a public-facing router and a separate switch). Rather than override the default timeout (which I believe to be six hours), I would prefer to have the gratuitous ARP work the way (I think?) it is supposed to work and trigger an ARP table update on those devices.
I believe that the gratuitous-arp-reply setting may prove useful here. Before I make this change on our system, I am hoping that someone will know:
- Have I overlooked a more likely culprit as the source of the failover failure?
- If the ARP caching is the issue, does this change sound like a "standard" way to go about what I am trying to do?
- Assuming a reasonably-secured corporate network perimeter, does changing this setting pose an unreasonable security risk?
- Are there any other "gotchas" I should know?
Thank you.