Keepalived, Junos, and ARP caching

Question

I'm trying to configure an active-passive stunnel setup, with the aid of Keepalived, for a public IP address at our company datacenter. I would like to know if a router or switch reconfiguration is recommended given the following scenario.

Each of two CentOS 6 boxes currently runs a confirmed-working installation of stunnel, which proxies connections to a test page on another (internal) server. I based the Keepalived configuration below on this tutorial. (I have redacted the external virtual IP, which lives on the eth1 device, in order to protect the innocent.)

# Box 1 (primary)
vrrp_script chk_stunnel {           # Requires keepalived-1.1.13
        script "killall -0 stunnel"     # cheaper than pidof
        interval 2                      # check every 2 seconds
        weight 2                        # add 2 points of prio if OK
}

vrrp_instance stunnel_cluster {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    virtual_ipaddress {
        <VIRTUAL IP>/32 dev eth1
    }
    track_script {
        chk_stunnel
    }
}

# Box 2 (secondary)
vrrp_script chk_stunnel {           # Requires keepalived-1.1.13
        script "killall -0 stunnel"     # cheaper than pidof
        interval 2                      # check every 2 seconds
        weight 2                        # add 2 points of prio if OK
}

vrrp_instance stunnel_cluster {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        <VIRTUAL IP>/32 dev eth1
    }
    track_script {
        chk_stunnel
    }
}

When I take down the primary instance of stunnel, syslogs indicate that the secondary box starts to send gratuitous ARP packets (as anticipated), but the test Web page does not fail over and instead becomes unavailable. After some time (several hours at least), the second box finally takes over.
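
To check that the priorities themselves are not the problem, here is a minimal sketch of how I understand the election arithmetic in the two configurations above. It is a simplified model, not Keepalived's actual code: the function names are my own, and it ignores tie-breaking, preemption options, and advertisement timing.

```python
# Simplified model of VRRP priority with a tracked script (hypothetical helpers).
# Assumption: a positive weight is added to the base priority only while the
# check script succeeds, as the "add 2 points of prio if OK" comment suggests.

def effective_priority(base, weight, script_ok):
    """Return the priority a node would advertise."""
    return base + weight if script_ok else base


def elected_master(priorities):
    """Return the name of the node with the highest advertised priority."""
    return max(priorities, key=priorities.get)


# Both stunnel instances running: box1 advertises 103, box2 advertises 102.
healthy = {
    "box1": effective_priority(101, 2, True),
    "box2": effective_priority(100, 2, True),
}

# stunnel stopped on box1: box1 drops to 101 while box2 stays at 102.
stunnel_down = {
    "box1": effective_priority(101, 2, False),
    "box2": effective_priority(100, 2, True),
}

print(healthy, elected_master(healthy))
print(stunnel_down, elected_master(stunnel_down))
```

Under this model the secondary should win the election as soon as stunnel stops on the primary, which matches the gratuitous ARP activity I see in the logs.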

To me this sounds like ARP caching on at least one of our Juniper devices (a public-facing router and a separate switch). Rather than override the default ARP cache timeout (which I believe to be six hours), I would prefer to have the gratuitous ARP work the way (I think?) it is supposed to work and trigger an update of the cached ARP entry for the virtual IP.
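
For reference, this is roughly what I understand a gratuitous ARP frame to contain: the sender and target protocol addresses are both the virtual IP, and the frame is broadcast so that neighbours can refresh their cache entry. The sketch below only builds the bytes and does not send anything. The MAC address and the 192.0.2.10 address are placeholders, the function names are my own, and the zeroed target hardware address is one convention among several.

```python
import socket
import struct


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def gratuitous_arp(mac, ip, as_reply=False):
    """Build an Ethernet frame carrying a gratuitous ARP for `ip`.

    as_reply=False uses opcode 1 (request), as_reply=True uses opcode 2 (reply).
    """
    sha = mac_bytes(mac)             # sender hardware address
    spa = socket.inet_aton(ip)       # sender protocol address (the virtual IP)
    tha = b"\x00" * 6                # target hardware address: zeros here (varies)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    eth = broadcast + sha + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4, opcode
    oper = 2 if as_reply else 1
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, oper)

    # The target protocol address equals the sender protocol address,
    # which is what makes the ARP "gratuitous".
    return eth + arp + sha + spa + tha + spa


frame = gratuitous_arp("02:00:00:00:00:01", "192.0.2.10")
print(len(frame), frame.hex())
```

The request and reply forms differ only in the opcode field, which is why I suspect the device's handling of one form or the other matters here.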

I believe that the gratuitous-arp-reply setting may prove useful here. Before I make this change on our system, I am hoping that someone will know:

1. Have I overlooked a more likely culprit as the source of the failover failure?
2. If the ARP caching is the issue, does this change sound like a "standard" way to go about what I am trying to do?
3. Assuming a reasonably-secured corporate network perimeter, does changing this setting pose an unreasonable security risk?
4. Are there any other "gotchas" I should know?

Thank you.
