I am having issues load balancing UDP syslog across my Graylog cluster nodes. At first everything seemed to work normally, but it turns out that about 99% of the traffic is flowing to one of the two nodes.
I have two Ubuntu 18.04 servers running Keepalived 1.3.9. They share a virtual IP via VRRP and use NAT to forward traffic to the real servers with round-robin scheduling.
global_defs {
    notification_email {
        redacted@mail
    }
    notification_email_from severname-redacted
    smtp_server mailsever.redacted
    smtp_connect_timeout 30
    router_id servername
}
vrrp_instance VI_1 {
    state MASTER
    interface ens160
    virtual_router_id 216
    priority 200
    advert_int 1
    preempt_delay 30
    virtual_ipaddress {
        10.18.242.216
    }
    notify /usr/local/bin/vrrp_state.sh
}
virtual_server 10.18.242.216 10514 {
    delay_loop 2
    protocol UDP
    lb_algo rr    # round robin
    lb_kind NAT   # NAT
    real_server 10.18.242.214 10514 {
        weight 1
        HTTP_GET {
            url {
                path "/api/system/lbstatus"
                status_code 200
            }
            connect_timeout 3
            connect_port 9000
        }
    }
    real_server 10.18.242.213 10514 {
        weight 1
        HTTP_GET {
            url {
                path "/api/system/lbstatus"
                status_code 200
            }
            connect_timeout 3
            connect_port 9000
        }
    }
}
The secondary load balancer uses the same configuration, except for the priority, which is 100.
Failover between the load balancers is working as expected, but they both seem to forward the traffic only to the first Graylog node:
root@redacted-lb1:~# ipvsadm -L -n --rate
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port             CPS   InPPS  OutPPS   InBPS  OutBPS
  -> RemoteAddress:Port
UDP  10.18.242.216:10514             0      57       0   16581       0
  -> 10.18.242.213:10514             0      67       0   19666       0
  -> 10.18.242.214:10514             0       0       0       0       0
As you can see, no traffic reaches the other Graylog node, even though the weights are equal and the scheduler is round robin. Troubleshooting steps that did not help:
- Removing the first node from the load balancers: traffic still arrives on the LB, but it is not forwarded to the remaining Graylog node
- Changing the weights, which does not seem to have any effect
- Rebooting all servers
- Repeating all of the same tests on the secondary LB after shutting down LB1
The Graylog nodes are both working fine and are almost identical in configuration. Syslog sent directly to either node is received, so the nodes themselves do not seem to be the problem.
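To make the gap between what I expected and what I observe concrete, here is a minimal Python sketch. It compares round robin applied to every packet with round robin applied only when a new flow (source IP and source port) is first seen. IPVS does keep connection entries for UDP with a timeout, but whether that is what happens in my setup is only an assumption. The sender address and port below are hypothetical, and the sketch is a toy model rather than how IPVS is implemented.

```python
from itertools import cycle

# Real servers taken from the virtual_server block above.
REAL_SERVERS = ["10.18.242.214", "10.18.242.213"]


def schedule_per_packet(packets, servers):
    """Round robin on every packet (what I expected to see)."""
    rr = cycle(servers)
    counts = {s: 0 for s in servers}
    for _ in packets:
        counts[next(rr)] += 1
    return counts


def schedule_per_flow(packets, servers):
    """Round robin only for a new (source IP, source port) pair.

    Later packets of the same flow reuse the server chosen first,
    which is a simplified model of a connection table.
    """
    rr = cycle(servers)
    flows = {}
    counts = {s: 0 for s in servers}
    for src in packets:
        if src not in flows:
            flows[src] = next(rr)
        counts[flows[src]] += 1
    return counts


# Hypothetical sender: a single syslog source that always uses the
# same source port, so every packet belongs to the same flow.
packets = [("10.18.0.50", 45000)] * 1000

per_packet = schedule_per_packet(packets, REAL_SERVERS)
per_flow = schedule_per_flow(packets, REAL_SERVERS)

print("per packet:", per_packet)
print("per flow:  ", per_flow)
```

In the per-flow model a single sender with a fixed source port ends up entirely on one real server, which resembles the counters above. Listing the connection entries with `ipvsadm -L -n -c` on the active LB might show whether this is actually the case here.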