
I need to build a VPN connection between a network and a Kubernetes cluster, so that applications hosted in this network can reach K8s services via a secured tunnel.

So, I have a bunch of K8s nodes in a self-hosted environment. I've added a separate server to this environment that works as a VPN gateway; it's connected to the same VLAN as the cluster nodes. The nodes have the following IP addresses: 10.13.17.1/22, 10.13.17.2/22, 10.13.17.3/22 and so on. The VPN gateway has 10.13.16.253/22.

The cluster IP CIDR is 10.233.0.0/18, the pod IP CIDR is 10.233.64.0/18.

The VPN server supports an IPsec site-to-site connection with a remote network, 10.103.103.0/24. I use Calico as the networking manager, so I've set up my VPN server to keep BGP sessions with all K8s nodes. The VPN server's routing table is full of prefixes announced by the Calico nodes (10.233.0.0/18 is present as well, of course), and the cluster nodes have 10.103.103.0/24 and some other networks in their routing tables, so BGP seems to be working fine. So far so good...
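For reference, the peering described above can be expressed with a single global Calico `BGPPeer` resource. This is a sketch reconstructed from the values mentioned in this thread (the name `router1` comes from the comments below; everything else is from the post):

```yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1
spec:
  # Omitting node/nodeSelector makes this a global peer:
  # every Calico node peers with the VPN gateway.
  peerIP: 10.13.16.253
  asNumber: 35409
```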

When I establish a connection to a service inside the cluster from the VPN server, everything is good too. The client (10.13.16.253) sends a SYN packet to the service (10.233.10.101:1337); the worker receives this packet, changes its destination IP address to the IP address of the pod (10.233.103.49:1337) and changes its source IP address to an address (10.233.110.0) that lets the worker receive the reply and hand it back to the connection initiator. Here's what happens on the worker that receives this SYN packet. The SYN packet comes to a worker:

22:04:25.866546 IP 10.13.16.253.56297 > 10.233.10.101.1337: Flags [S], seq 3575679444, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 1385938010 ecr 0], length 0

The SYN packet is SNATed and DNATed and then sent to the worker where the pod is running:

22:04:25.866656 IP 10.233.110.0.54430 > 10.233.103.49.1337: Flags [S], seq 3575679444, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 1385938010 ecr 0], length 0

The reply comes back:

22:04:25.867313 IP 10.233.103.49.1337 > 10.233.110.0.54430: Flags [S.], seq 2017844946, ack 3575679445, win 28960, options [mss 1460,sackOK,TS val 1201488363 ecr 1385938010,nop,wscale 7], length 0

The reply is de-SNATed and de-DNATed to be sent back to the connection initiator:

22:04:25.867533 IP 10.233.10.101.1337 > 10.13.16.253.56297: Flags [S.], seq 2017844946, ack 3575679445, win 28960, options [mss 1460,sackOK,TS val 1201488363 ecr 1385938010,nop,wscale 7], length 0

So, the connection is established and everyone is happy.

But when I try to connect to the same service from the external network (10.103.103.0/24), the worker that receives the SYN packet does NOT change the source IP address; it changes the destination IP address only, so the packet's source IP address is left intact. The SYN packet comes to a worker:

21:56:05.794171 IP 10.103.103.1.52132 > 10.233.10.101.1337: Flags [S], seq 3759345254, win 29200, options [mss 1460,sackOK,TS val 195801472 ecr 0,nop,wscale 7], length 0

The SYN packet is DNATed and resent to the worker where the pod is running:

21:56:05.794242 IP 10.103.103.1.52132 > 10.233.103.49.1337: Flags [S], seq 3759345254, win 29200, options [mss 1460,sackOK,TS val 195801472 ecr 0,nop,wscale 7], length 0

And nothing comes back in reply. :-(

So the destination IP address is changed, and I can see these packets on the worker where the pod is running, but there are no replies to them:

21:56:05.794602 IP 10.103.103.1.52132 > 10.233.103.49.1337: Flags [S], seq 3759345254, win 29200, options [mss 1460,sackOK,TS val 195801472 ecr 0,nop,wscale 7], length 0

The external network (10.103.103.0/24) is advertised by the VPN server via BGP, so all the workers know that this network is reachable via 10.13.16.253. When I run a ping test from a host in the external network (10.103.103.1) to the IP address of the service (10.233.10.101), the test passes, the VPN works fine, and the routing tables seem to be correct.

So, why does the network "trust" 10.13.16.253 but not 10.103.103.1? And why does the worker perform SNAT and DNAT for packets from 10.13.16.253 but not SNAT for packets from 10.103.103.1? Should I add some policies to allow this traffic?

Thanks in advance for any clues!

Volodymyr Melnyk
  • note: when testing from 10.103.103.1, the re-routing between Service and Pod IP does not seem to translate source IP, as it did when you were testing from the VPN server. Is your VPN server serving routes in BGP? How would your k8s node reach 10.103.103/24? – SYN Feb 15 '21 at 15:52
  • Thank you for your attention. BGP seems to be working fine, all the nodes have correct routes to `10.103.103.0/24`, e.g. `10.103.103.0 via 10.13.16.253 dev eth0 src 10.13.17.1`. These routes appear when I enable BGP and disappear when I disable it. – Volodymyr Melnyk Feb 15 '21 at 18:41
  • Sounds good. Could you elaborate on how calico is peered to your VPN server. Full mesh, which objects have you created? Same AS? Some firewall enabled on your VPN gateway? Might help to set specific AS for each node, if not already done. Make sure your VPN server won't drop /32 prefixes, ... Or try only peering with one node as a starting point, see if that would help? – SYN Feb 16 '21 at 11:30
  • Yes, Calico peers work as a full mesh network. I've created only the `bgpPeer` object, that's how it looks like: `router1 10.13.16.253 (global) 35409`. – Volodymyr Melnyk Feb 16 '21 at 11:43
  • Calico peers belong to `64512`, the ToR peer belongs to my real AS (`35409`). Do they have to belong to the same AS? – Volodymyr Melnyk Feb 16 '21 at 11:43
  • Firewall on the VPN server doesn't drop any packets, I watch `tcpdump` and I observe the incoming packets on the nodes, but I don't see the replies that are supposed to be sent. So, I'm 100% sure that couldn't be the VPN server's firewall. – Volodymyr Melnyk Feb 16 '21 at 11:45
  • different AS is good. single bgppeer object should be fine. we're sure statefull firewalling isn't involved, ... in your first follow-up, the route you quote doesn't include a prefix: is that normal/a typo? Are we certain you can reach 10.103.103.0/24 from your k8s nodes? With tcpdump, you don't see the ACK being lost anywhere (wild guess: default gateway) ? Nothing out of the ordinary in calico Pod logs? Looks good otherwise ... – SYN Feb 16 '21 at 12:17
  • Alas, that's just because I checked it with `ip route get 10.103.103.0`. If I check it with `ip route get 10.103.103.1`, I have the following: `10.103.103.1 via 10.13.16.253 dev eth0 src 10.13.17.2`. It would be so nice to realize that I was advertising only `10.103.103.0/32` instead of `10.103.103.0/24`, but - no. Alas, no. :-( – Volodymyr Melnyk Feb 16 '21 at 14:46
  • And, no, I don't see an ACK anywhere. I run `tcpdump -n -i any` on the worker that receives the initial SYN packet, but there is no ACK. :-( – Volodymyr Melnyk Feb 16 '21 at 14:47

1 Answer


Ta-damn!

pfSense was breaking the SYN packet's checksum:

13:53:32.286601 IP (tos 0x0, ttl 62, id 33830, offset 0, flags [DF], proto TCP (6), length 60)
    10.103.103.1.47390 > 10.233.10.101.1337: Flags [S], cksum 0x86e4 (incorrect -> 0x99db), seq 4230752647, win 29200, options [mss 1460,sackOK,TS val 598846881 ecr 0,nop,wscale 7], length 0
        0x0000:  4500 003c 8426 4000 3e06 31e0 0a67 6701  E..<.&@.>.1..gg.
        0x0010:  0ae9 0a65 b91e 0539 fc2c 2987 0000 0000  ...e...9.,).....
        0x0020:  a002 7210 86e4 0000 0204 05b4 0402 080a  ..r.............
        0x0030:  23b1 ada1 0000 0000 0103 0307            #...........
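As a sanity check, both checksum values tcpdump reports can be reproduced from the hex dump above (a short Python sketch; the packet bytes are copied verbatim from the capture, nothing else is assumed):

```python
def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit big-endian words with end-around carry folding (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)
    return total

# The captured packet, byte for byte.
packet = bytes.fromhex(
    "4500003c84264000"  # IP: ver/ihl, tos, total length 60, id, flags [DF]
    "3e0631e00a676701"  # ttl 62, proto 6 (TCP), IP cksum, src 10.103.103.1
    "0ae90a65"          # dst 10.233.10.101
    "b91e0539fc2c2987"  # TCP: sport 47390, dport 1337, seq
    "00000000a0027210"  # ack, flags [S], win 29200
    "86e40000"          # TCP cksum 0x86e4 (as seen on the wire), urg ptr
    "020405b40402080a"  # options: mss 1460, sackOK, timestamps...
    "23b1ada100000000"
    "01030307"          # ...nop, wscale 7
)

ip_header, tcp = packet[:20], packet[20:]
src, dst = ip_header[12:16], ip_header[16:20]

# TCP pseudo-header: src IP, dst IP, zero byte, protocol 6, TCP length.
pseudo = src + dst + bytes([0, 6]) + len(tcp).to_bytes(2, "big")

# Correct checksum: one's complement of the sum over pseudo-header plus
# the TCP segment with its checksum field (offset 16) zeroed.
tcp_zeroed = tcp[:16] + b"\x00\x00" + tcp[18:]
correct = 0xFFFF ^ ones_complement_sum(pseudo + tcp_zeroed)
print(hex(correct))  # 0x99db, matching tcpdump's "incorrect -> 0x99db"

# The on-wire value is exactly the pseudo-header sum alone: the kernel
# pre-fills the field and expects the NIC to finish the computation
# (checksum offload), but here the NIC never did.
print(hex(ones_complement_sum(pseudo)))  # 0x86e4
```

So the bogus `0x86e4` is the classic signature of checksum offload gone wrong, not random corruption.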

I've disabled the hardware checksum offload feature and now everything works smoothly.
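For anyone hitting the same thing, the setting lives under System > Advanced > Networking in pfSense ("Disable hardware checksum offload"). On plain FreeBSD the equivalent per-interface flags can be cleared like this (a sketch; `em0` is a placeholder interface name):

```sh
# Turn off TX/RX checksum offload on the interface (takes effect
# immediately; persist it via rc.conf ifconfig_em0 options).
ifconfig em0 -txcsum -rxcsum
```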

Lots of thanks to y'all for your time and attention!

Volodymyr Melnyk