Rancher Server Setup
- Rancher version: 2.6.3
- Installation option (Docker install/Helm Chart): Helm Chart, Kubernetes v1.21.6 and RKE1
Information about the Cluster
- Kubernetes version: v1.20.15-rancher1-2
- Cluster Type (Local/Downstream): Downstream
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): RKE Custom (3 nodes on-prem + 1 node on Azure)
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): Admin role
Describe the bug
To illustrate the inter-pod communication problem, consider these three dcgm-exporter pods that collect and expose GPU metrics:
URL1- http://10.42.0.79:9400/metrics -> Pod 10.42.0.79 running on node-1-on-prem
URL2- http://10.42.2.77:9400/metrics -> Pod 10.42.2.77 running on node-2-on-prem
URL3- http://10.42.4.54:9400/metrics -> Pod 10.42.4.54 running on node-3-azure
On node-1-on-prem Linux shell: curl to URL1 and URL2 succeeds; curl to URL3 fails
On node-2-on-prem Linux shell: curl to URL1 and URL2 succeeds; curl to URL3 fails
On node-3-azure Linux shell: curl to URL1 and URL2 fails; curl to URL3 succeeds
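The manual curl checks above can be repeated consistently on each node with a small script. The following is only a sketch: it tests TCP reachability of port 9400 rather than fetching the metrics, and the names `ENDPOINTS`, `tcp_reachable`, and `probe_all` are made up for illustration.

```python
import socket

# Pod endpoints from the report (dcgm-exporter is exposed on port 9400).
ENDPOINTS = {
    "URL1 (node-1-on-prem)": ("10.42.0.79", 9400),
    "URL2 (node-2-on-prem)": ("10.42.2.77", 9400),
    "URL3 (node-3-azure)": ("10.42.4.54", 9400),
}


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


def probe_all(endpoints=ENDPOINTS, timeout=3.0):
    """Probe every endpoint and return a {label: reachable} mapping."""
    return {
        label: tcp_reachable(host, port, timeout)
        for label, (host, port) in endpoints.items()
    }
```

Running `probe_all()` from the shell of each node and comparing the three results should reproduce the pattern described above: endpoints on the same side of the VPN are reachable, endpoints on the other side are not.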
Reproduce
- On-prem subnet is 10.133.100.0/24 and Azure subnet is 10.208.2.0/24
- The Azure virtual network and the local network are connected by a site-to-site VPN
- Node to node connections are successful and there are no port restrictions in Azure and on-prem
- IPv4 port forwarding enabled on all nodes
- Downstream cluster container network interface configuration:
    network:
      mtu: 0
      options:
        flannel_backend_type: vxlan
      plugin: canal
- Azure node addition to cluster is flawless and all pods come up
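One way to read the setup above is to check which network each address belongs to. The sketch below assumes the pod CIDR is the RKE default 10.42.0.0/16, which the report does not state explicitly; `classify` is a hypothetical helper. It shows that the pod IPs lie outside both node subnets, so the VPN would only carry pod traffic when it is encapsulated between node IPs (flannel's VXLAN backend uses UDP port 8472 by default on Linux).

```python
import ipaddress

# Node subnets stated in the report.
NODE_SUBNETS = {
    "on-prem": ipaddress.ip_network("10.133.100.0/24"),
    "azure": ipaddress.ip_network("10.208.2.0/24"),
}

# Assumption: the cluster uses the RKE default pod CIDR.
POD_CIDR = ipaddress.ip_network("10.42.0.0/16")


def classify(address):
    """Return which known network an address belongs to, or 'unknown'."""
    ip = ipaddress.ip_address(address)
    for name, net in NODE_SUBNETS.items():
        if ip in net:
            return f"node subnet ({name})"
    if ip in POD_CIDR:
        return "pod overlay"
    return "unknown"


for pod_ip in ("10.42.0.79", "10.42.2.77", "10.42.4.54"):
    print(pod_ip, "->", classify(pod_ip))
```

If this reading is right, a successful node-to-node ping does not by itself show that the encapsulated UDP traffic passes through the VPN in both directions, so that path may be worth testing separately.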
Result
- On node-1-on-prem Linux shell:
    $ curl http://10.42.4.54:9400/metrics
    curl: (28) Failed to connect to 10.42.4.54 port 9400: Connection timed out
Expected Result
- Successful inter-pod communication and display of GPU metrics
How can I get these pods to communicate properly? Thanks in advance for your support.